Characterization of Temporal and Acoustic Parameters for Speaker Identification in Disguised Speech
Shivani Sharma*1, Sanjay Kumar Jain2 and Rakesh Mohan Sharma3
1Junior Scientific Officer, Central Forensic Science Lab, India
2Director, Central Forensic Science Lab, India
3Professor, Department of Forensic Science, Punjabi University, India
Submission: August 30, 2017; Published: September 08, 2017
*Corresponding author: Shivani Sharma, Junior Scientific Officer, Central Forensic Science Lab., MHA, Chandigarh, India, Tel: 9463495626; Email: shivanisharma.cfsl@gmail.com
How to cite this article: Shivani S, Sanjay K J, Rakesh M S. Characterization of Temporal and Acoustic Parameters for Speaker Identification in Disguised Speech. J Forensic Sci & Criminal Inves. 2017; 5(1): 555651. DOI: 10.19080/JFSCI.2017.05.555651
Abstract
Speaker identification from disguised voice samples is not uncommon, but it poses an intricate problem in the examination of crime cases. Most of the time, the perpetrator is uncooperative for fear of detection and tries to hide his/her identity. In the present study, we have chosen two modes of disguised speech: freestyle and keeping a handkerchief in front of the mouth. Twenty speakers (10 males and 10 females, aged 25-45 years) were selected for the recording of speech samples. A total of 3780 spoken words were subjected to spectrographic analysis and temporal measurements (speech rate, articulatory rate, phonation-time ratio). Acoustic parameters, namely supralaryngeal [formant frequencies (F1, F2, F3)] and laryngeal [Open Quotient (H1*-H2*), degree of glottal opening (H1*-A1) and glottal leakage (B1)], were studied for identifying the speaker from disguised voice samples. Temporal measures are very effective parameters when a speaker is trying to disguise his/her voice. It was found that the phonation-time ratio (P/T) and Open Quotient (OQ) are least affected by disguise as compared to the formant frequencies (F1, F2, F3), degree of glottal opening (H1*-A1) and glottal leakage (B1).
Keywords: Articulatory rate; Phonation-Time ratio; Open Quotient; Supralaryngeal; Laryngeal; Disguised Speech
Introduction
With the fast growth of technology and communication, criminal activities are also increasing rapidly. The latest technology is within everyone's reach, whether for good or evil purposes, and the limitations of the step-by-step system have been felt increasingly. Criminals are well aware of the latest technologies and are always ready to beat the surveillance system. They use telephones, mobile phones and satellite phones to communicate or to make threatening, hoax and ransom calls, and they disguise their voices for fear of identification. In the forensic scenario, speaker identification has reached milestones, with remarkable opinions given in many cases of national and political interest, but speaker identification from disguised voice is still one of the grey areas in this progressive field. Thus, the research work carried out here is concerned with the relatively immature area of speaker identification with disguised voice. Speaker identification from disguised voice is highly sought after in the forensic world, as vocal disguise can potentially modify the acoustic characteristics of an individual's voice.
The nature of a speech sound depends mainly on three factors: the dimensions of the vocal tract, the mode of phonation and the manner of articulation. The vocal tract organs vary with time, health factors and increasing age, which affects the formant values [1,2]. Although every individual has habitual and learned patterns of phonation and articulation, there is small variation in each utterance of the same word or text during normal speech, known as intra-speaker variation. This amount of variability in fundamental voice frequency F0 has little effect on the linguistic interpretation of an utterance and a greater effect on the prosodic features of each utterance [3,4]. However, a person can intentionally change the manner of articulation and mode of phonation during disguise. Reich et al. investigated the effects of a few selected vocal disguises upon spectrographic speaker identification [5]. Reich also examined the ability of naïve and sophisticated listeners to detect the presence of particular disguises with a high degree of accuracy and reliability [6]. The Long-Time Average Spectrum (LTAS) was used by Lindh to identify speakers from a closed set of disguised voices [7]. Hollien and Majewski carried out experiments on normal controlled speech samples, speech under stress and disguised speech using Long-Term Speech Spectra (LTS). The results demonstrated a high level of correct speaker identification for normal speech samples, slightly reduced scores for speech under stress and markedly reduced correct identification for disguised speech samples [8]. Spectrograms are considered a reliable method for speaker identification in normal speech samples [5,9,10].
Previous studies [1,2,5,7,11,12] indicated that disguise indeed increases the intra-speaker variation seen in voiceprints. Many valuable studies have been carried out in this area, but the problem remains unresolved. Therefore, rather than relying on voiceprints alone, a few more parameters have been investigated: temporal measurements (speech rate, articulatory rate, phonation-time (P/T) ratio), supralaryngeal measurements, i.e. formant frequencies (F1, F2, F3), and laryngeal parameters (Open Quotient (H1*-H2*), degree of glottal opening (H1*-A1) and glottal leakage (B1)), which are highly characteristic of an individual.
In the present study, voice samples were taken in three different modes: first in normal speech, and then with the speakers disguising their voices in freestyle mode and in handkerchief-in-front-of-mouth mode. For freestyle disguise, each speaker disguised his/her voice in the manner that he/she felt would conceal his/her identity most effectively, for example by letting the voice taper off or by using a creaky, hoarse or falsetto voice. In the third mode, the speaker held a handkerchief in front of the mouth (covering nose and mouth) while speaking.
Experimental Procedure
The experimental procedure is divided into three parts: selection of speech material and speakers, recording procedure and environment, and analysis of results.
Selection of Speech Material
Fifteen isolated words, which are frequently used in normal conversation and are important in forensic speaker identification, were selected. Using these isolated words, five contextual sentences were framed (Table 1). Each speaker was asked to utter these isolated words and contextual sentences three times. The utterances were recorded in normal mode as well as in the two modes of disguised speech (freestyle and handkerchief in front of mouth).
Selection of Speakers
Twenty speakers (10 males and 10 females, aged 25-45 years) with normal speaking habits were asked to read the selected speech material normally and in the disguised modes (freestyle and handkerchief in front of mouth) as far as possible. It is pertinent to mention that none of the speakers was a professional imitator.
Recording Procedure and Environment
The recordings were made in a sound-treated room of the Central Forensic Science Laboratory (CFSL), Chandigarh. The speech samples were recorded directly on a computer equipped with CUBASE software through a Sennheiser microphone, model HMD25-1 (600 Ω), at normal room temperature. The samples were recorded at a sampling rate of 44.1 kHz with 16-bit quantization. A total of 3780 word files in normal and disguised modes were stored on the computer in the following format:
SXX_M_UU.wav
where S = Gender {Male (M) / Female (F)}
XX = Speaker ID {from 01 to 20}
M = Mode {Normal (0), Freestyle disguise (1), Handkerchief in front of mouth (2)}
UU = Utterance
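The naming scheme above lends itself to mechanical parsing when the word files are grouped by speaker and mode. The following is only a minimal sketch: the names `Recording` and `parse_filename` are hypothetical, and the two-digit utterance field is an assumption read off the `UU` placeholder rather than something the study states.

```python
import re
from typing import NamedTuple

# Mode codes as listed in the naming scheme.
MODES = {"0": "normal", "1": "freestyle", "2": "handkerchief"}

# Assumption: speaker ID and utterance number are both two digits wide.
_PATTERN = re.compile(r"^([MF])(\d{2})_([012])_(\d{2})\.wav$")


class Recording(NamedTuple):
    gender: str       # "M" or "F"
    speaker_id: int   # 1 to 20
    mode: str         # "normal", "freestyle" or "handkerchief"
    utterance: int    # utterance number within the mode


def parse_filename(name: str) -> Recording:
    """Split a file name of the form SXX_M_UU.wav into its fields."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    gender, speaker, mode, utterance = match.groups()
    speaker_id = int(speaker)
    if not 1 <= speaker_id <= 20:
        raise ValueError(f"speaker ID out of range: {speaker_id}")
    return Recording(gender, speaker_id, MODES[mode], int(utterance))
```

With this sketch, a hypothetical file name such as `F07_2_13.wav` would be read as female speaker 7, handkerchief mode, utterance 13.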
These digitized speech samples were downsampled to 22050 Hz for analysis in the Computerized Speech Lab (CSL) Model-4300B, Kay Electronics, USA, available at the Central Forensic Science Laboratory (CFSL), Chandigarh.
Acoustic and Temporal Analysis
All acoustic analyses were made using the Computerized Speech Lab (CSL) Model-4300B and GoldWave software. Spectrographic analyses (wideband and narrowband), pitch contours, energy contours, and supralaryngeal and laryngeal acoustic parameters were determined with the CSL, and temporal measurements were carried out using GoldWave. Spectrograms of the normal and disguised modes were compared.
Results and Discussion
Spectrograms (SPG) are usually a reliable representation of relative vowel quality, strengthening and weakening of stops, frication and aspiration, but there is a great deal of individuality in the length and type of aspiration and frication. Spectrographic analysis was carried out for all the utterances in all three modes, and the values of the first four formants were calculated. The rate of formant transition and the duration of stops vary from one individual to another. Figure 1 shows the spectrograms of the text ‘Hello I am Fine’ in all three modes, i.e. normal, freestyle and handkerchief in front of mouth, in windows A, B and C respectively. The horizontal intense bands made up of vertical striations represent the formants. The frequency of a particular formant varies from speaker to speaker. The third and fourth formants are not clearly visible in the freestyle and handkerchief-in-front-of-mouth disguised modes, whereas they are clearly obtainable in normal speech. It can also be observed that aspiration and frication in the sentence /hello, I am fine/ are reduced in disguised speech. It is pertinent to mention that /f/ is a labiodental fricative. In the word /fine/, frication is followed by a good formant pattern formed by the vowel /i/ and the nasal sound /n/. The formant pattern in the word /fine/ does not alter in disguised mode, but the values of the second and third formant frequencies (F2 and F3) were observed to change in disguised mode as compared to normal-mode utterances. To study the co-articulation effect of the following vowel, the sound /h/ was compared in the words /hello/ and /haan/.
In the word ‘Haan’, /h/ is followed by the long nasal vowel /a/; the vocal tract appears to be completely open (lips in completely open position) without any restriction from the articulators, and the vocal folds keep vibrating after the consonant closure. Thus, the duration of aspiration of /h/ in ‘Haan’ is shorter (0.046 s) than that of /h/ in ‘hello’ (0.064 s), where /h/ is followed by /e/ (slight spreading of the lips) and the slight constriction of the vocal tract appears to increase the duration of aspiration (Figure 2). The contextual sentence ‘Where are you’ was studied spectrographically in all three modes, i.e. normal (window A), freestyle (window B) and handkerchief in front of mouth (window C) (Figure 3). The formant pattern becomes distorted relative to the normal speech sample (window A) when the speaker tries to disguise his/her voice (windows B and C). The positions of the formants shift along both the frequency and time axes. The retroflex sound /r/ of the word ‘are’ is missing in freestyle mode (window B), but the transition from the word ‘are’ to ‘you’ remains unaffected. A similar pattern was observed in other utterances. Although disguise causes variation in spectrogram patterns, many factors still reflect the speaker's characteristics, as it is difficult for a speaker to alter all the parameters simultaneously. Pitch contours were plotted, and it was observed that there is a high probability that a speaker returns to his/her original fundamental frequency while speaking. Energy contours were also plotted, and it was observed that the energy values of the word ‘you’ were approximately similar in all the utterances.
Temporal Measurements
In the present experiment, temporal parameters, namely speech rate, syllables per minute, articulation rate and phonation-time ratio, were calculated for all the utterances of the contextual speech text.
a. Speech rate: The total number of syllables produced in a given speech sample divided by the total time required to produce the sample (including pause time), with the time expressed in seconds. Speech rate was calculated for all twenty speakers in normal, freestyle and handkerchief modes. Bar diagrams of the averaged speech rate for each speaker in all three modes were plotted (Figure 4). The speech rate varies from one speaker to another as well as between the modes of speaking within a speaker. Similar trends were observed for male and female speakers (Table 2).
b. Syllables per minute: The total number of syllables produced in a given speech sample divided by the total time required to produce the sample (including pause time), with the time expressed in minutes. Figure 5 shows the syllable rate per minute averaged over three utterances of each mode for all twenty speakers. Syllable rate varies from speaker to speaker as well as across modes. It was found that a speaker can intentionally modify his/her syllable rate per minute and speech rate and can also increase the pause time between syllables, which agrees with the findings of Kunzel [10] (Table 2).
c. Articulation rate (per minute): The total number of syllables produced in a given speech sample divided by the total time required to produce the sample (excluding pause time). Articulation rate was also calculated for all the contextual speech text for each speaker. The variation in articulation rate between utterances is smaller than that in speech rate and syllables per minute for all the speakers. It ranges approximately from 320 to 500 in normal mode and from 220 to 550 in disguised mode (Table 2). Articulation rate showed a smaller intra-speaker difference between normal and disguised speech samples than speech rate and syllable rate per minute (Figure 6). Thus, articulation rate can be considered a promising parameter for forensic speaker identification.
d. Phonation-time (P/T) ratio: The total duration of effective speech divided by the total time taken to produce the speech sample. The graphical plot of the phonation-time ratio for all three modes (normal, freestyle and handkerchief in front of mouth) shows the inter- and intra-speaker variation of the utterances. The averaged phonation-time ratio (P/T) was calculated for each speaker in the three modes. The value of the ratio is almost the same for a given speaker irrespective of the mode of speaking. Similar results were obtained for male and female speakers (Figure 7). Hence, the phonation-time ratio was least affected by disguise as compared to speech rate, syllables per minute and articulation rate (Table 2). The phonation-time ratio is found to be a good predictor of the fluency of a speaker, by which the speaker can be characterized. These results agree with the earlier studies of Lennon (1990) and Towell et al. (1996) [13,14].
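The four temporal measures defined above reduce to simple ratios of a syllable count, the total duration and the pause duration. The sketch below illustrates those definitions only; the names `TemporalMeasures` and `temporal_measures` are hypothetical, and treating "effective speech" as total time minus pause time is an assumption, since the study does not spell out how the durations were segmented in GoldWave.

```python
from dataclasses import dataclass


@dataclass
class TemporalMeasures:
    speech_rate: float           # syllables per second, pauses included
    syllables_per_minute: float  # syllables per minute, pauses included
    articulation_rate: float     # syllables per minute, pauses excluded
    phonation_time_ratio: float  # effective speech time / total time


def temporal_measures(n_syllables: int, total_s: float, pause_s: float) -> TemporalMeasures:
    """Compute the four measures from durations given in seconds."""
    if total_s <= 0 or not 0 <= pause_s < total_s:
        raise ValueError("need total_s > 0 and 0 <= pause_s < total_s")
    # Assumption: effective speech time is everything that is not pause.
    speaking_s = total_s - pause_s
    return TemporalMeasures(
        speech_rate=n_syllables / total_s,
        syllables_per_minute=60.0 * n_syllables / total_s,
        articulation_rate=60.0 * n_syllables / speaking_s,
        phonation_time_ratio=speaking_s / total_s,
    )
```

As an illustration with made-up numbers, 12 syllables spoken in 4.0 s with 1.0 s of pauses give a speech rate of 3 syllables per second, 180 syllables per minute, an articulation rate of 240 per minute and a P/T ratio of 0.75. Lengthening the pauses lowers the first two measures and the P/T ratio but leaves the articulation rate unchanged.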
Supralaryngeal Acoustic Parameter
A speech sound created solely by vocal fold vibration as the
sound source has its phonetic content mainly determined
by the first three formants, F1, F2 and F3, and the relative distances
between them. These are determined by the manner and position
of articulation, so the structure of the formant frequencies
varies with each sound. We calculated the first four formant
frequencies (F1, F2, F3 & F4) for the isolated and contextual text
from the FFT and LPC spectra for all male and female speakers.
In disguised speech samples, the higher formant frequencies
(F3 & F4) show larger variation than the lower formant
frequencies (F1 & F2). It was also observed that the second
formant frequency (F2) is affected more than the first
formant frequency (F1). Some speakers have a tendency to shift
the second formant frequency closer to the third formant frequency
(Figure 8). Standard deviation was calculated to study the
deviation in the four formant frequency values for all twenty
speakers in normal and disguised modes. The values of the first
four formants were averaged over all utterances for
each speaker for the selected text. The values of standard
deviation were calculated for all twenty speakers (Table 3). It
was observed that the fourth formant frequency is the most affected
parameter, whereas the first formant frequency is the least affected,
in disguised speech as compared with normal speech.
The order of variation was found to be: F1 < F2 < F3 < F4.
Laryngeal Acoustic Parameter
In speech production, the larynx acts as a phonatory mechanism,
transforming the airflow from the respiratory system into
waveforms. Phonation types refer to the activity of the larynx
and can be considered as laryngeal voice quality. Voice quality
consists of physically induced voice characteristics and vocal
settings, both of which make use of similar acoustic parameters,
and a speaker's physical characteristics lie in his/her voice quality.
Acoustic measurement of laryngeal voice quality was first
introduced by Fischer-Jørgensen in 1967 [12,15]. In addition to
the supralaryngeal acoustic parameters, we attempted to
study laryngeal acoustic parameters, namely Open Quotient
(H1*-H2*), degree of glottal opening (H1*-A1) and glottal leakage
(B1), on the selected text. The calculated values of each parameter
were plotted to distinguish between different speakers even
when they are trying to disguise their voices.
a. Open quotient (OQ): Open quotient (OQ) can be
defined as the part of the glottal cycle during which the
glottis is open. In modal phonation the vocal folds are
open during approximately half of each glottal cycle and closed
for the other half. The durations for which the vocal folds are open
and closed change during disguise, and the value of the open
quotient also depends on the type of disguise. Figures 9(a) &
(b) show the plots of open quotient for female and male
speakers respectively. It is clear from the figures that every
speaker has an individualistic pattern, which is independent of
the type of disguise and the type of text spoken. The relationship
between the OQ and the harmonics of a periodic signal
is evident from Fourier analysis: as the OQ increases, the
amplitudes of the higher harmonics decrease. Male speakers
have a relatively shorter pulse than female speakers and hence a
smaller OQ, such that the higher harmonics remain relatively
strong, as opposed to the relatively weak higher harmonics of
female speakers [15-17]. Hence, OQ can be considered a significant
parameter for differentiating speakers.
b. Glottal leakage (B1): Glottal leakage (B1) may be
defined as the bandwidth of the first formant. Speakers can
shift their formants during disguise. The formant frequencies
can increase or decrease depending upon the type of
disguise. The first formant bandwidth varies with the type of
disguise as well as the type of text spoken, but for some
speakers the variation is very small, in the range
0-5 Hz, across the disguised and normal modes of speaking.
Further, the value of glottal leakage ranges over 350-600 Hz for
female speakers and 250-450 Hz for male speakers.
The range of variation differs from speaker to
speaker (Figure 10). It was observed that
the glottal leakage is quite similar in the normal and disguised
modes for some speakers, and this can be used to
differentiate among speakers.
c. Degree of glottal opening (H1*-A1): Degree of glottal
opening (H1*-A1) primarily reflects the laryngeal voice
quality. Vocal fold abduction is largely a
function of posterior cricoarytenoid muscle action, and the
opening of the glottis is usually greater in voiceless mode than
in any other mode used in speech. Degree of glottal opening
(H1*-A1) is mainly characteristic of the type of disguise. The
values of degree of glottal opening vary widely between the
types of disguise and normal mode, but the variation is smaller
among the speakers. For example, whisper requires far greater
constriction than the voiceless setting of the glottis, and in
breathy voice normal vocal fold vibration is accompanied
by some continuous turbulent airflow. In both cases glottal
closure is incomplete [16-18]. The value of degree of glottal
opening was found to be greater for female speakers than
for male speakers (Figure 11).
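The link between open quotient and harmonic amplitudes described in item a can be illustrated by measuring the level difference between the first two harmonics of a voiced frame. The sketch below is illustrative only and is not the procedure used in the CSL: it computes an uncorrected H1-H2, whereas the asterisks in H1*-H2* conventionally denote amplitudes corrected for formant influence, a correction not reproduced here. It also assumes that the fundamental frequency is already known and that the frame spans a whole number of pitch periods. The function names are hypothetical.

```python
import cmath
import math
from typing import Sequence


def harmonic_amplitude(frame: Sequence[float], fs: float, freq: float) -> float:
    """Amplitude of the sinusoidal component at freq (Hz), via a single-frequency DFT."""
    n = len(frame)
    if n == 0:
        raise ValueError("empty frame")
    acc = sum(x * cmath.exp(-2j * math.pi * freq * k / fs) for k, x in enumerate(frame))
    return 2.0 * abs(acc) / n


def h1_minus_h2_db(frame: Sequence[float], fs: float, f0: float) -> float:
    """Uncorrected H1-H2 in dB: level of the first harmonic relative to the second."""
    h1 = harmonic_amplitude(frame, fs, f0)
    h2 = harmonic_amplitude(frame, fs, 2.0 * f0)
    if h1 <= 0.0 or h2 <= 0.0:
        raise ValueError("harmonic amplitude is zero; cannot form a dB ratio")
    return 20.0 * math.log10(h1 / h2)
```

For a synthetic frame whose second harmonic has half the amplitude of the first, the function returns about 6.02 dB. A larger value means relatively weaker higher harmonics, which the text above associates with a larger open quotient.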
Conclusion
The present study reveals that each parameter
investigated carries speaker-specific information, but the parameters can be
ranked according to how well they show greater inter-speaker
differences and smaller intra-speaker differences in normal
as well as disguised modes. Spectrographic analysis reveals
good information about the intonation and formant pattern in
normal mode, but these could be changed in the disguised mode of
speaking. Among all the temporal measures, the Phonation-Time
Ratio (P/T) was found to be the most effective parameter. The first four formant
frequencies (F1, F2, F3 & F4) were also calculated for the
isolated text extracted from contextual speech. These are
ranked statistically in the order F1 < F2 < F3 < F4.
Acknowledgement
We would also like to express our appreciation to all our
volunteer speakers for providing their speech samples for
experimentation.
References
- Atkinson JE (1976) Inter- and Intra-Speaker Variability in Fundamental Voice Frequency. J Acoustical Society America 60(2): 440-446.
- Clark JE, Yallop C (1995) An Introduction to Phonetics and Phonology. Wiley-Blackwell, Washington, D.C., USA.
- Endres W, Bambach W, Flosser G (1971) Voice Spectrograms as a Function of Age, Voice Disguise & Voice Imitation. J Acoustical Society America 49(6): 1842-1848.
- Hollien H, Majewski W (1977) Speaker Identification by Long-Term Spectra under Normal and Distorted Speech Conditions. J Acoustical Society America 62(4): 975-980.
- Jessen M (1997) Speaker-Specific Information in Voice Quality Parameters. Forensic Linguistics 4(1).
- Kersta LG (1962) Voice Print Identification. Nature 196(4861).
- Kunzel HJ (1997) Some General Phonetic and Forensic Aspects of Speaking Tempo. Forensic Linguistics 4(1).
- Lindh J (2005) Visual acoustics vs. aural perceptual speaker identification in a closed set of disguised voices. Proceedings, FONETIK.
- Reich AR, Moll KL, Curtis JF (1976) Effects of selected vocal disguises upon spectrographic speaker identification. J Acoustical Society America 60(4): 1023-1028.
- Reich AR (1981) Detecting the presence of vocal disguise in the male voice. J Acoustical Society America 69(5).
- Rodman RD. Speaker recognition of disguised voices: A program for research.
- Rose P (2002) Forensic Speaker Identification. Taylor & Francis, London and New York.
- Zhang C, Weijer JV, Cui J (2004) Intra- and Inter-Speaker Variations of Formant Pattern for Lateral Syllables in Standard Chinese. Forensic Science International 158(6): 117-124.
- Lennon P (1990) Investigating fluency in EFL: A quantitative approach. Language Learning 40(3).
- Towell R, Hawkins R, Bazergui N (1996) The development of fluency in advanced learners of French. Applied Linguistics 17(1).
- Davis SB (1978) Acoustic Characteristics of Normal and Pathological Voices.
- http://www.icp.inpg.fr/~henrich/communications/199
- Fischer-Jørgensen E (1967) Phonetic Analysis of Breathy (Murmured) Vowels in Gujarati. Indian Linguistics 28.