Open Access Journal of Surgery (OAJS)

Characterization of Temporal and Acoustic Parameters for Speaker Identification in Disguised Speech

**Shivani Sharma*¹, Sanjay Kumar Jain² and Rakesh Mohan Sharma³**

¹Junior Scientific Officer, Central Forensic Science Lab, India

²Director, Central Forensic Science Lab, India

³Professor, Department of Forensic Science, Punjabi University, India

Submission: August 30, 2017; Published: September 08, 2017

*Corresponding author: Shivani Sharma, Junior Scientific Officer, Central Forensic Science Lab., MHA, Chandigarh, India, Tel: 9463495626; Email: shivanisharma.cfsl@gmail.com

How to cite this article: Shivani S, Sanjay K J, Rakesh M S. Characterization of Temporal and Acoustic Parameters for Speaker Identification in Disguised Speech. J Forensic Sci & Criminal Inves. 2017; 5(1): 555651. DOI: 10.19080/JFSCI.2017.05.555651

Abstract

Identifying a speaker from a disguised voice sample is a common requirement, but it poses an intricate problem in the examination of crime cases. The perpetrator is usually uncooperative for fear of detection and tries to hide his/her identity. In the present study, two modes of disguised speech were chosen: freestyle, and speaking with a handkerchief held over the mouth. Twenty speakers (10 males and 10 females, aged 25-45 years) were selected for the recording of speech samples. A total of 3780 spoken words were subjected to spectrographic analysis and temporal measurement (speech rate, articulatory rate and phonation-time ratio). Supralaryngeal acoustic parameters [formant frequencies (F1, F2, F3)] and laryngeal acoustic parameters [open quotient (H1*-H2*), degree of glottal opening (H1*-A1) and glottal leakage (B1)] were studied for identifying the speaker from disguised voice samples. Temporal measures are very effective parameters when a speaker is trying to disguise his/her voice. Phonation-time ratio (P/T) and open quotient (OQ) were found to be the least affected by disguise, compared with formant frequencies (F1, F2, F3), degree of glottal opening (H1*-A1) and glottal leakage (B1).

Keywords: Articulatory rate; Phonation-time ratio; Open quotient; Supralaryngeal; Laryngeal; Disguised speech

Introduction

With the rapid growth of technology and communication, criminal activity is also increasing rapidly. The latest technology is within everyone's reach, for good or for evil, and the limitations of the step-by-step system are increasingly felt. Criminals are well aware of the latest technologies and are always ready to beat the surveillance system. They use telephones, mobile phones and satellite phones to communicate or to make threatening, hoax and ransom calls, and they disguise their voices for fear of identification. In the forensic scenario, speaker identification has reached several milestones, with notable opinions given in many cases of national and political interest, but speaker identification from disguised voice remains one of the grey areas in this progressive field. The research work reported here is therefore concerned with the relatively immature area of speaker identification from disguised voice. This capability is highly sought after in the forensic world because vocal disguise can modify the acoustic characteristics of an individual's voice.

The nature of a speech sound depends mainly on three factors: the dimensions of the vocal tract, the mode of phonation and the manner of articulation. The vocal tract organs vary with time, health and increasing age, which affects the formant values [1,2]. Although every individual has habitual and learned patterns of phonation and articulation, each utterance of the same word or text during normal speech varies slightly; this is known as intra-speaker variation. This amount of variability in the fundamental voice frequency F0 has little effect on the linguistic interpretation of an utterance and a greater effect on the prosodic features of each utterance [3,4]. However, a person can intentionally change the manner of articulation and mode of phonation during disguise. Reich et al. investigated the effects of a few selected vocal disguises upon spectrographic speaker identification [5]. They also examined the ability of naïve and sophisticated listeners to detect the presence of particular disguises with a high degree of accuracy and reliability [6]. The Long Time Average Spectrum (LTAS) was used by Lindh to identify speakers from a closed set of disguised voices [7]. Hollien and Majewski carried out experiments using Long Term Speech Spectra (LTS) to study speech samples under normal controlled, stressed and disguised conditions. The results demonstrated a high level of correct speaker identification for normal speech samples, slightly reduced scores for speech under stress and markedly reduced correct identification for disguised speech samples [8]. Spectrograms are considered a reliable method for speaker identification in normal speech samples [5,9,10].

Previous studies [1,2,5,7,11,12] indicated that disguise does increase the intra-speaker variation seen in voiceprints. Many valuable studies have been carried out in this area, but the problem remains unsolved. Therefore, rather than relying on voiceprints alone, a few more parameters that are highly characteristic of an individual have been investigated: temporal measurements (speech rate, articulatory rate and phonation-time (P/T) ratio), supralaryngeal measurements, i.e. formant frequencies (F1, F2, F3), and laryngeal parameters (open quotient (H1*-H2*), degree of glottal opening (H1*-A1) and glottal leakage (B1)).

In the present study, voice samples were taken in three different modes: first in normal speech, and then with the speakers disguising their voices in freestyle mode and in handkerchief in front of mouth mode. For freestyle disguise, each speaker disguised his/her voice in the manner he/she felt would conceal his/her identity most effectively. Speakers made their voice taper off, or adopted a creaky, hoarse or falsetto voice. In the third mode, the speaker held a handkerchief in front of the mouth (covering nose and mouth) while speaking.

Experimental Procedure

The experimental procedure is divided into three parts: selection of speech material and speakers, recording procedure and environment, and analysis of results.

Selection of Speech Material

Fifteen isolated words that are frequently used in normal conversation and are important in forensic speaker identification were selected. Using these isolated words, five contextual sentences were then framed (Table 1). Each speaker was asked to utter the isolated words and contextual sentences three times. These utterances were recorded in normal mode as well as in two modes of disguised speech (freestyle and handkerchief in front of mouth).

Selection of Speakers

Twenty speakers (10 males and 10 females, aged 25-45 years) with normal speaking habits were asked to read the selected speech material normally and, as far as possible, in the disguised modes (freestyle and handkerchief in front of mouth). It is pertinent to mention that none of the speakers was a professional imitator.

Recording Procedure and Environment

The recordings were made in a sound-treated room of the Central Forensic Science Laboratory (CFSL), Chandigarh. The speech samples were recorded directly on a computer equipped with CUBASE software through a Sennheiser microphone, model HMD25-1 600Ω, at normal room temperature. The speech samples were recorded at a sampling rate of 44.1 kHz with 16-bit quantization. A total of 3780 word files in normal and disguised modes were stored on the computer in the following format:

SXX_M_UU.wav

Where, S = Gender {Male (M) / Female (F)}

XX = Speaker ID {from 01 to 20}

M = Mode {Normal (0), Disguise Freestyle (1) and Putting Handkerchief on Mouth (2)}

UU = Utterance
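The naming scheme can be decoded mechanically. The following is a minimal sketch, not the authors' tooling: the function name `parse_recording_name` and the `Recording` type are hypothetical, and it assumes that UU is a two-digit utterance number, which the scheme itself does not state.

```python
import re
from typing import NamedTuple

# Mode codes as listed in the naming scheme.
MODES = {0: "normal", 1: "freestyle disguise", 2: "handkerchief on mouth"}


class Recording(NamedTuple):
    gender: str       # "M" or "F"
    speaker_id: int   # 1 to 20
    mode: str         # human-readable mode name
    utterance: int    # utterance number (assumed to be two digits in the name)


# Assumed layout: gender letter, two-digit speaker ID, one mode digit,
# two-digit utterance number, ".wav" extension.
_PATTERN = re.compile(r"^([MF])(\d{2})_([012])_(\d{2})\.wav$")


def parse_recording_name(filename: str) -> Recording:
    """Split a file name such as 'F07_2_15.wav' into its labelled parts."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a valid recording name: {filename!r}")
    gender, speaker, mode, utterance = match.groups()
    speaker_id = int(speaker)
    if not 1 <= speaker_id <= 20:
        raise ValueError(f"speaker ID out of range: {speaker_id}")
    return Recording(gender, speaker_id, MODES[int(mode)], int(utterance))


if __name__ == "__main__":
    print(parse_recording_name("F07_2_15.wav"))
```

Grouping files by the decoded speaker ID and mode would make it straightforward to compare each speaker's normal and disguised utterances of the same text.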

These digitized speech samples were downsampled to 22,050 Hz for analysis in the Computerized Speech Lab (CSL) Model-4300B, Kay Electronics, USA, available at the Central Forensic Science Laboratory (CFSL), Chandigarh.
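Going from 44.1 kHz to 22,050 Hz halves the sampling rate, so content above the new Nyquist frequency of 11,025 Hz has to be removed before every second sample is discarded. The paper does not say how the conversion was performed; the sketch below is only one possible approach, using a Hamming-windowed sinc low-pass filter, with hypothetical function names and an arbitrarily chosen filter length.

```python
import math


def halfband_lowpass(num_taps: int = 63) -> list[float]:
    """Hamming-windowed sinc low-pass with cutoff at a quarter of the input rate.

    For a 44.1 kHz input the cutoff is 11,025 Hz, the Nyquist frequency of
    the 22,050 Hz output. num_taps should be odd so the filter has a centre tap.
    """
    mid = (num_taps - 1) // 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        # Ideal low-pass impulse response for a cutoff of 0.25 cycles per sample.
        ideal = 0.5 if k == 0 else math.sin(math.pi * k / 2) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]  # normalise to unity gain at 0 Hz


def downsample_by_two(samples: list[float], taps: list[float]) -> list[float]:
    """Low-pass filter the signal and keep every second filtered sample."""
    half = len(taps) // 2
    out = []
    for center in range(0, len(samples), 2):
        acc = 0.0
        for j, tap in enumerate(taps):
            idx = center + j - half
            if 0 <= idx < len(samples):  # treat samples outside the signal as zero
                acc += tap * samples[idx]
        out.append(acc)
    return out


if __name__ == "__main__":
    fs = 44100
    tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(fs // 10)]
    reduced = downsample_by_two(tone, halfband_lowpass())
    print(len(tone), "samples at 44100 Hz ->", len(reduced), "samples at 22050 Hz")
```

At 22,050 Hz the usable bandwidth still covers the first four formants of adult speech, which is what the later formant analysis relies on.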

Acoustic and Temporal Analysis

All acoustic analyses were made using the Computerized Speech Lab (CSL) Model-4300B and GoldWave software. Spectrographic analyses (wideband and narrowband), pitch contours, energy contours, and supralaryngeal and laryngeal acoustic parameters were obtained with the CSL, and temporal measurements were carried out using GoldWave. Spectrograms of the normal and disguised modes were compared.

Results and Discussion

Spectrograms (SPG) usually give a reliable representation of relative vowel quality, the strengthening and weakening of stops, frication and aspiration, but there is a great deal of individuality in the length and type of aspiration and frication. Spectrographic analysis was carried out for all the utterances in all three modes, and the values of the first four formants were calculated. The rate of formant transition and the duration of stops vary from one individual to another. Figure 1 shows the spectrograms of the text ‘Hello I am fine’ in all three modes, i.e. normal, freestyle and handkerchief in front of mouth, in windows A, B and C respectively. The horizontal intense bands made up of vertical striations represent the formants. The frequency of a particular formant varies from speaker to speaker. The third and fourth formants are not clearly visible in the freestyle and handkerchief in front of mouth disguise modes, whereas they are clearly obtainable in normal speech. It can also be observed that aspiration and frication in the sentence /hello, I am fine/ are reduced in disguised speech. It is pertinent to mention that /f/ is a labiodental fricative. In the word /fine/, frication is followed by a good formant pattern formed by the vowel /i/ and the nasal sound /n/. The formant pattern in the word /fine/ does not alter in disguise mode, but the values of the second and third formant frequencies (F2 and F3) were observed to change in disguised mode compared with normal-mode utterances. To study the co-articulation effect of the following vowel, the sound /h/ was compared in the words /HELLO/ and /HAAN/.

In the word ‘Haan’, /h/ is followed by a long nasal /a/ vowel sound; the vocal tract appears to be completely open (lips in a completely open position) without any restriction from the articulators, and the vocal folds keep vibrating after the consonant closure. Thus, the duration of aspiration of /h/ in ‘Haan’ is shorter (0.046 sec) than that of /h/ in ‘hello’ (0.064 sec), where /h/ is followed by /e/ (slight spreading of the lips) and the slight constriction of the vocal tract appears to increase the duration of aspiration (Figure 2). The contextual speech sentence ‘Where are you’ was studied spectrographically in all three modes, i.e. normal (window A), freestyle (window B) and handkerchief in front of mouth (window C) (Figure 3). The formant pattern becomes distorted relative to the normal speech sample (window A) when the speaker tries to disguise (windows B and C). The positions of the formants shift along both the frequency and time axes. The retroflex sound /r/ of ‘are’ is missing in freestyle mode (window B), but the transition from the word ‘are’ to ‘you’ remains unaffected. A similar pattern was observed in other utterances. Although disguise causes variation in spectrogram patterns, many factors still reflect the speaker's characteristics, as it was difficult for the speaker to alter all the parameters simultaneously. Pitch contours were plotted, and it was observed that there is a high probability that a speaker reverts to his/her original fundamental frequency while speaking. Energy contours were also plotted, and the energy values of the word ‘you’ were approximately similar in all the utterances.

Temporal Measurements

In the present experiment, the temporal parameters speech rate, syllables per minute, articulation rate and phonation-time ratio were calculated for all utterances of the contextual speech text.

a. Speech rate: Total number of syllables produced in a given speech sample divided by the total time required to produce the sample (including pause time), expressed in seconds. Speech rate was calculated for all twenty speakers in normal, freestyle and handkerchief modes. Bar diagrams of the averaged speech rate for each speaker in all three modes have been plotted (Figure 4). The speech rate varies from one speaker to another as well as between the modes of speaking within a speaker. Similar trends were observed for male and female speakers (Table 2).

b. Syllables per minute: Total number of syllables produced in a given speech sample divided by the total time required to produce the sample (including pause time), expressed in minutes. Figure 5 shows the syllable rate per minute averaged over the three utterances of each mode for all twenty speakers. Syllable rate varies from speaker to speaker as well as between modes. It was found that a speaker can intentionally modify his/her syllable rate per minute and speech rate, and can also increase the pause time between syllables, which is in agreement with the findings of Kunzel [10] (Table 2).

c. Articulation rate (per minute): Total number of syllables produced in a given speech sample divided by the total time required to produce the sample (excluding pause time). Articulation rate was also calculated for all the contextual speech text for each speaker. The variation in articulation rate between utterances is smaller than that in speech rate and syllables per minute for all the speakers. It ranges from approximately 320 to 500 in normal mode and from approximately 220 to 550 in disguised mode (Table 2). Articulation rate showed a smaller intra-speaker difference between normal and disguised speech samples than speech rate and syllable rate per minute (Figure 6). Thus, articulation rate can be considered a promising parameter for forensic speaker identification.

d. Phonation-time (P/T) ratio: Total duration of effective speech divided by the total time taken to produce the speech sample. The graphical plot of phonation-time ratio for all three modes (normal, freestyle and handkerchief in front of mouth) shows the inter- and intra-speaker variation of the utterances. The averaged phonation-time ratio (P/T) was calculated for each speaker in the three modes. The value of the phonation-time ratio is almost the same for a given speaker irrespective of the mode of speaking. Similar results were obtained for both male and female speakers (Figure 7). Hence, phonation-time ratio was the least affected by the change from normal to disguised mode, compared with speech rate, syllables per minute and articulation rate (Table 2). The phonation-time ratio is found to be a good predictor of the fluency of a speaker, by which the speaker can be characterized. These results are in agreement with the earlier studies of Lennon, 1990 and Towell et al., 1996 [13,14].
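The four definitions above reduce to simple ratios. The sketch below is illustrative only: the function and field names are hypothetical, and it assumes that the duration of effective speech equals the total time minus the pause time, which is the natural reading of the definitions but is not stated explicitly.

```python
from dataclasses import dataclass


@dataclass
class TemporalMeasures:
    speech_rate: float           # syllables per second, pauses included
    syllables_per_minute: float  # syllables per minute, pauses included
    articulation_rate: float     # syllables per minute, pauses excluded
    phonation_time_ratio: float  # effective speech time / total time


def temporal_measures(syllables: int, total_time_s: float,
                      pause_time_s: float) -> TemporalMeasures:
    """Compute the four temporal measures for one utterance."""
    if total_time_s <= 0 or not 0 <= pause_time_s < total_time_s:
        raise ValueError("need 0 <= pause time < total time")
    # Assumption: effective speech (phonation) time = total time - pause time.
    phonation_s = total_time_s - pause_time_s
    return TemporalMeasures(
        speech_rate=syllables / total_time_s,
        syllables_per_minute=syllables / (total_time_s / 60.0),
        articulation_rate=syllables / (phonation_s / 60.0),
        phonation_time_ratio=phonation_s / total_time_s,
    )


if __name__ == "__main__":
    # Made-up utterance: 12 syllables in 3.0 s, of which 0.6 s is pause.
    print(temporal_measures(12, 3.0, 0.6))
```

Speech rate and syllables per minute differ only in the unit of time, whereas articulation rate and P/T ratio both depend on how pauses are separated from speech, so the pause-detection criterion would need to be applied consistently across modes.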

Supralaryngeal Acoustic Parameter

A speech sound created solely with vocal fold vibration as the sound source has its phonetic content determined mainly by the first three formants, F1, F2 and F3, and the relative distances between them. These are determined by the manner and position of articulation, so the structure of the formant frequencies varies with each sound. We calculated the first four formant frequencies (F1, F2, F3 and F4) for the isolated and contextual text from the FFT and LPC spectra for all male and female speakers. In disguised speech samples, the higher formant frequencies (F3 and F4) show larger variation than the lower formant frequencies (F1 and F2). It was also observed that the second formant frequency (F2) is affected more than the first formant frequency (F1). Some speakers have a tendency to shift the second formant frequency closer to the third formant frequency (Figure 8). The values of the first four formants were averaged over all utterances of the selected text for each speaker, and the standard deviation was calculated to study the deviation of the four formant frequencies for all twenty speakers in normal and disguised modes (Table 3). It was observed that the fourth formant frequency is the most affected parameter, whereas the first formant frequency is the least affected, in disguised speech compared with normal speech. The order of variation was found to be: F1 < F2 < F3 < F4.
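The standard-deviation comparison can be illustrated as follows. This is a sketch under stated assumptions, not the authors' procedure: the paper does not say whether the deviation was taken across modes or across utterances, so the example takes it across one speaker's averaged values in the three modes. The function names are hypothetical and the formant values are made-up numbers chosen only to show the calculation.

```python
from statistics import stdev


def formant_spread(measurements: dict[str, list[float]]) -> dict[str, float]:
    """Sample standard deviation (Hz) of each formant across its measurements."""
    return {name: stdev(values) for name, values in measurements.items()}


def rank_by_spread(spread: dict[str, float]) -> list[str]:
    """Formant names ordered from least to most variable."""
    return sorted(spread, key=spread.get)


if __name__ == "__main__":
    # Hypothetical averaged values (Hz) for one speaker in the order
    # normal, freestyle, handkerchief. These are NOT data from the study.
    speaker = {
        "F1": [500.0, 520.0, 510.0],
        "F2": [1500.0, 1580.0, 1450.0],
        "F3": [2500.0, 2700.0, 2400.0],
        "F4": [3500.0, 3900.0, 3300.0],
    }
    spread = formant_spread(speaker)
    print(" < ".join(rank_by_spread(spread)))
```

Because higher formants have larger absolute values, a relative measure such as the coefficient of variation might also be worth reporting alongside the standard deviation in Hz.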

Laryngeal Acoustic Parameter

In speech production, the larynx acts as a phonatory mechanism, transforming the airflow from the respiratory system into waveforms. Phonation types refer to the activity of the larynx and can be considered laryngeal voice quality. Voice quality consists of physically induced voice characteristics and vocal settings, both of which are reflected in similar acoustic parameters of a speaker's voice. Acoustic measurement of laryngeal voice quality was first introduced by Fischer-Jørgensen in 1967 [12,15]. In addition to the supralaryngeal acoustic parameters, we have attempted to study laryngeal acoustic parameters, namely open quotient (H1*-H2*), degree of glottal opening (H1*-A1) and glottal leakage (B1), on the selected text. The calculated values of each parameter have been plotted to distinguish between different speakers even when they are trying to disguise their voices.

a. Open quotient (OQ): Open quotient (OQ) can be defined as the part of the glottal cycle during which the glottis is open. In modal phonation, the vocal folds are open for approximately half of each glottal cycle and closed for the other half. The durations for which the vocal folds are open and closed change during disguise, and the value of the open quotient also depends on the type of disguise. Figures 9(a) and (b) show the plots of open quotient for female and male speakers respectively. It is clear from the figures that every speaker has an individual pattern, which is independent of the type of disguise and the type of text spoken. The relationship between the OQ and the harmonics of a periodic signal is evident from Fourier analysis: as the OQ increases, the amplitudes of the higher harmonics decrease. Male speakers have a relatively shorter pulse than female speakers and hence a smaller OQ, so their higher harmonics remain relatively strong, as opposed to the relatively weak higher harmonics of women [15-17]. Hence, OQ can be considered a significant parameter for differentiating speakers.

b. Glottal leakage (B1): Glottal leakage (B1) may be defined as the bandwidth of the first formant. Speakers can shift their formants during disguise, and the formant frequencies can increase or decrease depending upon the type of disguise. The first formant bandwidth varies with the type of disguise as well as the type of text spoken, but for some speakers the variation between disguised and normal modes of speaking is very small, in the range 0-5 Hz. The value of glottal leakage ranges over 350-600 Hz for female speakers and 250-450 Hz for male speakers. The range of variation differs from speaker to speaker (Figure 10). It was observed that glottal leakage is quite similar in normal and disguised modes for some speakers, and this can be used to differentiate among speakers.

c. Degree of glottal opening (H1*-A1): Degree of glottal opening (H1*-A1) primarily reflects the laryngeal voice quality. Vocal fold abduction is largely a function of posterior cricoarytenoid muscle action, and the opening of the glottis is usually greater in voiceless mode than in any other mode used in speech. Degree of glottal opening (H1*-A1) is mainly characteristic of the type of disguise. Its values vary widely between the types of disguise and normal mode, but the variation among speakers is small. For example, whisper requires far greater constriction than the voiceless setting of the glottis, and in breathy voice normal vocal fold vibration is accompanied by some continuous turbulent airflow. In both cases glottal closure is incomplete [16-18]. The value of degree of glottal opening was found to be greater for female speakers than for male speakers (Figure 11).
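The two harmonic-based measures can be sketched as simple differences of spectral amplitudes in dB. The example below is a simplified illustration with hypothetical names: it computes the uncorrected differences H1-H2 and H1-A1, omits the formant corrections denoted by the asterisks in H1*-H2* and H1*-A1, and approximates A1 by the harmonic nearest to F1 rather than by the strongest harmonic in the F1 region. The amplitudes used are made-up values.

```python
def harmonic_measures(harmonics_db: list[float], f0_hz: float,
                      f1_hz: float) -> tuple[float, float]:
    """Return (H1-H2, H1-A1) in dB, without formant correction.

    harmonics_db[k] is the amplitude in dB of harmonic number k + 1,
    i.e. the spectral component at (k + 1) * f0_hz.
    """
    if len(harmonics_db) < 2:
        raise ValueError("need at least the first two harmonics")
    if f0_hz <= 0 or f1_hz <= 0:
        raise ValueError("frequencies must be positive")
    h1, h2 = harmonics_db[0], harmonics_db[1]
    # Simplification: take A1 as the amplitude of the harmonic closest to F1.
    nearest = round(f1_hz / f0_hz)
    nearest = min(max(nearest, 1), len(harmonics_db))
    a1 = harmonics_db[nearest - 1]
    return h1 - h2, h1 - a1


if __name__ == "__main__":
    # Made-up harmonic amplitudes (dB) for a voice with F0 = 200 Hz, F1 = 600 Hz.
    h1_h2, h1_a1 = harmonic_measures([60.0, 54.0, 58.0, 40.0], 200.0, 600.0)
    print(f"H1-H2 = {h1_h2:.1f} dB, H1-A1 = {h1_a1:.1f} dB")
```

A larger H1-H2 is generally associated with a larger open quotient, and a larger H1-A1 with a wider first-formant bandwidth, which is why the three laryngeal parameters tend to be examined together.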

Conclusion

The present study reveals that each investigated parameter carries speaker-specific information, but the parameters can be ranked by how well they show large inter-speaker differences and small intra-speaker differences in normal as well as disguised modes. Spectrographic analysis gives good information about intonation and formant pattern in normal mode, but these can change in the disguised mode of speaking. Among all the temporal measures, phonation-time ratio (P/T) was found to be the most useful parameter. The first four formant frequencies (F1, F2, F3 and F4) were also calculated for the isolated text extracted from contextual speech. These are ranked statistically in the order F1 < F2 < F3 < F4.