Journal of Otolaryngology | Open Access Journals

Review Article

Prosody: An Overview and Applications to Voice Therapy

Daniel J McCabe*¹ and Kenneth W Altman²

¹Department of Otolaryngology, The Icahn School of Medicine at Mount Sinai, USA

²Department of Otolaryngology, Baylor College of Medicine, USA

Submission: April 24, 2017; Published: May 09, 2017

*Corresponding author: Daniel J McCabe, Department of Otolaryngology, Clinical Vocologist, Eugen Grabscheid Voice Center, The Icahn School of Medicine, One Gustave L Levy Place, Box 1189, New York, NY 10029, USA, Tel: 585-943-2117; Email: daniel.mccabe@mountsinai.org

How to cite this article: McCabe DJ, Altman KW. Prosody: An Overview and Applications to Voice Therapy. Glob J Oto 2017; 7(4): 555719. DOI: 10.19080/GJO.2017.07.555719

Abstract

Purpose: Prosody is the variation of dynamics, frequency, rate, and tone that carries contextual meaning in speech. The purpose of this study is to present an overview of prosody in speech from evolutionary, developmental and cultural perspectives, and to review implications in therapy for voice disorders.

Methods: Cross-disciplinary overview.

Results: Applications of prosody from linguistics, animal behavior, voice, music, and psychology were reviewed. Areas covered include evolutionary development of prosody in the form of animal models, linguistic and emotional overlays, social and pragmatic constructs, melody and prosody, conflicts among the drivers of prosody, and the use of prosody in voice therapy. Linguistic, emotional, and pragmatic values are all manifest in prosodic variations in voice and speech. Introducing prosody in voice rehabilitation serves a number of important functions, and clinical applications are presented.

Conclusion: Prosody provides layers of meaning beyond the word. It represents linguistic, emotional, and social elements that cannot easily be expressed through semantics. Disturbances in the production or apprehension of prosodic meaning can severely diminish a person's communicative ability. Voice therapies could take advantage of prosodic variables to better target disabilities in voice production and communication.

Keywords: Prosody; Voice; Speech; Communication; Language; Voice therapy

Introduction

The ability to communicate through sound exists in much of the animal world, and has developed with evolution. Humans have more complex voicing due to both anatomy and neural control, the ability to shape sounds through an elongated pharynx with oral articulation of words, and a higher ability to group words into language. But there is an even more complex meaning to speech that is conveyed through emotion and expression. Also, developmental and cultural influences affect the way language is communicated.

Prosody is the variation of voice dynamics, pitch, rate, and timbre that carries meaning in speech. Emotion, social values, and semantics are conveyed in these variations, which cannot easily be expressed through word meanings alone. Prosody therefore communicates emotional subtext, and even plays a critical role in animal and human interaction and survival. The importance of prosodic variations across animal species emphasizes the innate role of emotional communication. This is echoed in human development from infancy to adulthood, and in the spectrum of cultural diversity. Table 1 shows the three principal domains of prosody, which are pitch (inflection), dynamic (loudness or accent) and duration (rate of speech). The table gives examples of how these factors are used in animal communication, human development, and in different languages where these features are accentuated.

Disturbances in the production or comprehension of prosodic meaning can impair communication and decrease the quality of social interaction. Our understanding and use of prosody have the potential to enhance our therapies for voice patients, and to aid in the treatment of conditions such as vocal fatigue, miscommunications, and aberrant voice patterns such as vocal fry. The purpose of this review is to present an overview of prosody in speech from evolutionary, developmental, and cultural perspectives, and to summarize how these principles can be applied to therapy for voice disorders.

Animal Models

The origins of emotional communication can be seen in the ways animals interact. While chemical or visual pathways may supply the majority of communication in certain species, it is the acoustic domain that interests us. In order for sound production from an animal to contain prosody, there needs to be variable use of pitch, loudness, and rate to convey an emotional state. Except in cases of duress, it is difficult to intuit an animal's exact emotional state, and care should be taken in assigning meaning [1]. Yet even excluding emotional intent, prosodic domains are present and we will look at some examples of animals making use of these variations.

Bird song represents the most florid use of pitch changes among animals. Finches provide one of the best examples of the length and complexity of a bird's ability to vary pitches and rhythms. The male finch courtship song can be dozens of notes long with complex rhythms. It is learned in 'chunks' by young male finches, and is unique to each individual. This implies that selection of a mating partner by a female may largely be based on the particular characteristics of his song [2]. Heather Williams confirms that longer, more complex patterns tend to be more favored, thus selectively breeding the need and ability to make use of these variations. The complexity of the neural substrates used by the zebra finch in the learning, production, and apprehension of these songs is impressive (Figure 1).

Not only do birds produce patterns with variable prosody, but they can also understand and differentiate between patterns. Several research teams have studied the ability of Javanese sparrows (Padda oryzivora) [3] and zebra finches [4] to distinguish between two different prosodic variations of a short Japanese phrase. Not only were the sparrows competent in this task, but they were able to generalize the same prosodic discrimination to novel sentences [5]. But care needs to be taken when assigning lexical or emotional meaning to bird vocalizations. Mimicry is the non-purposeful copying of a behavior, while imitation in animal studies is defined as the process in which an organism purposefully observes and copies the methods of another in order to achieve a tangible goal [5]. Irene Pepperberg, PhD and her colleagues have sought to show that the African grey parrot (Psittacus erithacus) can differentiate meaning and syntax, and makes use of the words taught to it for meaningful communication [6]. One subject named Alex had a vocabulary of 150 words and knew the names of 50 objects. Some researchers believe that the training method that Dr. Pepperberg used with Alex (called the model-rival technique) holds promise for teaching autistic children who have difficulty learning language and empathy, both skills involving prosodic communication [7].

The domestic dog (Canis lupus familiaris) is one of the few animals to be selectively bred for human companionship. It is estimated that the dog has only been bred from the wolf (Canis lupus) for the past 15,000 years or so. In order for the dog to integrate with human endeavors, it was desirable for it to be able to comprehend human intentions and emotions. What better companion than one who can understand your emotions? The wolf already possesses a rich vocabulary of postural attitudes for communicating position within a pack, and its calls are used for social communication. It is therefore a small step for the dog to be selectively bred towards an ability to read meaning into human vocalizations.

The ability of dogs to imitate human prosody is seen in numerous YouTube videos. But what is fascinating is the intention with which dogs will observe the mouth movements of a child who is trying to communicate with the dog, and even copy the child's vowels and pitches. This is generally recognized as an awareness of others, as distinct from self. The intent involved in these examples is seen also in the dog’s ability to have joint attention. When you point to an object in front of a chimpanzee, it will raise its hand in mimicry - not understanding the referral. A dog's social intelligence will enable it to understand that you are directing its attention to something shared between you and it.

While Canis lupus familiaris was selectively bred to possess the abilities to imitate, if not express emotional communication, the evolution of human emotional expression was forced to rely on natural selection. One theory of human evolution holds that as Homo habilis progressed through Homo erectus to Homo sapiens there began to be an increased need for eusocial skills. Eusociality within a group demands the segregation of skills for the benefit of the group. This was necessary for the evolving hunter-gathering habilines. Further, it demands a social intelligence that values intuitive communication to better achieve collaborative efforts. With the rapid enlargement of cranial capacity from 680 cc in Homo habilis to the 1,400 cc of Homo sapiens that was partly fueled by the higher protein intake, there grew an increased ability to communicate the complex emotional subtexts that were so necessary for cooperative endeavors.

Human Development and Communication

The development of expressive communication skills in a child starts with their first cry at birth. Receptive abilities are thought to start in utero and would certainly be heard as changes in pitch, loudness, and tone. While initial vocalizations take the form of on/off pitch and loudness levels, the infant learns through mimicry and imitation during the first years that variations of intonation and loudness carry meaning on their own. As the larynx descends during the first year of development, the child begins to acquire 'conversational babbling'. This type of babbling incorporates the stresses, timing and inflections of mature speech but without discernible words. The extent to which this appears to fulfill some type of communicative intent is evident in the back-and-forth jargon speech between toddlers or even between a toddler and canine.

Disturbances in acquisition of prosody can occur in certain congenital disorders such as those on the autism spectrum, or Prader-Willi Syndrome [8]. One theory of language development [9,10] holds that prosodic abilities are a necessary precursor to the acquisition of linguistic abilities. The loss of this layer of contextual meaning in communication can result in poor social interactions, misinterpretations of received information, and even a downregulation of the pathways responsible for production and reception of suprasegmental information (variations of individual phonemes or words, or sound patterns within sentences, are known by linguists as "suprasegmentals"). The effect can clearly be seen in those diagnosed with Asperger's Syndrome [11].

The variable use of prosody is an important part of pragmatics. Effective communication requires that we modify not only the semantics and syntax of our speech depending on the audience, but also the tone, cadence, and inflection. This type of code switching is easily seen in the way we speak to children (greater variety of pitch, dynamic, and tonal changes) versus the manner in which we would speak at a colloquium. The misapplication of these codes can have unintended and sometimes negative results. For instance, elderly patients in long-term care facilities are frequently offended when staff speak to them in a sing-song, child-like code. Similarly, the extent of our prosodic variations of inflection, stress and tone is expected to change in proportion to the distance between us and our audience. The degree of inflection used on stage would be inappropriate for conversation over a cup of coffee, and vice versa. Even the use of amplification does not permit us to decrease prosodic variation in a public speaking venue. The mind of the listener would still expect greater prosody for greater distances, the lack of which would literally be perceived as monotonous.

Prosody has been recognized as a crucial part of rhetoric and persuasion since at least the ancient Greeks. Plato in the Gorgias proposed that poetry was a subgenre of rhetoric (Gorgias, 502c), thus suggesting that rhythms, inflections, and other prosodic features of speech are a critical part of persuasion [12]. Ancient Chinese writings from Confucian, Daoist, and Mohist traditions refer to the importance of persuasion in speech, a concept termed mingbian by Xing Lu, PhD, as being a critical part of discourse and thought [13]. Cicero refers to rhetoric as 'speech designed to persuade' and includes the modulation of the voice (pronuntiatio) as one of the five skills necessary in the development of rhetorical skills.

Rhetorical skills have been a central part of cultural education for most of the last three millennia. The Western classically based education of the past two millennia places rhetoric as one of the three principal studies found within the trivium. Tutors were provided to young Roman men to teach them the art of rhetorical delivery. In America, primers on rhetoric were printed for the expanding group of educated young men and women into the 20th century. One such volume, The Columbian Orator, was a central part of young people's education in this country throughout the first half of the nineteenth century [14]. In its introduction it teaches the young citizens of the new republic that "The influence of sounds, either to raise or allay our passions, is evident in music. And certainly the harmony of fine discourse, well and gracefully pronounced, is as capable of moving us, if not in a way so violent and ecstatic, yet not less powerful, and more agreeable to our rational faculties. As persons are differently affected when they speak, so they naturally alter the tone of their voice, though they do not attend to it. It rises, sinks, and has various inflections given it, according to the present state and disposition of the mind."

Cultural Diversity of Languages and Music

It should come as no surprise that music and language can be compared in the manner in which they modulate to mirror the speaker's emotions. Each language has its own hierarchy of prosodic features that are used to clarify syntax and meaning. These prosodic elements are reflected in the music coming from each language. Germanic languages make use of dynamic stresses to mark strong syllables and to give emphasis, whereas other languages may rely more on inflectional accents (Romance languages), or even durational accents (ancient Greek, Cantonese) on certain syllables. Not understanding the prosodic structures of a language can yield disastrous consequences and misunderstandings.

The development of Western musical notation illustrates the connection between the prosodic elements of a language and their representations in music. Early examples of Western psalters and written liturgical texts from the 6th through 9th centuries show diacritical markings over the Latin and Greek texts (Figure 2) [15]. These diacritical markings served the same purpose as they do today in written scripts - showing relative inflection, pauses, and durations.

Musicologist Leo Treitler concludes his article on the origins of music writing by saying: "All Western notations in the beginning represented speech inflection. Either the notational symbols are the written syllables of speech themselves (as in the Musica enchiriadis, c. 900 AD) or they are written in the closest coordination with syllables. ... the earliest specimens are notations closely tied to syllables in syllabic or neumatic settings" [16].

Figure 3 shows how these diacritical accents were expanded into chironomes, indicating rising inflections such as in the phrase ὁ Χριστὸς υἱὸς τοῦ Θεοῦ (Christ the son of God) at the end of that sentence (before the first + near the end of the second line). One can see how similar this circumflex chironome (~) is to our modern grammatical mark signifying a question - just rotate it 90 degrees clockwise, and place a period under it to indicate the end of a sentence. The fact that Spanish maintained the use of this grammatical mark at the beginning of a question indicates the part it played in conveying the inflectional intention of the line to the person who would have read such a sentence aloud to an audience. The relative downplay of dynamic stress in Romance languages can be seen in the way melodies were notated into the 13th century without any reference to strong and weak beats. One can see the inflectional nature of a language such as Old French in Figure 4.

The use of measures and bar lines was not common until the Renaissance, when more pieces in English and German (with their reliance on strong and weak syllables) were being set. In fact, the Liber Usualis [17] - the collection of Latin Catholic prayers and their musical settings - never did make use of measures or meters that would have implied strong and weak syllables.

In Western music, the use of dynamic, tonal, and other notations has steadily increased to the present. While this is meant to more explicitly convey the composer's intention for vocal colors from the singer, it changes the role of the performer. When there is only text and an indication of melodic direction, the performer is a first-person narrator, and can more easily overlay her own emotions onto the text through the use of prosodic elements. When the musical score is fully delineated with interpretive dynamics, changes in motion and tone, the performer becomes more of an interpreter of the composer's image of the text. The result is more like third-person narration.

Emotional Contexts

While the clarification of phonemic, syntactic, and semantic constructs is an important function of prosody, it is the signaling of emotional state that is its most important function for voice therapy. Many communicative misunderstandings are the result of unintentional (or misintended) changes in vocal tone, loudness, or emphasis. How often have we heard, "It's not WHAT you said, but HOW you said it!" And because the emotional subtext of utterances can only be conveyed through the spoken word, we have created a shorthand method to signal these meanings in our written language. This can be seen in the previous several sentences in the use of CAPITALS, exclamation marks!, underlining, bold, italics, abbreviations (LOL), and emoticons to convey changes in vocal tone, accent, and inflection, all of which communicate emotional subtext.

So necessary and primal is this underlying layer of meaning that its absence is part of the definition of disability. If an individual is unable to convey or apprehend emotional meaning through suprasegmentals (such as inflection, facial gestures, or other gestures), this is seen as one of the criteria for such disabilities as Asperger's syndrome, autism, or traumatic brain injury. Conversely, an abundance of ability in conveying and perceiving emotional subtext can be highly valued, and is a skill used by the very psychologists and psychiatrists who diagnose those with the previously named deficits.

The mechanisms by which emotions are communicated from one individual to another have been studied and explained by Antonio Damasio and many others in the neurophysiology field. Damasio’s separation of emotions from feelings is clearly described in his writings, and enables us to separate the neurophysiologic processes of expression/perception from those of interpretation/body state [18]. He further upsets our preconceived notions by showing that emotions, the public reflexes of our exposure to stimuli, are precedent to feelings, our private body state's interpretations of those stimuli [19].

Investigations into the mirror neuron system (MNS) [20,21] are providing insights into ways that we might be receiving emotional meaning from external stimuli [22]. In the broadest sense this type of shared emotional state is called empathy, but research is now showing how the MNS may be differentially supporting imitation and various forms of empathy such as cognitive (theory of self), motor, and emotional. Unfortunately, most investigations into avenues of external stimulus input are heavily weighted toward visual stimuli [23,24], and largely have not studied acoustic inputs such as prosody.

The absence or misapplication of pragmatic communication skills such as prosody can result from congenital conditions (Autism Spectrum Disorders (ASD), Williams Syndrome, Prader-Willi Syndrome, etc.) or can be acquired through brain injury resulting from strokes, pathologies, mechanical injuries, or emotional traumas. The bidirectional pathway of emotive expression and reception may be so tightly constructed that in these cases both the perception and expression of emotions can be affected. Alexithymia is the name for this decrease or absence of prosodic use and awareness in an individual, resulting from certain acute psychological conditions or emotional trauma. Depression or severe psychological trauma, especially in younger individuals, can cause a dissociation in which the individual no longer communicates emotions through the use of prosody, facial gestures and other physical signifiers. This may be a self-protective mechanism. The nature of this dissociation is debatable, with evidence pointing variously to deficits in the MNS and/or the individual's psychological responses to emotional trauma.

Psychoanalyst Joyce McDougall objected to the strong focus by clinicians on neurophysiological rather than psychological explanations as a basis for alexithymia. She created the term disaffectation to stand for psychogenic alexithymia [25]. McDougall proposed that the disaffected individual had at some point experienced overwhelming emotion that threatened to attack their sense of integrity and identity, to which they applied psychological defenses to eliminate all emotional representations [26].

Applications to Voice Therapy

Strong emotional substrates and the use of the voice for communication of those emotions each allow prosody to be used in voice therapy. Variations of rate, dynamics, and inflection all make use of the muscles of phonation and articulation; they integrate many pyramidal and extrapyramidal pathways in their use; and they may better be generalized because of the strong emotive and communicative associations [27].

As discussed previously, prosody is necessary for pragmatic communication such as when code-switching. The relative absence or superabundance of prosody can signal possible acquired or congenital conditions. Prosodic variations necessitate the differential use of muscle pairs in the larynx, which can increase the efficiency of voicing and strengthen emotional communication. Table 2 shows examples across the range of prosody from animal, developmental, linguistic, and pragmatic domains.

The final column suggests a similar movement from cortical areas of control to limbic and brainstem areas that might correlate to the shift from lesser to greater degrees of prosodic variation. Prosody in voice therapy is particularly useful in targeting vocal fatigue, functional dysprosodia (vocal fry, accent modification, gender coding), and pragmatic dysprosodia. Table 3 outlines these categories, their etiologies, functional deficits, treatment goals, and some representative therapies.

Vocal fatigue is a peripheral fatigue involving muscles and transmitters at the neuromuscular junction that usually results from an imbalance of demands versus capacity. One theory proposes that this neuromuscular transmission failure can result from sustained co-contractions of muscle pairs, as often occurs in compensatory tension behaviors [28]. Behaviors that can result in vocal fatigue include those where the speaker uses less prosodic variation of loudness, pitch, and rate.

One population highly susceptible to this lack of prosodic variation is beginning teachers. A new teacher will often attempt to acquire control over a class by using an authoritative voice - a lowering of average fundamental frequency, an increase in subglottal pressure, and a measured rate of delivery [29]. Young teachers will frequently start to experience severe vocal fatigue by their second year of teaching, by which point they have acquired this maladaptive behavior. By habituating these teachers to a greater use of variable pitch, dynamics and rate, they learn to make a synergistically variable use of opposing muscle pairs, avoiding the co-contraction that can result in peripheral fatigue.

Functional dysprosodia (vocal fry, accent, gender) is similarly a habitual, non-pathologic behavior. Vocal fry is the result of a reduction in all aspects of functional vocal input from respiratory, adductory, and prosodic domains. This can be a socio-pragmatic attempt to acquire the latest vocal Zeitgeist, or a genuine result of central fatigue such as one experiences at the end of a long working day. The result of this static positioning of muscle pairs can be decreased capabilities in louder situations, and vocal fatigue. Again, a rehabituation of greater variety in pitch inflection, dynamic variation and changes in tone and rate is usually all that is required to help a 17-year-old regain the flexibility and stamina required to sound more intelligent and interesting to the rest of the world. Accent and gender therapies are meant to habituate new voice and speech patterns that would be perceived as more appropriate by the speaker and social mores.

For the purposes of voice therapy, pragmatic dysprosodia is distinct from vocal fry in the sense that it represents a central pathology in the form of an emotional dissociation. This dissociation is the result of either a congenital or mechanical deficit, or an emotional trauma. Congenital conditions such as ASD and Prader-Willi Syndrome are well known to be marked by decreased pragmatic communication skills. Depending on the level of ability of the patient, pragmatic acoustic cues in speech can be taught, although the degree of integration into more central receptive and expressive centers is debatable [30]. Similarly, traumatic brain injuries and cerebrovascular accidents can also affect a patient's receptive and expressive abilities with prosody. This deficit can appear in the form of a misinterpretation or absence of socio-emotional cues in speech [31]. Again, as with congenital disorders, therapy can often either increase the awareness of pragmatic cues, or re-label their meanings for the patient.

Pragmatic dysprosodia can also result from an emotional trauma. This type of alexithymia is often seen as a self-protective dissociation of the patient from emotional content. This usually takes the form of monotonal or monodynamic speech. A study of veterans with chronic PTSD showed that "compared to healthy comparison subjects, veterans with chronic PTSD performed poorly on the Aprosodia Battery, on a par with patients with right hemisphere brain damage" [32]. While deficits acquired from emotional trauma such as PTSD must be addressed by a licensed psychotherapist, the voice therapist can play a significant collaborative role in rehabituating and relabeling the affected voicing patterns. As seen throughout the corpus of Damasio's work, both the generation and apprehension of physical emotions can trigger the emotions themselves, and care needs to be taken to work with the psychotherapist to prevent dissociations when eliciting greater prosodic variations in those with emotionally triggered alexithymia.

Conclusion

Prosody plays a critical role in communication, across animal species, at different stages of human development, and across cultures. It conveys emotional subtext through signals in voicing parameters such as pitch inflection, dynamic stress, rate, and tone. Our understanding and use of these domains of prosody can enhance and expand our voice therapies, helping to treat conditions such as vocal fatigue, miscommunications, and aberrant speech and voice patterns such as vocal fry.