Articles


Voice Lab in Clinical Practice

Speech Skill Builder for Children

Use and Understanding Voice Lab for Singers

Relationship Between Acoustic Measures of Voice and Judgments of Voice Quality

Respiration

 

Voice Lab in Clinical Practice

Daniel Zaoming Huang, Ph.D.
Tiger DRS, Inc.
 
Henian Huang, M.D.
Otolaryngology Institute, Shanghai Medical University
83 Fen Yang Road, Shanghai 200031, P.R. of China

1. Introduction

A computer with clinical software provides valuable assistance during the assessment and treatment of voice disorders. Demographic information, such as names, addresses, number of visits, types of disorders, progress reports, and insurance claims, is easy to display. When such clinical software is installed on a laptop computer, it gives clinicians a "portable clinical voice laboratory" equipped with powerful clinical tools that can easily be carried from one treatment location to another. Today’s practitioners can benefit from the use of clinical software that has been adapted to the needs of clinicians.

Acoustic analysis provides a quantitative assessment of voice quality and vocal function. Electroglottographic (EGG) measurement gives non-invasive, objective information on the contact behavior of vocal fold vibration. Acoustic and EGG measures should be used in the routine clinical examination, as they substantially complement the endoscopic examination. All three kinds of measurement can be performed with the "Dr. Speech" and "ScopeView" software from Tiger DRS, Inc., helping laryngologists and voice pathologists make a correct diagnosis of voice disorders and/or monitor progress in voice therapy.

2. Methods and Procedures

This new technology is being introduced to help perform important diagnostic and therapeutic procedures for laryngeal disorders by combining multimedia technology with endoscopy. It enables outpatient procedures to gather a wide array of information not only for laryngologists but also for voice pathologists, making it easier for them to work together. This paper introduces a powerful tool designed to enhance clinical observation of vocal fold vibration while simultaneously showing the electroglottographic and acoustic waveforms, as shown in Figure 1.

Figure 1. Video with acoustic and EGG signals simultaneously

2.1 Objective Measures of Acoustic and EGG

This technology provides not only measures of F0, intensity, jitter, shimmer, glottal noise, ratio, tremor, statistics and spectrum from the voice signal, but also measures of EGG-pitch, EGG-intensity, EGG-jitter, EGG-shimmer, EGG-noise, CQ, CQ perturbation, CI, CI perturbation, opening rate, and closing rate from the EGG signal. Figure 2 is a real-time display of acoustic and EGG signals with power spectrum and contact quotient analyses. A number of parameters can be obtained immediately after recording. These parameters can be of considerable use in arriving at a diagnosis, in evaluating the effects of surgery, or in tracking progress during voice therapy.

Figure 2. Acoustic and EGG signals with power spectrum and contact quotient
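To make the cycle-based definitions behind a few of these parameters concrete, the following minimal Python sketch computes jitter, shimmer, and the EGG contact quotient (CQ) from already-segmented cycle data. It is only an illustration of the standard definitions, not the Dr. Speech implementation, and the cycle values in the example are invented.

```python
# Minimal sketch of common definitions for jitter, shimmer, and EGG contact
# quotient (CQ). Cycle segmentation is assumed to have been done already
# (periods in seconds, peak amplitudes, per-cycle contact durations from EGG).
import numpy as np

def jitter_percent(periods):
    """Mean absolute difference of consecutive periods, relative to the mean period."""
    periods = np.asarray(periods, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def shimmer_percent(amplitudes):
    """Mean absolute difference of consecutive peak amplitudes, relative to the mean."""
    amps = np.asarray(amplitudes, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(amps))) / np.mean(amps)

def contact_quotient(closed_durations, periods):
    """CQ per cycle: contact (closed) phase duration divided by cycle duration."""
    return np.asarray(closed_durations, dtype=float) / np.asarray(periods, dtype=float)

# Example with made-up cycle data for a roughly 125 Hz voice:
periods = [0.0080, 0.0081, 0.0079, 0.0080, 0.0082]
amps    = [0.92, 0.95, 0.91, 0.94, 0.93]
closed  = [0.0042, 0.0043, 0.0041, 0.0042, 0.0044]
print(jitter_percent(periods), shimmer_percent(amps), contact_quotient(closed, periods).mean())
```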

2.2 Objective Observation of Vocal Fold Vibration

Glottal closure patterns, the mucosal wave, tissue elasticity and other irregularities can be easily observed in real time with the help of this technology. Our clinical experience using this technology to observe single and multiple consecutive images of vocal fold vibration indicates that it is easy to inspect the physiological and pathological status of the larynx and to monitor the effect of voice therapy. Digital processing, such as editing, enlarging, noise removal, smoothing, and sharpening, has been applied to enhance image clarity so that clinicians can view and analyze the patient’s larynx carefully. Figure 3 shows a multiple-frame viewing feature with glottal area detection (yellow color).

Figure 3. Glottal area detection during vocal fold vibration by the edge detection method

3. Results

3.1 Voice Disorders Database and Clinical Implication of Acoustic and EGG Parameters

Figure 4 provides an example of a voice and EGG profile for a papilloma patient after surgery (male, age 42). The top window shows ten voice and EGG parameters from the patient, together with the normal range of these ten parameters. The bottom window shows an estimate of vocal function (regularity of vocal fold vibration and glottal closure time) and an estimate of voice quality (harsh, breathy and hoarse). Figure 5 shows F0, intensity and spectrogram for the acoustic signal, and vocal fold contact behavior from the EGG signal. The clinical implications of these parameters are: (1) hoarseness can be considered a combination of breathiness and harshness; (2) jitter appears to be related primarily to harsh voice quality; (3) shimmer appears to be the primary influence on hoarse voice quality; (4) the magnitude of glottal noise energy (NNE) closely corresponds to breathy voice quality; (5) EGG measures are chiefly related to laryngeal behavior that occurs during vocal fold contact; (6) CQ reveals information about the degree of glottal closure, and CI gives information about the symmetry of vocal fold vibration; and (7) CQP and CIP reveal the regularity of vocal fold vibration.

Figure 4 Vocal function estimates                               Figure 5. F0, intensity, spectrogram and CQ displays

Sustained phonations from a large number of normal speakers (2,937) and patients with voice disorders (902) were recorded and stored digitally to create the Voice Disorders Database. Figure 6 shows a database comparison of the normal group vs. the recurrent laryngeal nerve (RLN) paralysis group by NNE. Figure 7 compares the normal group vs. the glottic cancer group using jitter and NNE.

Figure 6. Normal vs. RLN paralysis by NNE       Figure 7. Normal vs. glottic cancer by NNE & Jitter

Table 1. Discrimination Between Normal and Pathological Voices

Laryngeal Pathology    Samples detected as Normal    Samples detected as Pathological    Total
Normal                 2685                          252                                 2937
Glottic cancer T1      13                            49                                  62
Glottic cancer T2      1                             23                                  24
Glottic cancer T3-4    0                             18                                  18
Vocal fold polyp       26                            52                                  78
Vocal fold polypoid    21                            55                                  76
RLN paralysis          5                             30                                  35

To evaluate how well acoustic and EGG measures distinguish normal from pathological voices, the experiment summarized in Table 1 was carried out. A composite value was obtained from several acoustic and EGG parameters. A voice sample was regarded as normal if this composite value was smaller than a threshold, and as pathological if it was larger than, or equal to, the threshold.
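As an illustration of this kind of decision rule, the sketch below combines several (already normalized) parameters into one composite value and compares it with a threshold. The parameter set, weights, and threshold are hypothetical; the paper does not report the actual values used.

```python
# Minimal sketch of a composite-value threshold classifier for voice samples.
# Weights, threshold, and the feature list are illustrative assumptions only.
import numpy as np

def composite_score(features, weights):
    """Weighted sum of (already normalized) parameter values."""
    return float(np.dot(features, weights))

def classify(features, weights, threshold):
    """'normal' if the composite value is below the threshold, else 'pathological'."""
    return "normal" if composite_score(features, weights) < threshold else "pathological"

# Hypothetical normalized measurements: [jitter, shimmer, NNE, CQ perturbation]
sample  = np.array([0.4, 0.3, 0.8, 0.2])
weights = np.array([0.3, 0.2, 0.4, 0.1])
print(classify(sample, weights, threshold=0.5))
```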

3.2 Quantitative Information from Video Images with Acoustic and EGG Signals

The glottal area profile for four complete cycles of vocal fold vibration is provided in Figure 8. Some important parameters, such as the open quotient (OQ), can be obtained from this glottal area profile. With practice, perceptual judgment of video images provides a great deal of information, but quantitative information is needed to warrant firm conclusions, because vibration patterns depend on F0, intensity and vocal register. F0 increases with increasing vocal fold tension or stiffness, and decreases as vocal fold mass increases.

 
Figure 8. Glottal area change during vibration                             Figure 9. Ratio
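A minimal sketch of how an open quotient could be read off a glottal area profile like the one in Figure 8 is given below: OQ is taken as the fraction of frames in which the detected glottal area exceeds a small threshold. The frame rate, threshold, and test signal are assumed values, not those of the ScopeView software.

```python
# Minimal sketch: open quotient (OQ) as the fraction of frames in which the
# glottal area exceeds a small threshold. Threshold and test data are assumed.
import numpy as np

def open_quotient(area, open_threshold=0.05):
    """OQ (%) over the analyzed span: fraction of frames in which the glottis is open."""
    area = np.asarray(area, dtype=float)
    area = area / area.max()                       # normalize to the largest opening
    return 100.0 * np.count_nonzero(area > open_threshold) / area.size

# Example: four synthetic cycles of a half-rectified "area" profile at F0 = 125 Hz,
# sampled at 2000 frames per second; OQ comes out close to 50 %.
t = np.arange(0, 4 / 125, 1 / 2000)
area = np.maximum(np.sin(2 * np.pi * 125 * t), 0.0)
print(open_quotient(area))
```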

In Figure 9, objective information about the ratio of vocal fold length to width (RLW) and the ratio of glottal area height to width (RHW) during inspiration is useful for drawing correct conclusions. Table 2 provides objective data for eight patients with unilateral left RLN paralysis, and shows a significant difference in RLW between the two vocal folds, a higher open quotient (OQ), and a lower F0.

Table 2. Quantitative information for patients with unilateral left RLN paralysis

Unilateral RLN paralysis    RLW (left)    RLW (right)    RHW     OQ (%)    F0 (Hz)
Subject 1 (male, 32 y)      6.64          9.91           12.3    52.1      112
Subject 2 (male, 24 y)      1.42          3.31           10.6    62.4      105
Subject 3 (male, 55 y)      3.22          5.44           7.7     57.5      127
Subject 4 (male, 47 y)      6.01          7.88           8.5     61.9      98
Subject 5 (male, 37 y)      3.55          5.66           11.2    47.8      121
Subject 6 (female, 21 y)    2.11          6.44           9.9     77.5      178
Subject 7 (female, 44 y)    2.37          4.82           6.7     63.9      196
Subject 8 (female, 57 y)    1.25          3.77           8.9     75.6      210

4. Conclusion

The quantitative information from video images, acoustic signals and EGG signals is very useful for performing reliable vocal assessment and therapy pre-operatively and post-operatively. Documents and color images can be printed before and after surgery for patient records and insurance purposes.

Acknowledgments

Research on voice disorders has been supported by Dr. Colin Watson and Prof. A. Maran at the University of Edinburgh in the UK, and by Dr. Mara Behlau at the Clinical Voice Center in Brazil. The first author was supervised by Prof. H. Kasuya at Utsunomiya University in Japan and by Prof. Fred Minifie at the University of Washington in the USA. Their expertise has provided positive input to the design of this new technology.

References

Huang, Z., & Hu, N. (1988). Research for laryngeal cancer evaluation and diagnosis, Journal of Biomechanics, 3-2, 15-20.

Huang, Z., Minifie, F., Kasuya, H., & Lin, X. (1995). Measures of vocal function during change in vocal effort level, Journal of Voice, 9(4): 429-438.

Kasuya, H., Ogawa, S. & Kikuchi, Y. (1986). An acoustic analysis of pathological voice and its application to the evaluation of laryngeal pathology, Speech Communication, 5-2.

 

 

Speech Skill Builder for Children

Daniel Zaoming Huang, Ph.D., M.Engr.

Abstract

This paper describes a voice-activated, game-like tool that provides real-time reinforcement of a client’s attempts to produce changes in pitch, loudness, voiced/unvoiced phonation, voicing onset, maximum phonation time, sound awareness and vowel tracking. Children, in particular, enjoy therapy with this colorful, interactive video game because they receive immediate feedback on their performance. Clinicians will enjoy the versatility and unique features of this technique. For example, while a child is playing a game, the clinician can quickly review a graphical display or statistical data of the child’s performance. The technique is divided into two groups: (1) Awareness, which teaches children about the attributes of their voice, and (2) Skill Builder, which gives the user goals to achieve for a given range and time. Examples of comprehensive user logs and of tracking a client’s progress are provided. Best of all, real-time recording and playback gives you the tools you need to maximize your client’s therapy.

1. Introduction

Innovative computer technologies are not only meeting the needs of persons with speech disorders, but also helping laryngologists and speech pathologists perform more accurate and professional services. This paper provides a tool for taking up this challenge in a more efficient way: not only to produce better reports and more efficient therapy, but also to serve the clients who are counting on you and your ability to treat them in the best way. With a desktop or laptop PC, a 16-bit sound card, a microphone and speakers, the clinician has met the simple requirements for starting and operating a speech laboratory in clinical practice.

2. Methods and Procedures

Speech is a product of the interaction of respiratory, laryngeal and vocal tract structures. The larynx functions as a valve that connects the respiratory system to the airway passages of the throat, mouth and nose. The vocal tract system consists of the various passages from the glottis to the lips. It involves the pharynx, oral and nasal cavities, including the tongue, teeth, velum, and lips. The production of speech sounds through these organs is known as articulation.

For speech skill building, it is necessary to focus on the acoustical and physiological phenomena in both the laryngeal and vocal tract systems. Parameters such as pitch, loudness, voicing, voicing onset, phonation time and formants are closely related to these two systems. This paper employs the Speech Therapy program, clinical software from Tiger DRS. The software provides real-time cartoon displays of continuously varying pitch, loudness, voicing, voicing onset and phonation time, so that children receive immediate, engaging feedback on their performance. In other words, the acoustical and physiological phenomena produced by the children can be evaluated with this technique. Clinical applications of the technique are described in detail in the following experiments.

Experiment 1: Pitch Skill Builder

Using the pitch module, clinicians can help children refine pitch control and develop smooth modulation of the pitch contour. Certain patients consciously or unconsciously make an effort to raise or lower their pitch. The clinician should teach the patient to target the optimum pitch through control of vocal fold vibration. For example, one of the best ways to refine pitch control is to use the rise-fall pitch technique. In Figure 1(a), by extending /a/ in front of the microphone, the child makes the boat move around the rocks following a rise-fall pitch pattern. With this game, children receive immediate feedback on their pitch performance. After the game, the clinician can look at objective information about the pitch control, as shown in Figure 1(b). In clinical practice, the clinician may select different pitch patterns for the different needs of the patients.

 
(a) Real-time cartoon display                           (b) Objective information of pitch curve
Fig. 1. Pitch controls how the boat moves around the rocks (target: rise-fall pitch pattern)
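For readers who want a concrete picture of the kind of processing that can drive such a display, the sketch below estimates F0 frame by frame with a simple autocorrelation method. The paper does not describe the Dr. Speech pitch algorithm, so this is only an illustrative stand-in; the window length and voicing threshold are assumed values.

```python
# Minimal sketch of frame-level F0 estimation by autocorrelation.
# Not the Dr. Speech algorithm; thresholds and ranges are assumptions.
import numpy as np

def estimate_f0(frame, fs, f0_min=75.0, f0_max=500.0):
    """Return an F0 estimate (Hz) for one frame, or 0.0 if the frame looks unvoiced."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min, lag_max = int(fs / f0_max), int(fs / f0_min)
    if lag_max >= len(ac):
        return 0.0
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    if ac[lag] < 0.3 * ac[0]:          # weak periodicity -> treat as unvoiced
        return 0.0
    return fs / lag

# Example: a 40-ms frame of a 220 Hz tone sampled at 16 kHz (prints about 219 Hz).
fs = 16000
t = np.arange(int(0.04 * fs)) / fs
print(round(estimate_f0(np.sin(2 * np.pi * 220 * t), fs), 1))
```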

The pitch measure provides information about intonation. Pitch is determined mainly by the rate of vocal fold vibration. In the Pitch Skill Builder, the clinician should help the patient find the optimum pitch and pitch range and learn how to maintain them. In clinical practice, a complete statistical report before and after therapy is important. Table 1 lists the pitch changes during three-week therapy with the pitch skill builder technique for male patients with female-sounding voices. The result of the speech therapy is obvious.

Table 1: Pitch changes during three-week therapy

 

                       Patient 1       Patient 2       Patient 3       Therapy Technique
                       (male, 12 y)    (male, 15 y)    (male, 17 y)
Ave. Pitch (week 1)    282 Hz          325 Hz          208 Hz          --
Ave. Pitch (week 2)    253 Hz          287 Hz          181 Hz          Warm-up; "rise-fall pitch" skill builder
Ave. Pitch (week 4)    232 Hz          262 Hz          157 Hz          Warm-up; "flat-pitch" skill builder

Experiment 2: Loudness Skill Builder

Using the loudness module, clinicians can help children lower their speech loudness when the habitual level is too high, and raise it when the habitual level is too low. The clinician should teach the patient to control loudness changes through correct control of breathing. For example, one way to control loudness is to use correct breathing and body position. In Figure 2(a), by increasing loudness with a good body position, the child makes the fireman climb higher toward the top target. With this game, children receive immediate feedback on loudness changes with different body positions (standing vs. sitting). After the game, the clinician can look at the loudness data for the different positions (standing vs. sitting), as shown in Figure 2(b). The top target corresponds to a loudness level that can be modified by the clinician.

 
(a) Real-time cartoon display                        (b) Objective information of loudness curve
Fig. 2. Loudness controls how high the fireman climbs (target: top).
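The loudness feedback can be pictured as a frame-level dB track of the kind sketched below. The values are relative dB (RMS re full scale plus an arbitrary offset); calibrated readings such as those in Table 2 would require a calibrated microphone. This is an illustration, not the Speech Therapy implementation.

```python
# Minimal sketch of a frame-level loudness track in relative dB.
# Frame size and the dB offset are assumed values.
import numpy as np

def loudness_db(signal, fs, frame_ms=20, offset_db=90.0):
    """Return one relative dB value per non-overlapping frame."""
    n = int(fs * frame_ms / 1000)
    frames = signal[: len(signal) // n * n].reshape(-1, n)
    rms = np.sqrt(np.mean(frames ** 2, axis=1)) + 1e-12
    return 20.0 * np.log10(rms) + offset_db

# Example: a soft and a loud 200 Hz tone differ by about 20 dB.
fs = 16000
t = np.arange(fs) / fs
soft = 0.05 * np.sin(2 * np.pi * 200 * t)
loud = 0.50 * np.sin(2 * np.pi * 200 * t)
print(loudness_db(soft, fs).mean(), loudness_db(loud, fs).mean())
```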

The loudness measure provides information about syllable stress; vocal loudness is determined mainly by the intensity of vocal fold vibration. In the Loudness Skill Builder, the clinician should find the best way for the patient to reach a target. Table 2 lists the loudness changes during seven-week therapy with the loudness skill builder technique for patients with right RLN paralysis. The result of the speech therapy is obvious.

Table 2: Loudness changes during seven-week therapy

 
                          Patient 1       Patient 2       Patient 3       Therapy Technique
                          (male, 11 y)    (male, 13 y)    (male, 14 y)
Ave. Loudness (week 1)    61.1 dB         66.5 dB         68.2 dB         --
Ave. Loudness (week 3)    63.4 dB         67.1 dB         69.1 dB         Warm-up; standing phonation with head turned left; loudness skill builder with correct breathing control
Ave. Loudness (week 8)    66.2 dB         67.8 dB         71.3 dB         Warm-up; sitting phonation with head turned left; loudness skill builder with correct breathing control

Experiment 3: Voicing Skill Building

Using the voicing module helps children assess their voiced and unvoiced phonation on the computer screen. Voicing refers to the vocal behavior by which continuous airflow is converted into a series of glottal pulses. Voiced phonation, such as /z/, is produced with vocal fold vibration, while voiceless phonation, such as /s/, is not. For example, one way to feel voicing is to produce phoneme pairs such as /s, z/ or /f, v/. In Figure 3, when the child produces a voiced sound, a red mouse comes in from the left side; when the child produces a voiceless sound, a green mouse appears from the right side.

Fig. 3. Voicing mode determines which of the mice will run.
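A very simple voiced/voiceless decision of the kind that could choose between the two mice is sketched below, using frame energy and zero-crossing rate. The thresholds are illustrative assumptions; the actual Dr. Speech voicing detector is not described in this paper.

```python
# Minimal sketch of a voiced/voiceless decision from frame energy and
# zero-crossing rate. Thresholds are illustrative assumptions.
import numpy as np

def is_voiced(frame, zcr_threshold=0.15, energy_threshold=1e-4):
    """Classify one frame as voiced (True) or voiceless/silent (False)."""
    frame = frame - np.mean(frame)
    energy = np.mean(frame ** 2)
    signs = np.signbit(frame).astype(np.int8)
    zcr = np.mean(np.abs(np.diff(signs)))          # fraction of sign changes
    return energy > energy_threshold and zcr < zcr_threshold

# Example: a low-frequency periodic frame vs. a noise-like frame.
fs = 16000
t = np.arange(int(0.03 * fs)) / fs
voiced_like = 0.3 * np.sin(2 * np.pi * 150 * t)
voiceless_like = 0.3 * np.random.default_rng(0).standard_normal(t.size)
print(is_voiced(voiced_like), is_voiced(voiceless_like))
```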

The voicing measure provides information about phonatory pattern. Using the voicing onset module, clinicians can assist children in modifying glottal attacks that occur before the supraglottal articulatory event.

Experiment 4: Voicing Onset Skill Building

Using the voicing onset module, the clinician can help children control their vocal fold attacks correctly. In Figure 4, when a voiced phonation is initiated, a flower opens. If you say /ba/ or /po/, the first flower opens at the beginning of /b/, and the second flower opens at the beginning of /o/, because /p/ is a voiceless phoneme.

Fig. 4. Voicing onset mode controls how the flower opens around the tree.

Voicing onset provides information about glottal attacks. How fast can you make the ten flowers open? What happens if you extend a vowel but have voice breaks? All these cases depend on voicing onset.

Experiment 5: Phonation Time Skill Building

The term Maximum Phonation Time (MPT) refers to how long one can sustain phonation. Patients are instructed to sustain the vowel /a/, or another vowel, as long as possible following a deep inspiration. MPT is decreased in many pathological states of the larynx, especially in cases with incompetent glottal closure, and MPT values smaller than 10 seconds should be considered abnormal. The clinician should show the patient the best way to coordinate respiration and phonation. In Figure 5, the strawberry moves from left to right as long as phonation is maintained after a deep inspiration. The target to reach is at the right side, and the target setting can be changed to suit the needs of the patient.

Fig. 5. Keeping phonation moves the strawberry from left to right.
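Measuring MPT from a recording can be pictured as finding the longest stretch of frames whose level stays above a silence threshold, as in the minimal sketch below. Frame size and threshold are assumed values.

```python
# Minimal sketch of measuring maximum phonation time (MPT) as the longest
# continuous stretch of above-threshold frames. Threshold is an assumption.
import numpy as np

def max_phonation_time(signal, fs, frame_ms=25, silence_rms=0.01):
    """Return the longest continuous phonated stretch, in seconds."""
    n = int(fs * frame_ms / 1000)
    frames = signal[: len(signal) // n * n].reshape(-1, n)
    active = np.sqrt(np.mean(frames ** 2, axis=1)) > silence_rms
    longest = run = 0
    for a in active:
        run = run + 1 if a else 0
        longest = max(longest, run)
    return longest * frame_ms / 1000.0

# Example: 1 s silence, 3 s sustained vowel, 1 s silence -> MPT about 3.0 s.
fs = 16000
t = np.arange(3 * fs) / fs
vowel = 0.2 * np.sin(2 * np.pi * 180 * t)
recording = np.concatenate([np.zeros(fs), vowel, np.zeros(fs)])
print(max_phonation_time(recording, fs))
```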

Experiment 6: Speech Articulation

Speech articulation within the vocal tract is determined by three major factors: the place of major constriction, the degree of constriction at that point, and the lip constriction, as shown in Figure 6 and Figure 7. The vocal tract shape and lip movement are provided for each vowel and consonant. In clinical practice, a brief education about speech articulation (tongue and lip movement) should be provided before therapy.

 
Fig. 6 Vowel production (vowel /i/)              Fig. 7 Consonant production (consonant /p/)

Real-time vowel space training reveals the first and second formants of the speech input. With this tool, the clinician can show the patient, on the computer screen, the effect of the place of major constriction in the vocal tract. The tongue tip movement mainly determines the second formant changes. For example, when a child produces the vowel series /i-e-a-u/, the vowel tracking appears as in Figure 8. From the graphic display, the clinician can judge the tongue tip position and phonetic accuracy quickly.

Fig. 8 Dynamic vowel tracking
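The vowel-space display can be pictured as frame-by-frame F1/F2 estimation; the sketch below does this with a simple LPC analysis and reads formant candidates from the angles of the LPC roots. The LPC order, pre-emphasis, and test signal are assumptions for illustration; for real speech frames a higher order would normally be used.

```python
# Minimal sketch of F1/F2 estimation from LPC root angles.
# Order, pre-emphasis, and the synthetic test frame are illustrative assumptions.
import numpy as np

def lpc(frame, order):
    """Autocorrelation-method LPC polynomial coefficients [1, -a1, ..., -aK]."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][: order + 1]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1: order + 1])
    return np.concatenate(([1.0], -a))

def first_two_formants(frame, fs, order=10):
    """Return (F1, F2) in Hz from the angles of the LPC roots."""
    frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])   # pre-emphasis
    roots = np.roots(lpc(frame * np.hamming(len(frame)), order))
    roots = roots[np.imag(roots) > 0]
    freqs = sorted(np.angle(roots) * fs / (2 * np.pi))
    candidates = [f for f in freqs if 90 < f < fs / 2 - 200]
    return candidates[0], candidates[1]

# Example: a toy frame with energy near 700 and 1220 Hz standing in for F1 and F2.
fs = 10000
t = np.arange(int(0.04 * fs)) / fs
sig = np.sin(2 * np.pi * 700 * t) + 0.7 * np.sin(2 * np.pi * 1220 * t)
print([round(f) for f in first_two_formants(sig, fs, order=4)])
```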

Experiment 7: Sound Awareness

In the Sound Awareness module, children learn to understand a normal speech level. Another important goal is to have children understand the differences among non-speech sound, speech, whistling and hissing. In Figure 9, the clinician can help the patient understand how much loudness or effort is necessary to move the graphic. The sound threshold can be set to indicate a normal, conversational speech level. If it is set too high, the object might not move at all.

Fig. 9. A seesaw moves when there is a sound over silence setting.

3. Conclusion

The speech therapy demands of a hospital require simple and well-defined therapy and assessment techniques. Where pitch, intonation, stress, loudness and articulation are of primary interest, a good and efficient speech therapy tool, such as the Speech Therapy software, is essential in clinical practice.

Acknowledgments

Research on voice disorders has been supported by Dr. Colin Watson at the University of Edinburgh (UK) and Prof. Wei Wang at the Shanghai EENT Hospital (China). Research on speech acoustics has been advised by Prof. Fred Minifie at the University of Washington (USA). Their expertise has provided positive input to the design of this technique.

 

 

Use and Understanding Voice Lab for Singers 

Daniel Zaoming Huang, Ph.D.

Colin Watson, Ph.D.
Department of Otolaryngology
University of Edinburgh, Edinburgh, UK

1. Introduction

The success of opera made it necessary to seek out new singers and to develop singing training in order to secure growth and continuity in the new art form. To the singer, the voice appears to dissociate itself from the larynx and become present in the resonant resources of the vocal tract and in the acoustics of the opera house.

The different styles of classical, musical theater, and pop music require the singer to produce a wide range of tonal qualities. For classical singers, there is a demand to maximize the resonance of the voice and find a mode of production that allows an accommodation between resonance and clear diction over a wide working range of pitches. For music theater singers, the major demand appears to be direct communication of the text and, in this style, tonal quality will always take second place to diction and the word. In the pop field, the demanded sound can range all the way from coarse and frantic to mellow, laid back, and sentimental (Miller 1959). Across this range of demands, diction can go from completely unintelligible to crystal clear. The world of pop music covers a wide range of performance styles, and within this diverse field one can encounter performers whose skill levels range from the advanced to the untutored and technically inept.

Over the last decade, understanding of the singing voice has advanced rapidly through the availability of computer-based instrumentation. Since a wide variety of tonal and articulatory demands is reflected in the different vocal genres, it seems necessary to provide a quantitative procedure that can evaluate singing voice quality and monitor the effect of singing training in the modern voice studio. The purpose of this paper is to provide an affordable, portable voice lab for voice teachers and/or singers. Designed to let voice teachers and singers use the powerful techniques of voice analysis and training without incurring the costs of special hardware, Dr. Speech is a Microsoft Windows software system that makes use of the standard multimedia capability of today’s personal computers. With this affordable voice lab, some general guidelines can be given for identifying poor vocal production based on the vocal sound.

2. Vocal Parameters for Singers

Generally speaking, the vocal sound is acoustically characterized by diction, tone quality, range, pitch, vibrato and singer’s formant (Minifie, Hixon, & Williams 1973). Since the vibrato and singer’s formant might be instantly recognized by a singing teacher, these two features were investigated.

2.1 Vibrato

Vibrato is a 4-6 Hz tremor which appears gradually as a singer develops the neuromuscular ability to sustain vowels in a resonant vocal tract and against substantial transglottal pressure. Vibrato is an essential part of the musical quality of the voice and is not controllable other than by controlling sub-glottal pressure.

The acoustic features of vibrato (Huang, Minifie, Kasuya & Lin 1995) were investigated with the Dr. Speech software. In Figure 1, the narrow-band spectrogram is excellent for assessing the location of the available formants and observing the relative strength of the harmonics of the sung tone. The positioning of these formants establishes the identity of the vowels and the musical quality of the singer’s voice. The spectrogram shows that the excessive amplitude of the vibrato is contained by the low formants. The frequency excursion of the vibrato is multiplied by the harmonic number, and the first five harmonics show this increase.

Fig. 1. The narrow-band spectrogram shows a large vibrato as a quasi sinusoidal modulation of the harmonics. (From Dr. Speech)

Fig. 2. The pitch and intensity displays show the vibrato cycle with negative and positive movements. (From Dr. Speech)

In Figure 2, a plot of the pitch and intensity displays clearly indicates the effect of vibrato as a means of seeking out formants in order to enhance the musical quality of the sung tone. Features of this plot may be associated with the points where harmonics coincide with formants in the negative- and positive-going sweeps of the vibrato. It can be seen, therefore, that a limited vibrato can maximize the singer’s access to the resonant resources of the vocal tract.
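One simple way to quantify what the figures show is to measure the vibrato rate and extent directly from an extracted F0 contour, as in the sketch below. The contour sample rate and the 3-8 Hz search band are assumed values; this is a generic measure, not the analysis performed by Dr. Speech.

```python
# Minimal sketch of vibrato rate and extent from an F0 contour.
# Contour rate, search band, and the synthetic example are assumptions.
import numpy as np

def vibrato_rate_and_extent(f0_contour, contour_rate_hz, band=(3.0, 8.0)):
    """Return (rate in Hz, extent in semitones peak-to-peak) of the F0 modulation."""
    f0 = np.asarray(f0_contour, dtype=float)
    semitones = 12.0 * np.log2(f0 / np.mean(f0))
    spectrum = np.abs(np.fft.rfft(semitones - semitones.mean()))
    freqs = np.fft.rfftfreq(semitones.size, d=1.0 / contour_rate_hz)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    rate = freqs[in_band][np.argmax(spectrum[in_band])]
    extent = semitones.max() - semitones.min()
    return rate, extent

# Example: 2 s of a 440 Hz tone with 5.5 Hz vibrato of about +/-1 semitone.
contour_rate = 100                               # F0 contour sampled 100 times/s
t = np.arange(2 * contour_rate) / contour_rate
f0 = 440.0 * 2 ** (np.sin(2 * np.pi * 5.5 * t) / 12.0)
print(vibrato_rate_and_extent(f0, contour_rate))   # about (5.5, 2.0)
```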

2.2 Singer’s Formant

In acoustic terms, the advantages gained by the singer from higher-harmonic enhancement are very substantial. The human ear is relatively sensitive in the 1000 Hz to 5000 Hz range. In effect, the singer with poor higher-harmonic support loses out in audibility in the opera house and may attempt to compensate by forcing the voice. The 2500 Hz to 3500 Hz "singer’s formant" lies within a bandwidth that successfully clears the masking potential of the classical orchestra.

The acoustic features of the singer’s formant were investigated with the Dr. Speech software. Figure 3 provides the LPC spectral display. The first formant lies at 420 Hz, the second at 1840 Hz and the third at 2540 Hz. The fourth formant, at around 3250 Hz, assists the third formant in establishing a higher-than-normal higher-harmonic strength in the case of low-pitched opera singers. The clustering of the third and fourth formants is but one explanation for the singer’s higher-harmonic achievements. Figure 4 shows the power spectrum of the sung vowel positioned against the harmonic distribution.

Fig. 3. LPC spectrum estimates formants and the singer’s formant. (From Dr. Speech)

Fig. 4. Power spectrum (LTAS) estimates harmonic distribution. (From Dr. Speech)
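A common, simple way to quantify the strength of the singer's formant from an LTAS-style spectrum is the energy ratio of the 2.5-3.5 kHz band to the band below 2 kHz, sketched below. The band edges and test signal are assumptions; this is a generic measure rather than the specific quantity plotted in Figure 4.

```python
# Minimal sketch of a singer's-formant energy ratio from a power spectrum.
# Band edges and the synthetic test signals are illustrative assumptions.
import numpy as np

def singer_formant_ratio_db(signal, fs):
    """Energy ratio (dB) of the 2500-3500 Hz band to the 0-2000 Hz band."""
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal)))) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    high = spectrum[(freqs >= 2500) & (freqs <= 3500)].sum()
    low = spectrum[(freqs >= 0) & (freqs <= 2000)].sum()
    return 10.0 * np.log10(high / low)

# Example: harmonics of a 220 Hz tone, with and without extra energy near 3 kHz.
fs = 16000
t = np.arange(fs) / fs
harmonics = [220 * k for k in range(1, 20)]
plain = sum(np.sin(2 * np.pi * f * t) / k for k, f in enumerate(harmonics, 1))
ringy = plain + 2.0 * np.sin(2 * np.pi * 2970 * t)
print(singer_formant_ratio_db(plain, fs), singer_formant_ratio_db(ringy, fs))
```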

3. Vocal Training in the Singer’s Studio

The computer industry’s huge investment in sound capabilities is set to diminish the need for purchasers to pay the high cost of dedicated signal-processing hardware. With the affordable Dr. Speech software, vocal training becomes simple.

3.1 Real-time F0 Training

The F0 training feature allows real-time pitch extraction from the acoustic input. A model-matching feature can be used for target modeling. For the singer and voice teacher, this function displays the dynamic range of the human voice in terms of F0. In Figure 5, the F0 pattern of the instructor is stored on the computer screen in blue, and the student can then compare an attempt to match the instructor’s pattern, traced in red.

Fig. 5. In the model-matching mode, real-time pitch training is useful for both vocal training and voice teaching. (From Dr. Speech)

Fig. 6. Real-time plotting of formant is useful to show singer’s formant for both vocal training and voice teaching. (From Dr. Speech)

3.2 Real-time Formant Training

The real-time formant display (also called the LPC spectrum) graphically reveals vowel formants and bandwidths. The singer’s formant can be observed dynamically. This real-time training feature provides a powerful tool for the singer and voice teacher: with it, they can easily assess vocal ability from the computer screen. Figure 6 shows a plot of the real-time formant display. It also provides a clear display of the difference between the vowels /i/ and /a/ (/a/ in blue, /i/ in red).

3.3 Real-time Spectrogram Training

A real-time spectrogram with visual feedback to the singer provides major advantages for acoustic assessment of the singer’s voice. Figure 7 provides a real-time narrowband spectrogram display, illustrating vibrato as a regular harmonic pattern in the formant range. The real-time wideband spectrogram in Figure 8 is characterized by the singer’s formant with a highly periodic fundamental frequency.

Fig. 7. Real-time narrowband spectrogram display is useful to show vibrato. (From Dr. Speech)

Fig. 8. Wideband spectrogram display in real-time is useful to show singer’s formant. (From Dr. Speech)

Acknowledgments

Research on voice disorders has been supervised by Prof. H. Kasuya at Utsunomiya University in Japan. Research on speech acoustics has been supervised by Prof. Fred Minifie at the University of Washington in the USA.

References

Huang, Z., Minifie, F., Kasuya, H., & Lin, X. (1995). Measures of vocal function during change in vocal effort level, Journal of Voice, 9(4): 429-438.

Miller, J.D. (1959). Nature of the vocal cord wave. Journal of the Acoustical Society of America, 31: 667-677.

Minifie, F., Hixon, T. J., & Williams, F. (1973). Normal Aspects of Speech, Hearing, and Language. Prentice-Hall, Inc.

 

 

Relationship Between Acoustic Measures of Voice and Judgments of Voice Quality

Daniel Zaoming Huang, Ph.D.

The goal of the present study was to develop non-invasive techniques for the assessment of voice disorders and for monitoring the effects of voice therapy. However, before acoustic algorithms can be used in clinical applications, it is important to understand the relationship between clinical perceptions of voice quality and the acoustic measures obtained from normal and pathological voice signals. This section focuses on how quantitative acoustic parameters predict qualitative perceptual judgments of voice quality (e.g., perceived degree of breathiness, harshness, and hoarseness).

1. Purposes

Two basic approaches exist to studying the relationship between acoustic parameters and perceptual dimensions: the Analysis-Perception approach and the Synthesis-Perception approach.

This study employed the Synthesis-Perception approach. The first step required in this approach is to develop a voice synthesizer capable of: (1) producing natural sounding vowels, (2) simulating normal voice production, and (3) simulating pathological voice production (Huang, & Hu, 1988; Huang, et al., 1992; 1994; Kasuya, 1989; 1990; Orlikoff, & Huang, 1991). Such a synthesizer has been developed as a part of this study. Included in the voice synthesis system are algorithms permitting F0 variation, period perturbation (jitter), amplitude perturbation (shimmer), simulated magnitudes of glottal noise (NNE), spectral tilt and so on. Each of these parameters can be varied relatively independently. This capability allows the investigator to study the perceptual consequences of varying only one acoustic parameter at a time.

The second step in this research was to compare the acoustic properties of synthesized vowels with judgments of voice quality. Thus, this study was designed to investigate the relationships among selected acoustic measures (F0, jitter, shimmer, glottal noise, spectral tilt, and formant flutter) and voice quality judgments (breathiness, harshness, and hoarseness).

More specifically, this study systematically investigated six aspects of acoustic-perceptual relationships during vowel production.

  1. How judgments of breathiness, harshness, and hoarseness relate to fundamental frequency (F0).
  2. How judgments of breathiness, harshness, and hoarseness relate to the cycle-to-cycle perturbations of fundamental frequency (jitter).
  3. How judgments of breathiness, harshness, and hoarseness are influenced by waveform amplitude perturbations (shimmer).
  4. How judgments of breathiness, harshness, and hoarseness correspond to the richness in spectral harmonics (spectral tilt).
  5. How judgments of breathiness, harshness, and hoarseness correspond to the amount of glottal noise (NNE) included in the vowel signal.
  6. How judgments of breathiness, harshness, and hoarseness relate to the formant flutter (FL).

These six acoustic measures are assumed to reflect a direct relationship to laryngeal function during voice production. The importance of these relationships has led otolaryngologists and voice clinicians to consider using these non-invasive acoustic measures to obtain diagnostically significant information and to monitor the effects of voice therapy. The potential application of these acoustic measures may be enhanced once it is understood how they relate to judgments of voice quality.

2. Development of a voice synthesizer

Since early in 1988, we have been developing a software-based voice synthesizer. The synthesizer is designed on the basis of Klatt’s glottal-source model. Figure 1 presents a block diagram of the voice synthesizer used in these experiments.

Figure 1. Block Diagram of Pathological Voice Production Model

By controlling selected acoustic and/or physiologic parameters in this special synthesizer, it is possible to: (1) simulate normal voice production, (2) simulate some types of pathological voice production, and (3) investigate which of these parameters are appropriate for use in the analysis of the pathological voices. This last issue requires comparisons between acoustic variations in the synthesized vocalizations and voice quality judgments.

With this voice synthesizer, each vowel can be synthesized with a specified amount of each acoustic parameter (for example, F0, jitter, shimmer, glottal noise, spectral tilt, formant flutter). Therefore, it is possible to specify a desired combination of acoustic parameters and the magnitude of each parameter. The control parameters are grouped into three parts: glottal source, pitch contour control, and formant frequencies. The control parameters of the voice synthesizer are displayed in the Figure 2.

Figure 2. Control Parameters

The terms identified below are defined at length in Appendix A.

In the pitch contour part of the voice synthesizer, there are six ways to control the pitch contour: (1) linear flat (no change in F0; just specify the desired F0); (2) linear increase in F0 (specify the beginning and ending F0); (3) linear decrease in F0 (specify the beginning and ending F0); (4) linear broken F0, which allows the investigator to design various F0 contours by specifying the onset F0, the minimum F0, the maximum F0, the rise time (RT) in ms to go from the onset F0 to the maximum F0, and the fall time (FT) in ms to go from the maximum F0 to the minimum F0 (the difference between RT + FT and the maximum data length in ms is the amount of time the voice signal remains at the maximum F0); (5) archetypal F0, which provides a typical rise-fall contour that roughly follows changes in subglottal pressure; and (6) a choice of one of the four tonal patterns used in Mandarin Chinese (with a built-in dynamic F0 contour): flat, rise, fall-rise, fall. In each of these tonal patterns a minimum and maximum F0 can be specified for each vowel being simulated.

In the glottal source section of the voice synthesizer, the experimenter can specify the sampling frequency, maximum data length (total duration of the synthesized vowel), pitch flutter for simulating jitter, amplitude of voicing (AV), open quotient (OQ), spectral tilt (ST), amplitude flutter (FA) for simulating shimmer, voice rising time (VRT), voice falling time (VFT), maximum of voltage (MOV), low frequency gain (LFG), high frequency gain (HFG), high frequency gain begin (HFB) and high frequency gain end (HFE).

In the formant frequency part of the synthesizer, there are controls for the formant frequencies (F1, F2, F3, F4, F5), the formant bandwidths (B1, B2, B3, B4, B5), the number of formants, and formant flutter. This easy-to-use voice synthesizer provides a flexible tool for generating realistic-sounding vowels.
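To make the source-filter structure of Figure 1 concrete, the following sketch generates a glottal pulse train with controllable jitter, shimmer, and added glottal noise and passes it through a cascade of second-order formant resonators. It is a highly simplified stand-in for the Klatt-style synthesizer described above: the pulse shape, open quotient, and perturbation scaling are assumptions, not the actual control parameters listed in Figure 2.

```python
# Minimal sketch of a source-filter vowel synthesizer with jitter, shimmer,
# and glottal noise. A simplified illustration, not the study's synthesizer.
import numpy as np
from scipy.signal import lfilter

def glottal_source(fs, dur_s, f0, jitter_pct, shimmer_pct, noise_level, rng):
    """Raised-cosine pulses with per-cycle period/amplitude perturbation plus noise."""
    out = np.zeros(int(fs * dur_s))
    pos = 0
    while pos < out.size:
        period = (fs / f0) * (1.0 + (jitter_pct / 100.0) * rng.standard_normal())
        amp = 1.0 + (shimmer_pct / 100.0) * rng.standard_normal()
        n_open = int(0.6 * period)                 # assumed open quotient of 0.6
        pulse = amp * 0.5 * (1 - np.cos(np.pi * np.arange(n_open) / n_open))
        end = min(pos + n_open, out.size)
        out[pos:end] += pulse[: end - pos]
        pos += max(int(period), 1)
    return out + noise_level * rng.standard_normal(out.size)

def formant_filter(x, fs, formants, bandwidths):
    """Cascade of second-order resonators at the given formant frequencies."""
    for f, bw in zip(formants, bandwidths):
        r = np.exp(-np.pi * bw / fs)
        a = [1.0, -2.0 * r * np.cos(2 * np.pi * f / fs), r * r]
        x = lfilter([1.0 - r], a, x)
    return x

rng = np.random.default_rng(0)
fs = 44100
source = glottal_source(fs, 0.5, f0=125, jitter_pct=0.8, shimmer_pct=1.9,
                        noise_level=0.02, rng=rng)
vowel = formant_filter(source, fs, formants=[660, 1720, 2410, 3500, 4400],
                       bandwidths=[75, 75, 110, 120, 120])
vowel /= np.max(np.abs(vowel))                     # normalize before saving or playback
print(vowel.shape)
```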

3. Synthesis-Perceptual approach: Experiment

The purpose of the Experiment was to generate a series of synthetic vowels where only one acoustic parameter at a time would be changed, and then evaluate the effects of these changes on judgments of breathiness, harshness, and hoarseness.

1) Stimuli

Thirty-six tokens of sustained /ae/ (as in "bat") were synthesized for this experiment. The five formants selected for /ae/ were 660, 1720, 2410, 3500 and 4400 Hz (Peterson & Barney, 1952). The five bandwidths were 75, 75, 110, 120 and 120 Hz, respectively. The 500-ms vowel stimuli were synthesized with equal overall RMS sound pressure levels of 60 dB (with a 40-ms voice amplitude rise time and a 40-ms voice amplitude fall time). The stimuli were synthesized at a 44,100 Hz sampling frequency with 16-bit resolution.

Six groups of stimuli were synthesized for use in this study (refer to Table 1 for the values of F0, jitter, shimmer, NNE, spectral tilt, and formant flutter created for the experiment). The first group had six synthesized samples of /ae/, each having a different fundamental frequency (F0); the F0 ranged from 100 Hz to 150 Hz in steps of 10 Hz. The second group had eight synthesized samples of /ae/, each having a different magnitude of F0 jitter. The third group had eight synthesized samples of /ae/ with a maximum amplitude of 10,000 points; each token was produced at a different shimmer level. The fourth group had eight synthesized samples of /ae/ with a 60 dB amplitude of voicing; each token was produced with a different glottal noise energy level. The fifth group of stimuli included three synthesized samples of /ae/, each having a different spectral tilt ranging from 0 to 6 dB with a step size of 3 dB. The sixth group of stimuli had three synthetic vowel samples, each having a different value of formant flutter (ranging from 5 to 15% with a step size of 5%). Thus, a total of 36 synthetic vowels were generated with the Dr. Speech Science for Windows software (Tiger Electronics, Inc.). These stimuli were stored on the computer’s hard disk and on high-quality audio tape (CrO2).

Table 1 : Acceptable levels for six control parameters in voice synthesis

Level    Group 1    Group 2       Group 3        Group 4     Group 5               Group 6
         F0 (Hz)    Jitter (%)    Shimmer (%)    NNE (dB)    Spectral Tilt (dB)    Formant Flutter (%)
1        100        0.00          0.51           -23.63      0                     5
2        110        0.35          0.77           -22.25      3                     10
3        120        0.51          1.36           -20.25      6                     15
4        130        0.80          1.92           -17.35
5        140        1.02          2.67           -13.96
6        150        1.35          3.28           -10.64
7                   1.79          3.93           -7.33
8                   2.06          4.84           -4.82

2) Presentation of Stimuli

The 36 vowel tokens were randomly arranged for presentation to listeners. Each vowel token was played three times, with an interstimulus interval of 1 second. There were 3.0 seconds of silence between each block of three repeated vowel presentations, to allow listeners sufficient time to rate the voice quality of the stimulus. The order of presentation of the stimuli is schematized below.

......
Silent Duration 3 s
Vowel (i), silence 1 s; Vowel (i), silence 1 s; Vowel (i)
Silent Duration 3 s
Vowel (j), silence 1 s; Vowel (j), silence 1 s; Vowel (j)
Silent Duration 3 s
Vowel (k), silence 1 s; Vowel (k), silence 1 s; Vowel (k)
Silent Duration 3 s
......
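A minimal sketch of assembling such a presentation schedule is shown below: the 36 tokens are shuffled, each is repeated three times with 1-second gaps, and blocks are separated by 3-second silences. The file names are hypothetical.

```python
# Minimal sketch of the randomized presentation schedule described above.
# Token file names are hypothetical placeholders.
import random

def presentation_schedule(n_tokens=36, reps=3, gap_s=1.0, block_silence_s=3.0, seed=1):
    order = list(range(n_tokens))
    random.Random(seed).shuffle(order)
    schedule = []
    for token in order:
        schedule.append(("silence", block_silence_s))
        for r in range(reps):
            schedule.append((f"vowel_{token:02d}.wav", None))
            if r < reps - 1:
                schedule.append(("silence", gap_s))
    return schedule

for event in presentation_schedule()[:7]:
    print(event)
```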

3) Listeners and perceptual ratings

Eight laryngologists served as listeners for this experiment. Three were from the Shanghai University of Technology, four were from the Shanghai ENT Hospital, and one was from a private clinic in New York City. These eight listeners (45-55 years old) were well-trained laryngologists, each with extensive clinical experience with voice disorders in a hospital setting. It was reasoned that these clinical laryngologists would be well qualified to judge the voice qualities of breathiness, harshness and hoarseness.

The seven laryngologists in Shanghai performed the listening tasks in a quiet room. The stimuli were presented through two audio speakers at a comfortable loudness level. The listeners were required to judge only one voice quality at a time: first breathy voice quality, then harsh voice quality, and finally hoarse voice quality. All listeners participated in a brief training period prior to the beginning of the experiment. The same rating form was used to evaluate all three of the perceptual dimensions: hoarse, harsh, and breathy voice quality. The rating form used a four-point, equal-appearing-intervals scale to rate each voice quality (Hirano, 1981; Kasuya, 1986). The values on the scale were: "0" normal, "1" slight, "2" moderate, and "3" extreme. After the listeners heard each block of three samples directly from the computer, they were required to check an appropriate answer on the rating form (see Appendix B for a copy of the rating form). The laryngologist in New York performed the same procedure from the audio tape recording.

4) Results

The results of this experiment are shown in the following figures and tables. The values reported are the mean ratings for breathiness, harshness, and hoarseness from the eight listeners.

a. Relationship between fundamental frequency and vocal quality

The first group of vowel stimuli consisted of six vowel samples with different fundamental frequencies ranging from 100 Hz to 150 Hz in 10-Hz steps, as shown in Table 2, while the other parameters were held constant at normal levels (jitter = 0.3 %, shimmer = 1 %, glottal noise energy = 50 dB, amplitude of voicing = 60 dB, spectral tilt = 0 dB, and formant flutter = 0 %). The mean hoarseness, harshness and breathiness ratings from the eight listeners across the six fundamental frequency (F0) levels are shown in Table 2.

Table 2. Means and Standard Deviations of voice quality ratings at different F0 levels

F0 Level      Hoarseness Mean (SD)    Harshness Mean (SD)    Breathiness Mean (SD)
1 (100 Hz)    0.50 (0.50)             0.88 (0.33)            0.00 (0.00)
2 (110 Hz)    0.50 (0.50)             0.63 (0.48)            0.25 (0.43)
3 (120 Hz)    0.38 (0.48)             0.50 (0.50)            0.38 (0.48)
4 (130 Hz)    0.25 (0.43)             0.50 (0.50)            0.38 (0.48)
5 (140 Hz)    0.25 (0.43)             0.38 (0.48)            0.50 (0.50)
6 (150 Hz)    0.13 (0.33)             0.13 (0.33)            0.75 (0.43)

A one-way analysis of variance (ANOVA) was used to determine if the voice quality ratings differed significantly across the six fundamental frequency (F0) levels. For hoarse voice quality, no significant differences existed among the six F0 levels (p<=0.05), and a Tukey post hoc test of all pairwise differences among the means indicated no significant differences. For harsh voice quality, no significant main effect existed among the six F0 levels (p<=0.05), but a Tukey post hoc test of all pairwise differences among the means indicated a significant difference between the mean ratings for level 1 (F0 = 100 Hz) and level 6 (F0 = 150 Hz). For breathy voice quality, no significant differences existed for six levels of F0 (p<=0.05), but a Tukey post hoc test of all pairwise differences in means indicated a significant difference between level 1 (F0 = 100 Hz) and level 6 (F0 = 150 Hz).
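The same one-way ANOVA plus Tukey post hoc procedure is applied throughout the remaining subsections; the sketch below shows how it can be reproduced with standard Python statistics libraries. The rating arrays are random placeholders, since only group means and standard deviations are reported in the tables.

```python
# Minimal sketch of a one-way ANOVA followed by Tukey HSD pairwise comparisons.
# The rating data below are random placeholders, not the study's raw ratings.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
levels = ["level1", "level2", "level3"]
ratings = {name: rng.integers(0, 4, size=8).astype(float) for name in levels}  # 8 listeners

f_stat, p_value = f_oneway(*ratings.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.3f}")

scores = np.concatenate(list(ratings.values()))
groups = np.repeat(levels, 8)
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))
```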

Bar displays of the breathy, harsh, and hoarse ratings at the different fundamental frequency levels are shown in Figure 3.

Figure 3. Hoarseness, harshness and breathiness ratings

at different fundamental frequency levels for synthetic /ae/ tokens.

A series of Pearson product-moment correlations was computed to show the strength and direction of the relationship between the manipulated F0 and the hoarseness, harshness and breathiness judgments. A correlation matrix showing these interrelationships is given in Table 3.

Table 3. Pearson Product Moment correlations among voice quality judgments and F0

 

      Hoarseness    Harshness    Breathiness
F0    -0.285        -0.439       0.454

The trends obtained from the statistical analyses of the effects of changes in fundamental frequency on voice quality judgments are:

  1. No significant changes in perceived hoarseness were apparent as a function of F0 changes.
  2. A weak trend was apparent for low pitched vowels to be perceived as more harsh than high pitched vowels.
  3. A weak trend was apparent for high pitched vowels to be perceived as more breathy than low pitched vowels.

b. Relationship between pitch period perturbation (jitter) and vocal quality

This group of vowel stimuli consisted of eight samples, each having a different period perturbation (jitter) as shown in Table 4, while the other parameters were maintained at normal levels (F0 = 125 Hz, shimmer = 1 %, glottal noise energy = 50 dB, amplitude of voicing = 60 dB, spectral tilt = 0 dB, and formant flutter = 0 %). The mean hoarseness, harshness and breathiness ratings from the eight listeners across the different jitter levels are shown in Table 4.

Table 4. Means and Standard Deviations of voice quality ratings at different jitter levels

Jitter Level    Hoarseness Mean (SD)    Harshness Mean (SD)    Breathiness Mean (SD)
1 (0.00 %)      0.25 (0.43)             0.25 (0.43)            0.13 (0.33)
2 (0.35 %)      0.50 (0.50)             0.63 (0.48)            0.25 (0.43)
3 (0.51 %)      1.13 (0.33)             1.25 (0.43)            0.25 (0.43)
4 (0.80 %)      1.38 (0.48)             1.50 (0.50)            0.25 (0.43)
5 (1.02 %)      2.00 (0.50)             2.13 (0.60)            0.38 (0.48)
6 (1.35 %)      2.38 (0.48)             2.25 (0.43)            0.38 (0.48)
7 (1.79 %)      2.63 (0.48)             2.75 (0.43)            0.50 (0.50)
8 (2.06 %)      2.88 (0.33)             3.00 (0.00)            0.50 (0.50)

A one-way ANOVA was used to test whether the voice quality ratings differed significantly across the eight synthesized jitter levels. For hoarse voice quality, significant differences were present among the eight jitter levels (p<=0.05), and a Tukey post hoc test of all pairwise differences in means indicated significant differences among most pairs. Significant differences in harshness were also present among the eight jitter levels (p<=0.05), and a Tukey post hoc test of all pairwise differences in means indicated significant differences among most pairs. However, for breathy voice quality, no significant differences were present among the eight jitter levels (p<=0.05), and a Tukey post hoc test of all pairwise differences in means indicated no significant differences.

Bar displays of the hoarse, harsh, and breathy ratings at the different jitter levels are shown in Figure 4.

Figure 4. Hoarseness, harshness and breathiness ratings

at different jitter levels for synthetic /ae/ tokens.

A series of Pearson Product Moment correlations were computed to show the strength and direction of the relationship among the manipulated jitter and the hoarseness, harshness and breathiness judgments. A correlation matrix showing the interrelationships among jitter and hoarseness, harshness and breathiness judgments is shown in Table 5.

Table 5. Pearson Product Moment correlations among voice quality judgments and jitter

 

          Hoarseness    Harshness    Breathiness
Jitter    0.892         0.894        0.254

The conclusions that may be drawn from the analyses reported in this section are:

  1. The magnitude of period perturbation (jitter) was directly related to the perceived magnitude of both hoarseness and harshness.
  2. No significant relationship existed between the perception of breathy voice quality and jitter levels.

c. Relationship between amplitude perturbation (shimmer) and vocal quality

This group of /ae/ vowel stimuli consisted of eight samples, each having a different amplitude perturbation (shimmer), as shown in Table 6, while the other parameters were maintained at normal levels (F0 = 125 Hz, jitter = 0.3 %, glottal noise energy = 50 dB, amplitude of voicing = 60 dB, spectral tilt = 0 dB, and formant flutter = 0 %). The mean hoarseness, harshness and breathiness ratings from the eight listeners across the different shimmer levels are shown in Table 6.

Table 6. Means and Standard Deviations of voice quality ratings at different shimmer levels

Shimmer Level    Hoarseness Mean (SD)    Harshness Mean (SD)    Breathiness Mean (SD)
1 (0.51 %)       0.13 (0.33)             0.13 (0.33)            0.13 (0.33)
2 (0.77 %)       0.25 (0.43)             0.13 (0.33)            0.13 (0.33)
3 (1.36 %)       0.88 (0.33)             0.50 (0.50)            0.25 (0.43)
4 (1.92 %)       1.75 (0.66)             0.50 (0.50)            0.25 (0.43)
5 (2.67 %)       2.13 (0.60)             0.63 (0.48)            0.38 (0.48)
6 (3.28 %)       2.75 (0.43)             0.63 (0.48)            0.63 (0.48)
7 (3.93 %)       2.63 (0.48)             0.63 (0.48)            0.50 (0.50)
8 (4.84 %)       3.00 (0.00)             1.13 (0.60)            0.63 (0.48)

A one-way ANOVA was used to determine whether the voice quality ratings differed significantly across the eight synthesized shimmer levels. For hoarse voice quality, significant perceptual differences were present among the eight levels (p<=0.05), and a Tukey post hoc test of all pairwise differences in means indicated significant differences among most pairs. Harsh voice quality was significantly affected by the eight shimmer levels (p<=0.05), and a Tukey post hoc test of all pairwise differences in means indicated significant differences between level 1 (shimmer = 0.51 %) and level 7 (shimmer = 3.93 %), and between level 1 (shimmer = 0.51 %) and level 8 (shimmer = 4.84 %). For breathy voice quality, no significant differences were apparent among the eight shimmer levels (p<=0.05), and a Tukey post hoc test of all pairwise differences in means indicated no significant differences.

Bar displays of the hoarseness, harshness and breathiness ratings at the different shimmer levels are shown in Figure 5.

Figure 5. Hoarseness, harshness and breathiness ratings

at different shimmer levels for synthetic /ae/ tokens.

A series of Pearson Product Moment correlations were computed to show the strength and direction of the relationship among the manipulated shimmer and the hoarseness, harshness and breathiness judgments. A correlation matrix showing the interrelationships among shimmer and hoarseness, harshness and breathiness judgments is shown in Table 7.

Table 7. Pearson Product Moment correlations among voice quality judgments and shimmer

 

           Hoarseness    Harshness    Breathiness
Shimmer    0.895         0.489        0.377

The major findings from the analysis in this section are summarized as follows:

  1. The magnitude of amplitude perturbation (shimmer) in a synthesized vowel appears to be highly related to perceived hoarseness, but not to perceived breathiness.
  2. Very high shimmer levels appear to sound more harsh than very low shimmer levels for synthetic /ae/ tokens.

d. Relationship between glottal noise energy (NNE) and vocal quality

This group of synthetic /ae/ stimuli consisted of eight samples, each having a different level of glottal noise energy (NNE), as shown in Table 8, while the other acoustic parameters were maintained at normal levels (F0 = 125 Hz, jitter = 0.3 %, shimmer = 1 %, spectral tilt = 0 dB, and formant flutter = 0 %). The mean hoarseness, harshness and breathiness ratings of the eight listeners for the different NNE levels are shown in Table 8.

Table 8. Means and Standard Deviations of voice quality ratings at different NNE levels

NNE Level        Hoarseness Mean (SD)    Harshness Mean (SD)    Breathiness Mean (SD)
1 (-23.63 dB)    0.13 (0.33)             0.00 (0.00)            0.00 (0.00)
2 (-22.25 dB)    0.25 (0.43)             0.25 (0.43)            0.25 (0.43)
3 (-20.25 dB)    0.75 (0.43)             0.13 (0.33)            1.13 (0.60)
4 (-17.35 dB)    1.38 (0.48)             0.38 (0.48)            1.63 (0.48)
5 (-13.96 dB)    1.88 (0.33)             0.25 (0.43)            2.38 (0.48)
6 (-10.64 dB)    2.25 (0.43)             0.50 (0.50)            2.38 (0.48)
7 (-7.33 dB)     2.50 (0.50)             0.75 (0.43)            2.75 (0.43)
8 (-4.82 dB)     2.88 (0.33)             0.75 (0.43)            3.00 (0.00)

A one-way ANOVA was used to determine whether the hoarseness, harshness and breathiness ratings differed significantly across the eight NNE levels. Hoarseness and breathiness ratings were significantly different among the eight NNE levels (p<=0.05), and a Tukey post hoc test of all pairwise differences in means indicated significant differences among most pairs. Harshness ratings were also significantly different among the eight NNE levels (p<=0.05), and a Tukey post hoc test of all pairwise differences in means indicated significant differences between level 1 (NNE = -23.63 dB) and level 7 (NNE = -7.33 dB), and between level 1 (NNE = -23.63 dB) and level 8 (NNE = -4.82 dB).

Bar displays of the hoarseness, harshness and breathiness ratings at the different NNE levels are shown in Figure 6.

Figure 6. Hoarseness, harshness and breathiness ratings

at different NNE levels for synthetic /ae/ tokens.

A series of Pearson Product Moment correlations were computed to show the strength and direction of the relationship among the manipulated NNE and the hoarseness, harshness and breathiness judgments. A correlation matrix showing the interrelationships among NNE and hoarseness, harshness and breathiness judgments is shown in Table 9.

Table 9. Pearson Product Moment correlations among voice quality judgments and NNE

 

       Hoarseness    Harshness    Breathiness
NNE    0.913         0.493        0.906

The findings from the analyses in this section are summarized as follows:

  1. Glottal noise energy (NNE) significantly influences the perception of breathy and hoarse voice qualities.
  2. Very high NNE levels appear to sound more harsh than very low NNE levels for synthetic /ae/ tokens.

e. Relationship between spectral tilt (ST) and vocal quality

This group of synthetic /ae/ stimuli consisted of three samples, each having a different spectral tilt (ST) level, ranging from 0 to 6 dB with a step size of 3 dB. Other acoustic parameters were maintained at normal levels (F0 = 125 Hz, jitter = 0.3 %, shimmer = 1 %, glottal noise = 50 dB, amplitude of voicing = 60 dB, and formant flutter = 0 %). The mean hoarseness, harshness and breathiness ratings from eight listeners at three spectral tilt levels are shown in Table 10.

Table 10. Means and Standard Deviations of voice quality ratings at different ST levels

Spectral Tilt Level    Hoarseness Mean (SD)    Harshness Mean (SD)    Breathiness Mean (SD)
1 (0 dB)               0.25 (0.43)             0.13 (0.33)            0.38 (0.48)
2 (3 dB)               1.00 (0.00)             0.38 (0.48)            1.63 (0.48)
3 (6 dB)               2.00 (0.50)             0.63 (0.48)            2.38 (0.48)

A one-way ANOVA was used to test whether voice quality ratings differed significantly across simulated spectral tilt levels. For hoarse and breathy voice quality, significant differences were apparent among three spectral tilt levels (p<=0.05), and a Tukey post hoc test of all pairwise differences in means indicated significant differences among all pairs. For harsh voice quality, no significant differences existed among three spectral tilt levels (p<=0.05), and a Tukey post hoc test of all pairwise differences in means indicated no significant difference.

Bar displays of the hoarseness, harshness and breathiness ratings at the different spectral tilt levels are shown in Figure 7.

Figure 7. Hoarseness, harshness and breathiness ratings
at different spectral tilt levels for synthetic /ae/ tokens.

A series of Pearson Product Moment correlations were computed to show the strength and direction of the relationship among the manipulated spectral tilt and the hoarseness, harshness and breathiness judgments. A correlation matrix showing the interrelationships among spectral tilt and hoarseness, harshness and breathiness judgments is shown in Table 11.

Table 11. Pearson Product Moment correlations among voice quality judgments and ST

 

                 Hoarseness    Harshness    Breathiness
Spectral Tilt    0.880         0.422        0.854

The major findings in this section are:

1. Hoarse and breathy voice qualities increased in direct proportion to spectral tilt magnitude.
2. No relationship existed between spectral tilt and harsh voice quality.

f. Relationship between formant flutter (FF) and vocal quality

This group of vowel stimuli consisted of three samples, each having a different formant flutter (FF) level, ranging from 5 to 15 % with a step size of 5 %, while the other acoustic parameters were maintained at normal levels (F0 = 125 Hz, jitter = 0.3 %, shimmer = 1 %, glottal noise = 50 dB, amplitude of voicing = 60 dB, and spectral tilt = 0 dB). The mean hoarseness, harshness and breathiness ratings from the eight listeners at the different formant flutter levels are shown in Table 12.

Table 12. Means and Standard Deviations of voice quality ratings at different FF levels

Formant Flutter Level    Hoarseness Mean (SD)    Harshness Mean (SD)    Breathiness Mean (SD)
1 (5 %)                  0.38 (0.48)             0.50 (0.50)            0.25 (0.43)
2 (10 %)                 0.50 (0.50)             0.75 (0.66)            0.50 (0.50)
3 (15 %)                 0.88 (0.78)             1.13 (0.60)            0.38 (0.48)

A one-way ANOVA was used to test whether the perceptual ratings of the voice qualities differed significantly across the three formant flutter levels. For the hoarse, harsh and breathy voice quality ratings, no significant differences existed across the three formant flutter levels (p<=0.05), and a Tukey post hoc test of all pairwise differences in means indicated no significant differences. Scatter plots of hoarseness, harshness and breathiness at the different formant flutter levels are shown in Figure 8.

A series of Pearson product-moment correlations was computed to show the strength and direction of the relationship between the manipulated formant flutter and the hoarseness, harshness and breathiness judgments. A correlation matrix showing these interrelationships is given in Table 13.

Table 13. Pearson Product Moment correlations among voice quality judgments and FF

 

                   Hoarseness    Harshness    Breathiness
Formant Flutter    0.319         0.396        0.105

The major findings in this section may be summarized as follows:

Formant flutter does not appear to significantly influence the perception of hoarse, harsh, or breathy voice quality during production of these simulated /ae/ vowels.

g. Correlation between acoustic parameters and perceptual judgments

A series of Pearson Product Moment correlations were computed to show the strength and direction of the relationship between each of the acoustical parameters manipulated in this experiment and perceptual judgments of hoarseness, harshness and breathiness. A correlation matrix showing the interrelationships among all of the acoustical parameters (F0, jitter, shimmer, NNE, spectral tilt, and formant flutter) and perceptual judgments (hoarseness, harshness and breathiness) is shown in Table 14.

Table 14. Pearson Product Moment correlations among all of the acoustic parameters measured and perceptual judgments

 

                   Hoarseness    Harshness    Breathiness
F0                 -0.285        -0.439       0.454
Jitter             0.892         0.894        0.254
Shimmer            0.895         0.489        0.377
NNE                0.913         0.493        0.906
Spectral Tilt      0.880         0.422        0.854
Formant Flutter    0.319         0.396        0.105

4. Conclusion

Regarding the interpretation of acoustic parameters, our tentative conclusions from this study are:

  1. Vocal jitter appears to be related primarily to harsh voice quality, and secondarily to hoarse voice quality. This conclusion is similar to the findings of Minifie, Huang and Green (1994).
  2. The magnitude of glottal noise energy closely corresponds to the breathy voice quality, and thus also to the perception of hoarse voice quality which is presumed to be a combination of breathiness and harshness. This conclusion is similar to the findings of Minifie, Huang and Green (1994).
  3. As indicated above, the acoustic features of jitter and glottal noise energy appear to interact in influencing perceived hoarseness. Therefore, the hoarseness perception is correctly considered as some combination of harsh and breathy voice quality. This finding directly supports the view of Fairbanks (1960). Consequently, it was not surprising to find that both jitter and glottal noise energy were correlated with hoarseness.
  4. Shimmer appears to be the primary influence on hoarse voice quality.
  5. The spectral tilt of the glottal source is significantly related to the perceived breathiness. Because of the dependence of hoarseness judgments on the presence of glottal noise, it follows that spectral tilt would also be related to the perception of hoarse voice quality, as indicated by present results.
  6. Formant frequency perturbations do not appear to be related to perceptual judgments of hoarse, harsh or breathy voice quality.

 


  Tiger DRS Inc. 1998