Voice Lab in Clinical Practice
1. Introduction
A computer equipped with clinical software provides valuable assistance during the assessment and treatment of voice disorders. Demographic and administrative information, such as names, addresses, number of visits, types of disorders, progress reports, and insurance claims, is easy to display. When such clinical software is installed on a laptop computer, it gives clinicians a "portable clinical voice laboratory" equipped with powerful clinical tools that can easily be carried from one treatment location to another. Today's practitioners can benefit from clinical software that has been adapted to the needs of clinicians.
Acoustic analysis provides a quantitative assessment of voice quality and vocal function. Electroglottographic (EGG) measurement gives non-invasive, objective information on the contact behavior of the vocal folds during vibration. Acoustic and EGG measures should be used in the routine clinical examination, as they substantially complement endoscopic examination. All three kinds of examination can be performed with the "Dr. Speech" and "ScopeView" software from Tiger DRS, Inc., helping laryngologists and voice pathologists make a correct diagnosis of voice disorders and monitor progress in voice therapy.
2. Methods and Procedures
This new technology is being introduced to help perform important diagnostic and therapeutic procedures for laryngeal disorders by applying multimedia technology to endoscopy. It enables outpatient procedures to gather a wide array of information for both laryngologists and voice pathologists, making it easier for them to work together. This paper introduces a powerful tool designed to enhance clinical observation of vocal fold vibration while simultaneously showing the electroglottographic and acoustic waveforms, as shown in Figure 1.

2.1 Objective Measures of Acoustic and EGG
This technology provides measures of F0, intensity, jitter, shimmer, glottal noise, ratio, tremor, statistics, and spectrum from voice signals, as well as measures of EGG-pitch, EGG-intensity, EGG-jitter, EGG-shimmer, EGG-noise, contact quotient (CQ), CQ perturbation, CI, CI perturbation, opening rate, and closing rate from EGG signals. Figure 2 is a real-time acoustic and EGG display with power spectrum and contact quotient analyses. A number of parameters are available immediately after recording. These parameters can be of considerable use in arriving at a diagnosis, in evaluating the effects of surgery, or in tracking progress during voice therapy.
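As a rough illustration of two of the perturbation measures listed above, the following sketch computes jitter and shimmer from cycle-by-cycle periods and peak amplitudes. It assumes the common textbook definition (mean absolute difference between consecutive cycles, expressed as a percentage of the mean); the formulas actually used by the software are not stated in this paper, and the function names and sample values are hypothetical.

```python
def relative_perturbation(values):
    """Mean absolute cycle-to-cycle difference, as a percentage of the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return 100.0 * mean_diff / mean_value


def jitter_percent(periods_ms):
    """Period perturbation (assumed definition, not taken from the software)."""
    return relative_perturbation(periods_ms)


def shimmer_percent(peak_amplitudes):
    """Amplitude perturbation (assumed definition, not taken from the software)."""
    return relative_perturbation(peak_amplitudes)


# Hypothetical cycle measurements, for illustration only
periods_ms = [8.00, 8.10, 7.95, 8.05, 8.00]
peak_amplitudes = [1.00, 0.96, 1.02, 0.98, 1.01]
print(f"jitter  = {jitter_percent(periods_ms):.2f} %")
print(f"shimmer = {shimmer_percent(peak_amplitudes):.2f} %")
```

The same helper serves both measures because, under this definition, jitter and shimmer differ only in the quantity that is tracked from cycle to cycle.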

2.2 Objective Observation of Vocal Fold Vibration
Glottal closure patterns, the mucosal wave, tissue elasticity, and other irregularities can be observed easily in real time with the help of this technology. Our clinical experience in using it to observe single and multiple consecutive images of vocal fold vibration indicates that it is easy to inspect the physiological and pathological status of the larynx and to monitor the effect of voice therapy. Digital processing, such as editing, enlarging, noise removal, smoothing, and sharpening, is applied to enhance image clarity so that clinicians can view and analyze the patient's larynx carefully. Figure 3 shows the multiple-frame viewing feature with glottal area detection (shown in yellow).

3. Results
3.1 Voice Disorders Database and Clinical Implication of Acoustic and EGG Parameters
Figure 4 provides an example of a voice and EGG profile for a papilloma patient after surgery (male, age 42). The top window shows ten voice and EGG parameters from the patient, together with the normal range of each parameter. The bottom window shows the estimated vocal function (regularity of vocal fold vibration and glottal closure time) and the estimated voice quality (harsh, breathy, and hoarse). Figure 5 shows F0, intensity, and the spectrogram for the acoustic signal, and vocal fold contact behavior from the EGG signal. The clinical implications of these parameters are as follows: (1) hoarseness can be considered a combination of breathiness and harshness; (2) jitter appears to be related primarily to harsh voice quality; (3) shimmer appears to be the primary influence on hoarse voice quality; (4) the magnitude of glottal noise energy (NNE) corresponds closely to breathy voice quality; (5) EGG measures are chiefly related to laryngeal behavior during vocal fold contact; (6) CQ reveals the degree of glottal closure, and CI gives information about the symmetry of vocal fold vibration; and (7) CQP and CIP reveal the regularity of vocal fold vibration.

Sustained phonations from a large number of normal speakers (2,937) and patients with voice disorders (902) were recorded and stored digitally to create the Voice Disorders Database. Figure 6 compares the normal group with the recurrent laryngeal nerve (RLN) paralysis group by NNE. Figure 7 compares the normal group with the glottic cancer group by jitter and NNE.

Table 1. Discrimination Between Normal and Pathological Voices
Laryngeal Pathology | Judged normal | Judged pathological | Total
Normal | 2685 | 252 | 2937
Glottic cancer T1 | 13 | 49 | 62
Glottic cancer T2 | 1 | 23 | 24
Glottic cancer T3-4 | 0 | 18 | 18
Vocal fold polyp | 26 | 52 | 78
Vocal fold polypoid | 21 | 55 | 76
RLN paralysis | 5 | 30 | 35
To evaluate how well acoustic and EGG measures distinguish normal from pathological voices, the experiment summarized in Table 1 was carried out. A combined value was obtained from several acoustic and EGG parameters. A voice sample was regarded as normal if the combined value was smaller than a threshold, and as pathological if it was larger than or equal to the threshold.
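The counts in Table 1 can be turned into detection rates. The sketch below assumes that the two unlabeled count columns are the samples judged normal and the samples judged pathological, in that order; this reading is inferred from the threshold rule and the row totals, not stated in the table itself. The `classify` helper only mirrors the threshold rule described above; how the combined value is computed is not given, so it is left as an input.

```python
# (judged normal, judged pathological) per group, copied from Table 1.
# The meaning of the two columns is an assumption (see the text above).
counts = {
    "Normal": (2685, 252),
    "Glottic cancer T1": (13, 49),
    "Glottic cancer T2": (1, 23),
    "Glottic cancer T3-4": (0, 18),
    "Vocal fold polyp": (26, 52),
    "Vocal fold polypoid": (21, 55),
    "RLN paralysis": (5, 30),
}


def classify(combined_value, threshold):
    """Threshold rule from the text: below the threshold is normal."""
    return "normal" if combined_value < threshold else "pathological"


def detection_rate(group):
    """Fraction of a pathological group that was judged pathological."""
    judged_normal, judged_pathological = counts[group]
    return judged_pathological / (judged_normal + judged_pathological)


# Specificity: fraction of normal voices judged normal
normal_ok, normal_missed = counts["Normal"]
specificity = normal_ok / (normal_ok + normal_missed)

# Sensitivity pooled over the pathological groups listed in the table
pathological = [v for k, v in counts.items() if k != "Normal"]
sensitivity = sum(p for _, p in pathological) / sum(n + p for n, p in pathological)

print(f"specificity = {specificity:.3f}")
print(f"pooled sensitivity = {sensitivity:.3f}")
for group in counts:
    if group != "Normal":
        print(f"{group}: {detection_rate(group):.2f}")
```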
3.2 Quantitative Information from Video Images with Acoustic and EGG Signals
The glottal area profile over four complete cycles of vocal fold vibration is shown in Figure 8. Some important parameters, such as the open quotient (OQ), can be obtained from this profile. With practice, perceptual judgment of video images provides a great deal of information, but quantitative information of this kind is needed to support conclusions, because vibration depends on F0, intensity, and vocal register. F0 rises with increasing vocal fold tension or stiffness and falls as vocal fold mass increases.
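One simple way to obtain an open quotient from a glottal area profile is to take the fraction of frames in which the glottis is open. The sketch below assumes that the profile covers a whole number of vibration cycles and that "open" means the area exceeds a small threshold; the function name, the threshold, and the sample values are hypothetical, and the software may compute OQ differently.

```python
def open_quotient(area_profile, closed_threshold=0.0):
    """Percentage of frames in which the glottal area exceeds the threshold.

    Assumes the profile spans a whole number of vibration cycles.
    """
    if not area_profile:
        raise ValueError("empty glottal area profile")
    open_frames = sum(1 for area in area_profile if area > closed_threshold)
    return 100.0 * open_frames / len(area_profile)


# Hypothetical profile: one cycle of 10 frames, repeated for four cycles
one_cycle = [0, 0, 0, 0, 1, 3, 5, 3, 1, 0]
profile = one_cycle * 4
print(f"OQ = {open_quotient(profile):.1f} %")
```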

In Figure 9, objective information about the ratio of vocal fold length to width (RLW) and the ratio of glottal area height to width (RHW) during inspiration helps in drawing correct conclusions. Table 2 provides objective data for eight patients with unilateral left RLN paralysis and shows a significant difference in RLW between the two vocal folds, together with a higher open quotient (OQ) and a lower F0.
Table 2. Quantitative information for patients with unilateral left RLN paralysis
Unilateral RLN paralysis | RLW (left) | RLW (right) | RHW | OQ (%) | F0 (Hz)
Subject 1 (male, 32 y) | 6.64 | 9.91 | 12.3 | 52.1 | 112
Subject 2 (male, 24 y) | 1.42 | 3.31 | 10.6 | 62.4 | 105
Subject 3 (male, 55 y) | 3.22 | 5.44 | 7.7 | 57.5 | 127
Subject 4 (male, 47 y) | 6.01 | 7.88 | 8.5 | 61.9 | 98
Subject 5 (male, 37 y) | 3.55 | 5.66 | 11.2 | 47.8 | 121
Subject 6 (female, 21 y) | 2.11 | 6.44 | 9.9 | 77.5 | 178
Subject 7 (female, 44 y) | 2.37 | 4.82 | 6.7 | 63.9 | 196
Subject 8 (female, 57 y) | 1.25 | 3.77 | 8.9 | 75.6 | 210
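The left/right difference reported in Table 2 can be checked directly from the tabulated values. The short sketch below copies the eight rows and compares RLW on the paralyzed (left) side with the right side; the variable names are illustrative only, and no statistical test is implied.

```python
# Rows from Table 2: (subject, RLW left, RLW right, RHW, OQ %, F0 Hz)
rows = [
    ("Subject 1", 6.64, 9.91, 12.3, 52.1, 112),
    ("Subject 2", 1.42, 3.31, 10.6, 62.4, 105),
    ("Subject 3", 3.22, 5.44, 7.7, 57.5, 127),
    ("Subject 4", 6.01, 7.88, 8.5, 61.9, 98),
    ("Subject 5", 3.55, 5.66, 11.2, 47.8, 121),
    ("Subject 6", 2.11, 6.44, 9.9, 77.5, 178),
    ("Subject 7", 2.37, 4.82, 6.7, 63.9, 196),
    ("Subject 8", 1.25, 3.77, 8.9, 75.6, 210),
]


def mean(values):
    return sum(values) / len(values)


rlw_left = [row[1] for row in rows]
rlw_right = [row[2] for row in rows]
open_quotients = [row[4] for row in rows]

# Is RLW smaller on the paralyzed (left) side for every subject?
left_always_smaller = all(l < r for l, r in zip(rlw_left, rlw_right))

print(f"mean RLW left  = {mean(rlw_left):.2f}")
print(f"mean RLW right = {mean(rlw_right):.2f}")
print(f"mean OQ        = {mean(open_quotients):.1f} %")
print(f"left < right for every subject: {left_always_smaller}")
```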
4. Conclusion
The quantitative information from video images and from acoustic and EGG signals is very useful for reliable vocal assessment and therapy, both pre-operatively and post-operatively. Documents and color images should be printed for patient records and insurance purposes before and after surgery.
Acknowledgments
Research for voice disorders has been supported by Dr. Colin Watson and Prof. A. Maran at The University of Edinburgh in the UK, and Dr. Mara Behlau at the Clinical Voice Center in Brazil. The first author was supervised by Prof. H. Kasuya at Utsunomiya University in Japan and by Prof. Fred Minifie at The University of Washington in the USA.
Their expertise has provided a positive input to the design of this new technology.
References
Huang, Z., & Hu, N. (1988). Research for laryngeal cancer evaluation and diagnosis. Journal of Biomechanics, 3-2, 15-20.
Huang, Z., Minifie, F., Kasuya, H., & Lin, X. (1995). Measures of vocal function during change in vocal effort level. Journal of Voice, 9(4), 429-438.
Kasuya, H., Ogawa, S., & Kikuchi, Y. (1986). An acoustic analysis of pathological voice and its application to the evaluation of laryngeal pathology. Speech Communication, 5-2.
Speech Skill Builder for Children
Daniel Zaoming Huang, Ph.D., M.Engr.
Abstract
This paper uses a voice-activated, game-like tool to provide real-time reinforcement of a client's attempts to produce changes in pitch, loudness, voiced/unvoiced phonation, voicing onset, maximum phonation time, and sound and vowel tracking. Children, in particular, enjoy therapy with this colorful, interactive video game because they receive immediate feedback on their performance. Clinicians will enjoy the versatility and unique features of this technique. For example, while a child is playing a game, you can quickly review the graphical display or statistical data of the child's performance. The technique is divided into two groups: (1) Awareness teaches children about the attributes of their voice, and (2) Skill Builder gives the user goals to achieve for a given range and time. Examples of comprehensive user logs and of tracking clients' progress are provided. Best of all, real-time recording and playback give you the tools you need to maximize your clients' therapy.
1. Introduction
Innovative computer technologies not only serve the needs of persons with speech disorders but also help laryngologists and speech pathologists provide a more accurate and professional service. This paper offers a tool for meeting this challenge more efficiently, both for clinicians who need to produce better reports or make therapy more efficient and for the clients who count on them for the best possible treatment. With a desktop or laptop PC, a 16-bit sound card, a microphone, and speakers, the clinician has met the simple requirements for starting and operating a speech laboratory in clinical practice.
2. Methods and Procedures
Speech is a product of the interaction of respiratory, laryngeal and vocal tract structures. The larynx functions as a valve that connects the respiratory system to the airway passages of the throat, mouth and nose. The vocal tract system consists of the various passages from the glottis to the lips. It involves the pharynx, oral and nasal cavities, including the tongue, teeth, velum, and lips. The production of speech sounds through these organs is known as articulation.
For the speech skill builder, it is necessary to focus on the acoustical and physiological phenomena in both the laryngeal and vocal tract systems. Parameters such as pitch, loudness, voicing, voicing onset, phonation time, and formants are closely related to these two systems. This paper employs the Speech Therapy program, clinical software from Tiger DRS. The software provides real-time cartoon displays of continuously varying pitch, loudness, voicing, voicing onset, and phonation time, so that children receive immediate feedback on their performance while having fun. In other words, the acoustical and physiological phenomena of the child's speech can be evaluated with this technique. Its clinical application is described in detail in the following experiments.
Experiment 1: Pitch Skill Builder
Using the pitch module, clinicians can help children refine pitch control and develop smooth modulation of the pitch contour. Certain patients unconsciously or consciously make an effort to raise or lower their pitch. The clinician should teach the patient to target the optimum pitch through control of vocal fold vibration. For example, one of the best ways to refine pitch control is the rise-fall pitch technique. In Figure 1 (a), as the child sustains /a/ in front of the microphone, the boat moves around the rocks following a rise-fall pitch pattern. With this game, children receive immediate feedback on their pitch performance. After the game, the clinician can review objective information about pitch control, as shown in Figure 1 (b). In clinical practice, the clinician may select different pitch patterns for the different needs of patients.

Pitch measures provide information about intonation. Pitch is determined mainly by the rate of vocal fold vibration. In the Pitch Skill Builder, the clinician should help the patient find the optimum pitch and pitch range and learn how to maintain them. In clinical practice, a complete statistical report before and after therapy is important. Table 1 lists the pitch changes during three weeks of therapy with the pitch skill builder technique for male patients with female-sounding voices. The effect of speech therapy is evident.
Table 1: Pitch changes during three-week therapy
 | Patient 1 (male, 12 y) | Patient 2 (male, 15 y) | Patient 3 (male, 17 y) | Therapy Technique
 | 282 Hz | 325 Hz | 208 Hz |
 | 253 Hz | 287 Hz | 181 Hz |
 | 232 Hz | 262 Hz | 157 Hz |
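Pitch changes are often easier to compare across patients when expressed in semitones rather than in Hz. The sketch below converts the values of Table 1, assuming that the three readings for each patient are listed in chronological order (the row labels are not available); the function name is illustrative only.

```python
import math


def semitone_change(f_start_hz, f_end_hz):
    """Signed pitch change in semitones; negative means the pitch was lowered."""
    return 12.0 * math.log2(f_end_hz / f_start_hz)


# F0 readings from Table 1, assumed to be in chronological order
f0_by_patient = {
    "Patient 1": [282, 253, 232],
    "Patient 2": [325, 287, 262],
    "Patient 3": [208, 181, 157],
}

changes = {
    name: semitone_change(values[0], values[-1])
    for name, values in f0_by_patient.items()
}
for name, change in changes.items():
    print(f"{name}: {change:+.2f} semitones")
```

Under this reading, each patient lowered his pitch by roughly three to five semitones over the course of therapy.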
Experiment 2: Loudness Skill Builder
Using the loudness module, clinicians can help children lower the loudness of their speech when the usual level is too high and raise it when the usual level is too low. The clinician should teach the patient to control loudness through correct control of breathing and body position. In Figure 2 (a), as loudness increases with good body position, the fireman climbs higher toward the top target. With this game, children receive immediate feedback on how loudness changes with body position (standing vs. sitting). After the game, the clinician can review the loudness data for each position (standing vs. sitting), as shown in Figure 2 (b). The top target corresponds to a loudness level that can be modified by the clinician.

Loudness measures provide information about syllable stress. Loudness is determined mainly by the intensity of vocal fold vibration. In the Loudness Skill Builder, the clinician should find the best way for each patient to reach the target. Table 2 lists the loudness changes during seven weeks of therapy with the loudness skill builder technique for patients with right RLN paralysis. The effect of speech therapy is evident.
Table 2: Loudness changes during seven-week therapy
 | | | | Therapy Technique
 | 61.1 dB | 66.5 dB | 68.2 dB |
 | 63.4 dB | 67.1 dB | 69.1 dB |
 | 66.2 dB | 67.8 dB | 71.3 dB |
Experiment 3: Voicing Skill Building
The voicing module helps children assess their voiced and unvoiced phonation on the computer screen. Voicing refers to the vocal behavior that regulates the conversion of continuous airflow into a series of glottal pulses. Voiced phonation, such as /z/, is produced with vocal fold vibration, while voiceless phonation, such as /s/, is produced without it. For example, one way to feel voicing is to produce phoneme pairs such as /s, z/ or /f, v/. In Figure 3, when you produce a voiced sound, a red mouse comes in from the left side; when you produce a voiceless sound, a green mouse appears from the right side.
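A common heuristic for separating voiced from voiceless frames is the zero-crossing rate: a voiced sound is dominated by a low-frequency periodic component and crosses zero rarely, while a fricative such as /s/ is noise-like and crosses zero often. The sketch below illustrates that heuristic on synthetic frames; it is not a description of how the Speech Therapy software makes the decision, and the thresholds, sample rate, and test signals are assumptions chosen for illustration.

```python
import math
import random


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def classify_frame(frame, zcr_threshold=0.1, energy_floor=1e-4):
    """Label a frame 'silence', 'voiced' or 'voiceless' (thresholds are assumptions)."""
    energy = sum(x * x for x in frame) / len(frame)
    if energy < energy_floor:
        return "silence"
    return "voiced" if zero_crossing_rate(frame) < zcr_threshold else "voiceless"


SAMPLE_RATE = 16000   # Hz, assumed
FRAME_LEN = 320       # 20 ms at 16 kHz

# Synthetic stand-ins: a 200 Hz tone for a voiced sound, white noise for /s/
rng = random.Random(0)
voiced_frame = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(FRAME_LEN)]
noise_frame = [rng.uniform(-1.0, 1.0) for _ in range(FRAME_LEN)]
silent_frame = [0.0] * FRAME_LEN

for name, frame in [("tone", voiced_frame), ("noise", noise_frame), ("silence", silent_frame)]:
    print(name, "->", classify_frame(frame))
```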

Voicing measures provide information about the phonatory pattern. Using the voicing onset module, clinicians can help children modify glottal attacks that occur before the supraglottal articulatory event.
Experiment 4: Voicing Onset Skill Building
Using the voicing onset module, the clinician can help children control vocal fold attacks correctly. In Figure 4, when you initiate a voiced phonation, a flower opens. If you say /ba/, /po/, the first flower opens at the beginning of /b/, and the second flower opens at the beginning of /o/, because /p/ is a voiceless phoneme.

Voicing onset provides information about glottal attacks. How fast can you make the ten flowers open? What happens if you sustain a vowel but have voice breaks? All these cases depend on the voicing onset.
Experiment 5: Phonation Time Skill Building
The term Maximum Phonation Time (MPT) refers to how long one can sustain phonation. Patients are instructed to sustain the vowel /a/ or another vowel as long as possible following a deep inspiration. MPT is decreased in many pathological states of the larynx, especially in cases with incompetent glottal closure. MPT values smaller than 10 seconds should be considered abnormal. The clinician should show the patient the best way to coordinate respiration and phonation. In Figure 5, the strawberry moves from left to right for as long as you sustain phonation after a deep inspiration. The target to reach is on the right side. The target setting can be changed to suit the needs of the patient.
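Given a frame-by-frame voicing decision, MPT can be estimated as the longest unbroken run of voiced frames. The sketch below shows this under assumed names and an assumed 10 ms frame step, and applies the 10-second criterion mentioned above; it is an illustration, not the software's procedure.

```python
def max_phonation_time(voiced_flags, frame_s=0.01):
    """Longest run of consecutive voiced frames, in seconds (frame step is assumed)."""
    longest = current = 0
    for is_voiced in voiced_flags:
        current = current + 1 if is_voiced else 0
        longest = max(longest, current)
    return longest * frame_s


def is_abnormal(mpt_s, threshold_s=10.0):
    """Criterion from the text: MPT below 10 seconds is considered abnormal."""
    return mpt_s < threshold_s


# Hypothetical recording: 0.5 s silence, 8 s phonation, a short break, 1 s phonation
flags = [False] * 50 + [True] * 800 + [False] * 20 + [True] * 100
mpt = max_phonation_time(flags)
print(f"MPT = {mpt:.1f} s, abnormal: {is_abnormal(mpt)}")
```

Taking the longest run rather than the total voiced time means that a voice break ends the measured phonation, which matches the instruction to sustain the vowel as long as possible.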

Experiment 6: Speech Articulation
Speech articulation within the vocal tract is determined by three major factors: the place of the major constriction, the degree of constriction at that point, and the lip constriction, as shown in Figure 6 and Figure 7. The vocal tract shape and lip movement are provided for each vowel and consonant. In clinical practice, a brief introduction to speech articulation (tongue and lip movement) should be given before therapy.

Real-time vowel space training displays the first and second formants of the speech input. With this tool, the clinician can show the patient, on the computer screen, the effect of the place of the major constriction in the vocal tract. Tongue tip movement mainly determines changes in the second formant. For example, when a child produces the vowel series /I-e-æ-a-u/, the vowel tracking appears as in Figure 8. From the graphic display, the clinician can quickly judge tongue tip position and phonetic accuracy.
Experiment 7: Sound Awareness
In the Sound Awareness module, children learn to recognize a normal speech level. It is also important for children to understand the difference among non-speech, speech, whistle, and hiss. In Figure 9, the clinician can help patients understand how much loudness or effort is needed to move the graphic. The sound threshold can be set to correspond to a normal conversational speech level. If you set it too high, the object might not move at all.

3. Conclusion
The demands of speech therapy in a hospital require simple and well-defined therapy and assessment techniques. Where pitch, intonation, stress, loudness, and articulation are of primary interest, a good and efficient speech therapy tool, such as the Speech Therapy software, is essential in clinical practice.
Acknowledgments
Research for voice disorders has been supported by Dr. Colin Watson at The University of Edinburgh (UK) and Prof. Wei Wang at Shanghai EENT Hospital (China). Research for speech acoustics has been advised by Prof. Fred Minifie at The University of Washington (USA). Their expertise has provided a positive input to the design of this technique.
Use and Understanding Voice Lab for Singers
Daniel Zaoming Huang, Ph.D.
1. Introduction
The success of opera made it necessary to seek out new singers and to develop singing training in order to secure growth and continuity in the new artform. To the singer, the voice appears to dissociate itself from the larynx and become present in the resonant resource of the vocal tract and within the acoustics of the opera house.
The different styles of classical, musical theater, and pop music require the singer to produce a wide range of tonal qualities. For classical singers, there is a demand to maximize the resonance of the voice and to find a mode of production that accommodates both resonance and clear diction over a wide working range of pitches. For musical theater singers, the major demand appears to be direct communication of the text; in this style, tonal quality always takes second place to diction and the word. In the pop field, the sound demanded can range all the way from coarse and frantic to mellow, laid back, and sentimental (Miller 1959). Within this range of demands, diction can go from completely unintelligible to crystal clear. The world of pop music covers a wide range of performance styles, and within this diverse field one can encounter performers whose skill levels range from the advanced to the untutored and technically inept.
Over the last decade, understanding of the singing voice has advanced rapidly through the availability of computer-based instrumentation. Since a wide variety of tonal and articulatory demands are reflected in the different vocal genres, it seems necessary to provide a quantitative procedure that can evaluate singing voice quality and monitor the effect of singing training in the modern voice studio. The purpose of this paper is to provide an affordable, portable voice lab to voice teachers and singers. Designed to let voice teachers and singers use powerful techniques of voice analysis and training without incurring the cost of special hardware, Dr. Speech is a Microsoft Windows software system that makes use of the standard multimedia capability of today's personal computers. With this affordable voice lab, some general guidelines can be given for identifying poor vocal production based on the vocal sound.
2. Vocal Parameters for Singers
Generally speaking, the vocal sound is acoustically characterized by diction, tone quality, range, pitch, vibrato, and the singer's formant (Minifie, Hixon, & Williams 1973). Since vibrato and the singer's formant can be recognized instantly by a singing teacher, these two features were investigated.
2.1 Vibrato
Vibrato is a 4-6 Hz tremor which appears gradually as a singer develops the neuromuscular ability to sustain vowels in a resonant vocal tract and against substantial transglottal pressure. Vibrato is an essential part of the musical quality of the voice and is not controllable other than by controlling sub-glottal pressure.
The acoustic features of vibrato (Huang, Minifie, Kasuya & Lin 1995) were investigated with the Dr. Speech software. In figure 1, the narrow-band spectrogram is excellent for assessing the location of the available formants and observing the relative strength of the harmonics of the sung tone. The positioning of these formants establishes the identity of vowels and the musical quality of the singer's voice. The spectrogram shows that the excessive amplitude of the vibrato is contained by the low formants. The vibrato excursion is multiplied by the harmonic number, and the first five harmonics show this increase.
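The two properties described above, a modulation rate of a few hertz and an excursion that scales with harmonic number, can be illustrated with a synthetic F0 contour. The sketch below generates a sinusoidally modulated contour, estimates the vibrato rate from its upward crossings of the mean, and computes the excursion of the k-th harmonic as k times the F0 excursion. All parameter values and function names are hypothetical, and the rate estimator is a simple stand-in rather than the method used by Dr. Speech.

```python
import math


def vibrato_contour(f0_hz=220.0, rate_hz=5.0, extent_hz=6.0, duration_s=2.0, frame_s=0.01):
    """Synthetic F0 contour with sinusoidal vibrato, sampled at frame midpoints."""
    n_frames = int(round(duration_s / frame_s))
    return [
        f0_hz + extent_hz * math.sin(2 * math.pi * rate_hz * (n + 0.5) * frame_s)
        for n in range(n_frames)
    ]


def vibrato_rate(contour, frame_s=0.01):
    """Estimate the modulation rate (Hz) from upward crossings of the mean F0."""
    mean_f0 = sum(contour) / len(contour)
    ups = [i for i in range(1, len(contour)) if contour[i - 1] < mean_f0 <= contour[i]]
    if len(ups) < 2:
        raise ValueError("need at least two vibrato cycles")
    return (len(ups) - 1) / ((ups[-1] - ups[0]) * frame_s)


def harmonic_excursion(harmonic_number, extent_hz):
    """Frequency excursion of the k-th harmonic: k times the F0 excursion."""
    return harmonic_number * extent_hz


contour = vibrato_contour()
print(f"estimated vibrato rate: {vibrato_rate(contour):.2f} Hz")
for k in range(1, 6):
    print(f"harmonic {k}: +/- {harmonic_excursion(k, 6.0):.0f} Hz")
```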


Fig. 1. The narrow-band spectrogram shows a large vibrato as a quasi-sinusoidal modulation of the harmonics. (From Dr. Speech)
Fig. 2. The pitch and intensity displays show the vibrato cycle with negative and positive movements. (From Dr. Speech)
In figure 2, the pitch and intensity display clearly indicates the effect of vibrato as a means of seeking out formants in order to enhance the musical quality of the sung tone. This plot may be associated with the points where harmonics coincide with formants in the negative- and positive-going sweeps of the vibrato. It can therefore be seen that limited vibrato can maximize the singer's access to the resonant resources of the vocal tract.
2.2 Singer's Formant
In acoustic terms, the advantages gained by a singer with enhanced higher harmonics are very substantial. The human ear is relatively sensitive in the range 1000 Hz-5000 Hz. In effect, a singer with poor support in the higher harmonics loses out in audibility in the opera house and may attempt to compensate by forcing the voice. The 2500 Hz-3500 Hz "Singer's Formant" lies within a band that successfully clears the masking potential of the classical orchestra.
The acoustic features of the singer's formant were investigated with the Dr. Speech software. Figure 3 provides the LPC spectral display. The first formant lies at 420 Hz. The second formant is at 1840 Hz, and the third formant is at 2540 Hz. The fourth formant is around 3250 Hz and assists the third formant in establishing a higher than normal strength in the higher harmonics in the case of low-pitched opera singers. Adduction of the third and fourth formants is but one explanation for the singer's higher harmonic achievements. Figure 4 shows the power spectrum of the sung vowel, positioned against the harmonic distribution.


Fig. 3. LPC spectrum estimates formants and the singer's formant. (From Dr. Speech)
Fig. 4. Power spectrum (LTAS) estimates harmonic distribution. (From Dr. Speech)
3. Vocal Training in the Singer's Studio
The computer industry's huge investment in sound features is driving development that should reduce the need for purchasers to pay the high cost of specialized signal processing hardware. With affordable Dr. Speech software, vocal training becomes simple.
3.1 Real-time F0 Training
F0 training allows real-time pitch extraction from the acoustic input. The model-matching feature can be used for target modeling. For the singer and voice teacher, this function displays the dynamic range of the human voice in terms of F0. In figure 5, the F0 pattern of the instructor is stored on the computer screen in blue, and the student can then compare an attempt to match the instructor's pattern, traced in red.
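Model matching of this kind can also be scored numerically. The sketch below measures how far a student's F0 track deviates from the instructor's as a root-mean-square difference in semitones, assuming both tracks are sampled at the same instants and contain only voiced frames. The scoring rule and the function name are assumptions added for illustration; the paper describes only the visual comparison.

```python
import math


def semitone_rms_error(model_f0, attempt_f0):
    """RMS difference, in semitones, between two equal-length F0 tracks (Hz)."""
    if not model_f0 or len(model_f0) != len(attempt_f0):
        raise ValueError("tracks must be non-empty and of equal length")
    squared = [
        (12.0 * math.log2(attempt / model)) ** 2
        for model, attempt in zip(model_f0, attempt_f0)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical tracks: an instructor's rising pattern and a slightly sharp attempt
instructor = [220.0, 233.1, 246.9, 261.6, 277.2]
student = [224.0, 236.0, 252.0, 266.0, 280.0]
print(f"RMS error: {semitone_rms_error(instructor, student):.2f} semitones")
```

Working in semitones rather than hertz keeps the score comparable across voices with different pitch ranges.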


Fig. 5. In the model-matching mode, real-time pitch training is useful for both vocal training and voice teaching. (From Dr. Speech)
Fig. 6. Real-time plotting of formants is useful for showing the singer's formant in both vocal training and voice teaching. (From Dr. Speech)
3.2 Real-time Formant Training
The real-time formant display (also called the LPC spectrum) graphically reveals vowel formants and bandwidths. The singer's formant can be observed dynamically. This real-time training feature provides a powerful tool for the singer and voice teacher, who can easily assess vocal ability on the computer screen. Figure 6 shows a plot of the real-time formant display. It also clearly displays the difference between the vowels /i/ and /a/ (/a/ in blue, /i/ in red).
3.3 Real-time Spectrogram Training
A real-time spectrogram with visual feedback to the singer provides major advantages for acoustic assessment of the singer's voice. Figure 7 provides a real-time narrowband spectrogram display, illustrating vibrato with a regular harmonic pattern in the formant range. The real-time wideband spectrogram in figure 8 is characterized by the singer's formant with a highly periodic fundamental frequency.


Fig. 7. Real-time narrowband spectrogram display is useful for showing vibrato. (From Dr. Speech)
Fig. 8. Wideband spectrogram display in real time is useful for showing the singer's formant. (From Dr. Speech)
Research for voice disorders has been supervised by Prof. H. Kasuya at Utsunomiya University in Japan. Research for speech acoustics has been supervised by Prof. Fred Minifie at The University of Washington in the USA.
References
Huang, Z., Minifie, F., Kasuya, H., & Lin, X. (1995). Measures of vocal function during change in vocal effort level. Journal of Voice, 9(4), 429-438.
Miller, J.D. (1959). Nature of the vocal cord wave. Journal of the Acoustical Society of America, 31, 667-677.
Minifie, F., Hixon, T. J., & Williams, F. (1973). Normal Aspects of Speech, Hearing, and Language. Prentice-Hall, Inc.
Relationship Between Acoustic Measures of Voice and Judgments of Voice Quality
Daniel Zaoming Huang, Ph.D.
The goal of the present study was to develop non-invasive techniques for the assessment of voice disorders and for monitoring the effects of voice therapy. However, before acoustic algorithms can be used in clinical applications, it is important to understand the relationship between clinical perceptions of voice quality and the acoustic measures obtained from normal and pathological voice signals. This section focuses on how quantitative acoustic parameters predict qualitative perceptual judgments of voice quality (e.g., perceived degree of breathiness, harshness, and hoarseness).
1. Purposes
Two basic approaches exist for studying the relationship between acoustic parameters and perceptual dimensions: the Analysis-Perception approach and the Synthesis-Perception approach.
This study employed the Synthesis-Perception approach. The first step required in this approach is to develop a voice synthesizer capable of: (1) producing natural sounding vowels, (2) simulating normal voice production, and (3) simulating pathological voice production (Huang, & Hu, 1988; Huang, et al., 1992; 1994; Kasuya, 1989; 1990; Orlikoff, & Huang, 1991). Such a synthesizer has been developed as a part of this study. Included in the voice synthesis system are algorithms permitting F0 variation, period perturbation (jitter), amplitude perturbation (shimmer), simulated magnitudes of glottal noise (NNE), spectral tilt and so on. Each of these parameters can be varied relatively independently. This capability allows the investigator to study the perceptual consequences of varying only one acoustic parameter at a time.
The second step in this research was to compare the acoustic properties of synthesized vowels with judgments of voice quality. Thus, this study was designed to investigate the relationships among selected acoustic measures (F0, jitter, shimmer, glottal noise, spectral tilt, and formant flutter) and voice quality judgments (breathiness, harshness, and hoarseness).
More specifically, this study systematically investigated six aspects of acoustic-perceptual relationships during vowel production.
These six acoustic measures are assumed to reflect a direct relationship to laryngeal function during voice production. The importance of these relationships has led otolaryngologists and voice clinicians to consider using these non-invasive acoustic measures to obtain diagnostically significant information and to monitor the effects of voice therapy. The potential application of these acoustic measures may be enhanced once it is understood how they relate to judgments of voice quality.
2. Development of a voice synthesizer
Since early 1988, we have been developing a software-based voice synthesizer. The synthesizer is designed on the basis of Klatt's glottal-source model. Figure 1 presents a block diagram of the voice synthesizer used in these experiments.

By controlling selected acoustic and/or physiologic parameters in this special synthesizer, it is possible to: (1) simulate normal voice production, (2) simulate some types of pathological voice production, and (3) investigate which of these parameters are appropriate for use in the analysis of the pathological voices. This last issue requires comparisons between acoustic variations in the synthesized vocalizations and voice quality judgments.
With this voice synthesizer, each vowel can be synthesized with a specified amount of each acoustic parameter (for example, F0, jitter, shimmer, glottal noise, spectral tilt, and formant flutter). It is therefore possible to specify a desired combination of acoustic parameters and the magnitude of each parameter. The control parameters are grouped into three parts: glottal source, pitch contour control, and formant frequencies. The control parameters of the voice synthesizer are displayed in Figure 2.

The terms identified below are defined at length in Appendix A.
In the pitch contour part of the voice synthesizer, there are six ways to control the pitch contour: (1) linear flat: no change in F0, so only the desired F0 is specified; (2) linear increase in F0: the beginning and ending F0 are specified; (3) linear decrease in F0: the beginning and ending F0 are specified; (4) linear broken F0: the investigator can design various F0 contours by specifying the onset F0, the minimum F0, the maximum F0, the rise time (RT) in ms to go from the onset F0 to the maximum F0, and the fall time (FT) in ms to go from the maximum F0 to the minimum F0, where the difference between the maximum data length (ms) and RT + FT is the amount of time the voice signal remains at the maximum F0; (5) archetypal F0: a typical rise-fall contour that roughly follows changes in subglottal pressure; and (6) one of the four tonal patterns used in Mandarin Chinese (with a built-in dynamic F0 contour): flat, rise, fall-rise, or fall. In each of these tonal patterns, a minimum and maximum F0 can be specified for each vowel being simulated.
In the glottal source section of the voice synthesizer, the experimenter can specify the sampling frequency, maximum data length (total duration of the synthesized vowel), pitch flutter for simulating jitter, amplitude of voicing (AV), open quotient (OQ), spectral tilt (ST), amplitude flutter (FA) for simulating shimmer, voice rising time (VRT), voice falling time (VFT), maximum of voltage (MOV), low frequency gain (LFG), high frequency gain (HFG), high frequency gain begin (HFB) and high frequency gain end (HFE).
In the formant frequency part of the synthesizer, there are controls for formant frequency (F1, F2, F3, F4, F5), formant bandwidth (B1, B2, B3, B4, B5), the number of formants, and formant flutter. This easy-to-use voice synthesizer provides a flexible tool for generating realistic-sounding vowels.
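The linear pitch contour options described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the synthesizer's actual code: the function names, the one-value-per-millisecond resolution, and the rise, hold, fall ordering of the broken contour are our own choices for the example.

```python
def linear_contour(f0_start, f0_end, n_frames):
    """Options (1)-(3): flat when start == end, otherwise a linear rise or fall."""
    if n_frames == 1:
        return [float(f0_start)]
    step = (f0_end - f0_start) / (n_frames - 1)
    return [f0_start + step * i for i in range(n_frames)]


def broken_contour(onset, f0_max, f0_min, rt_ms, ft_ms, total_ms):
    """Option (4), one F0 value per millisecond (assumed resolution).

    Assumed ordering: rise from onset to maximum over RT, hold at the
    maximum, then fall from maximum to minimum over FT.
    """
    hold_ms = total_ms - (rt_ms + ft_ms)  # time spent at the maximum F0
    if hold_ms < 0:
        raise ValueError("RT + FT exceeds the maximum data length")
    rise = linear_contour(onset, f0_max, rt_ms)
    hold = [float(f0_max)] * hold_ms
    fall = linear_contour(f0_max, f0_min, ft_ms)
    return rise + hold + fall


# Hypothetical settings: 100 Hz onset, 150 Hz maximum, 90 Hz minimum,
# 100 ms rise, 150 ms fall, 500 ms total, leaving 250 ms at the maximum.
contour = broken_contour(100, 150, 90, rt_ms=100, ft_ms=150, total_ms=500)
```

With these hypothetical values the hold portion lasts 500 - (100 + 150) = 250 ms, which is the relationship between RT, FT, and the maximum data length stated in the text.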
3. Synthesis-Perceptual approach: Experiment
The purpose of the Experiment was to generate a series of synthetic vowels where only one acoustic parameter at a time would be changed, and then evaluate the effects of these changes on judgments of breathiness, harshness, and hoarseness.
1) Stimuli
Thirty-six tokens of sustained /ae/ as in "bat" were synthesized for this experiment. The five formant frequencies selected for /ae/ were 660, 1720, 2410, 3500, and 4400 Hz (Peterson & Barney, 1952). The five bandwidths were 75, 75, 110, 120, and 120 Hz, respectively. The 500-ms vowel stimuli were synthesized at an equal overall RMS sound pressure level of 60 dB (with a 40-ms voice amplitude rising time and a 40-ms voice amplitude falling time). The stimuli were synthesized at a 44,100 Hz sampling frequency with 16-bit resolution.
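To make the stimulus specification concrete, the sketch below passes an impulse train through a cascade of second-order resonators set to the formant frequencies and bandwidths listed above. It is a simplified illustration and not the Dr. Speech synthesizer: the impulse-train source, the unity-gain resonator form commonly used in Klatt-style synthesizers, the linear amplitude ramps, and all function names are assumptions made for the example, and level calibration to 60 dB is omitted.

```python
import math


def resonator(signal, freq_hz, bw_hz, fs):
    """Second-order digital resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to 1
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(f0_hz=125.0, fs=44100, dur_ms=500, ramp_ms=40,
                     formants=(660, 1720, 2410, 3500, 4400),
                     bandwidths=(75, 75, 110, 120, 120)):
    n = int(fs * dur_ms / 1000)
    period = fs / f0_hz  # samples per glottal cycle (may be fractional)
    # Simplified voicing source: one unit impulse per glottal cycle.
    source = [0.0] * n
    next_pulse = 0.0
    while next_pulse < n:
        source[int(next_pulse)] = 1.0
        next_pulse += period
    # Cascade the five formant resonators.
    signal = source
    for f, bw in zip(formants, bandwidths):
        signal = resonator(signal, f, bw, fs)
    # Linear amplitude ramps for the 40-ms rise and fall.
    ramp = int(fs * ramp_ms / 1000)
    for i in range(ramp):
        gain = i / ramp
        signal[i] *= gain
        signal[n - 1 - i] *= gain
    return signal


wave = synthesize_vowel()  # 500 ms at 44,100 Hz gives 22,050 samples
```

A flat 125-Hz F0 is used as the default because that is the F0 held constant in most stimulus groups; jitter, shimmer, noise, spectral tilt, and formant flutter are deliberately left out of this sketch.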
Six groups of stimuli were synthesized for use in this study (refer to Table 1 for the values of F0, jitter, shimmer, NNE, spectral tilt, and formant flutter created for the Experiment). The first group had six synthesized samples of /ae/, each having a different fundamental frequency (F0). The F0 ranged from 100 Hz to 150 Hz in steps of 10 Hz. The second group had eight synthesized samples of /ae/, each having a different F0 jitter level. The third group had eight synthesized samples of /ae/ with a maximum amplitude of 10000 points. Each token was produced at a different shimmer level. The fourth group had eight synthesized samples of /ae/ with a 60 dB amplitude of voicing. Each token was produced with a different glottal noise energy level. The fifth group of stimuli included three synthesized samples of /ae/, each having a different spectral tilt ranging from 0 to 6 dB with a step size of 3 dB. The sixth group of stimuli had three synthetic vowel samples, each having a different value of formant flutter (ranging from 5 to 15 % with a step size of 5 %). Thus, a total of 36 synthetic vowels were generated with Dr. Speech Science for Windows software (Tiger Electronics, Inc.). These stimuli were stored on the computer's hard disk and on high-quality audio tape (CrO2).
Table 1. Acceptable levels for six control parameters in voice synthesis
Level | F0 (Hz) | Jitter (%) | Shimmer (%) | NNE (dB) | Spectral Tilt (dB) | Formant Flutter (%)
1 | 100 | 0.00 | 0.51 | -23.63 | 0 | 5
2 | 110 | 0.35 | 0.77 | -22.25 | 3 | 10
3 | 120 | 0.51 | 1.36 | -20.25 | 6 | 15
4 | 130 | 0.80 | 1.92 | -17.35 | |
5 | 140 | 1.02 | 2.67 | -13.96 | |
6 | 150 | 1.35 | 3.28 | -10.64 | |
7 | | 1.79 | 3.93 | -7.33 | |
8 | | 2.06 | 4.84 | -4.82 | |
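The one-parameter-at-a-time design in Table 1 can be restated as data, which also confirms the total of 36 tokens. The dictionary below simply transcribes the levels from Table 1; the variable names are ours.

```python
# Levels transcribed from Table 1; each stimulus varies exactly one parameter.
levels = {
    "F0 (Hz)": [100, 110, 120, 130, 140, 150],
    "jitter (%)": [0.00, 0.35, 0.51, 0.80, 1.02, 1.35, 1.79, 2.06],
    "shimmer (%)": [0.51, 0.77, 1.36, 1.92, 2.67, 3.28, 3.93, 4.84],
    "NNE (dB)": [-23.63, -22.25, -20.25, -17.35, -13.96, -10.64, -7.33, -4.82],
    "spectral tilt (dB)": [0, 3, 6],
    "formant flutter (%)": [5, 10, 15],
}

# One (parameter, value) pair per synthesized token.
stimuli = [(param, value) for param, values in levels.items() for value in values]
group_sizes = {param: len(values) for param, values in levels.items()}
```

The group sizes are 6, 8, 8, 8, 3, and 3, which sum to the 36 tokens reported in the text.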
2) Presentation of Stimuli
The 36 vowel tokens were randomly arranged for presentation to listeners. Each vowel token was played three times, with an interstimulus interval of 1 second. A 3.0-second silence was included between successive blocks of three repeated vowel presentations to allow listeners sufficient time to rate the voice quality of the stimulus. The order of presentation of stimuli is schematized below.
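The timing of the presentation described above can be sketched as a simple schedule. Several details are assumptions made only for this illustration: the interstimulus interval is taken to run from the offset of one token to the onset of the next, the 3-second silence is placed after every block except the last, and the random seed is arbitrary.

```python
import random


def presentation_schedule(n_tokens=36, repeats=3, token_s=0.5,
                          isi_s=1.0, gap_s=3.0, seed=0):
    """Return (events, total_seconds); each event is (onset_s, token_id, repetition)."""
    order = list(range(1, n_tokens + 1))
    random.Random(seed).shuffle(order)  # random arrangement of the tokens
    events = []
    t = 0.0
    for block, token in enumerate(order):
        for rep in range(repeats):
            events.append((t, token, rep + 1))
            t += token_s                 # 500-ms vowel
            if rep < repeats - 1:
                t += isi_s               # 1-s interval within a block
        if block < n_tokens - 1:
            t += gap_s                   # 3-s rating silence between blocks
    return events, t


events, total_s = presentation_schedule()
```

Under these assumptions each block lasts 3 x 0.5 s + 2 x 1 s = 3.5 s, so a full pass through the 36 tokens takes 36 x 3.5 s + 35 x 3 s = 231 s for each voice quality judged.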
3) Listeners and perceptual ratings
Eight laryngologists served as listeners for this experiment. Three were from Shanghai University of Technology, four were from Shanghai ENT hospital, and one was from a private clinic in New York City. These eight listeners (45-55 years old) were well-trained laryngologists. Each of the listeners had extensive clinical experience with voice disorders in a hospital setting. It was reasoned that these clinical laryngologists would be well qualified to judge the voice qualities of breathiness, harshness and hoarseness.
The seven laryngologists in Shanghai performed the listening tasks in a quiet room. The stimuli were presented through two audio speakers at a comfortable loudness level. The seven listeners were required to judge only one voice quality at a time. First, they judged breathy voice quality. Next, they judged harsh voice quality. Finally, they judged hoarse voice quality. All listeners participated in a brief training period prior to the beginning of the experiment. The same rating form was used to evaluate all three of the perceptual dimensions: hoarse, harsh, and breathy voice quality. The rating form used a four-point, equal-appearing-intervals scale to rate each voice quality (Hirano, 1981; Kasuya, 1986). The values on the scale were: "0" normal, "1" slight, "2" moderate, and "3" extreme. After the listeners heard each block of three samples directly from the computer, they checked an appropriate answer on the rating form (see Appendix B for a copy of the rating form). The laryngologist in New York performed the same procedure using the audio recording.
4) Results
The results of this experiment are shown in the following figures and tables. The values reported are the mean ratings for breathiness, harshness, and hoarseness from the eight listeners.
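As a reading aid for the tables that follow, the sketch below shows how one table entry could arise from eight ratings on the 0-3 scale. The ratings are hypothetical, not the listeners' data. With eight listeners every mean is a multiple of 0.125, and an entry such as 0.88 (0.33) would be consistent with seven ratings of "1" and one of "0" if the standard deviation is computed with the population formula (dividing by n), which is our inference rather than something the text states.

```python
from statistics import mean, pstdev, stdev

# Hypothetical ratings from eight listeners for one stimulus (0-3 scale).
ratings = [1, 1, 1, 1, 1, 1, 1, 0]

rating_mean = mean(ratings)         # 7/8
population_sd = pstdev(ratings)     # divides by n
sample_sd = stdev(ratings)          # divides by n - 1, shown for comparison
```

For these hypothetical ratings the population formula rounds to 0.33, whereas the sample formula gives a slightly larger value, which is why the distinction matters when the tabulated standard deviations are reused in further calculations.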
a. Relationship between fundamental frequency and vocal quality
This first group of vowel stimuli consisted of six vowel samples with fundamental frequencies ranging from 100 Hz to 150 Hz in 10-Hz steps, as shown in Table 2, while the other parameters were held constant at normal levels (jitter = 0.3 %, shimmer = 1 %, glottal noise energy = 50 dB, amplitude of voicing = 60 dB, spectral tilt = 0 dB, and formant flutter = 0 %). The mean hoarseness, harshness and breathiness ratings from the eight listeners across the six fundamental frequency (F0) levels are shown in Table 2.
Table 2. Means and Standard Deviations of voice quality ratings at different F0 levels
F0 Level | Hoarseness | Harshness | Breathiness
1 (100 Hz) | 0.50 (0.50) | 0.88 (0.33) | 0.00 (0.00)
2 (110 Hz) | 0.50 (0.50) | 0.63 (0.48) | 0.25 (0.43)
3 (120 Hz) | 0.38 (0.48) | 0.50 (0.50) | 0.38 (0.48)
4 (130 Hz) | 0.25 (0.43) | 0.50 (0.50) | 0.38 (0.48)
5 (140 Hz) | 0.25 (0.43) | 0.38 (0.48) | 0.50 (0.50)
6 (150 Hz) | 0.13 (0.33) | 0.13 (0.33) | 0.75 (0.43)
A one-way analysis of variance (ANOVA) was used to determine whether the voice quality ratings differed significantly across the six fundamental frequency (F0) levels. For hoarse voice quality, no significant differences existed among the six F0 levels at the 0.05 level, and a Tukey post hoc test of all pairwise differences among the means indicated no significant differences. For harsh voice quality, no significant main effect existed among the six F0 levels at the 0.05 level, but a Tukey post hoc test of all pairwise differences among the means indicated a significant difference between the mean ratings for level 1 (F0 = 100 Hz) and level 6 (F0 = 150 Hz). For breathy voice quality, no significant main effect existed among the six F0 levels at the 0.05 level, but a Tukey post hoc test of all pairwise differences among the means indicated a significant difference between level 1 (F0 = 100 Hz) and level 6 (F0 = 150 Hz).
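For readers who wish to check the analysis from the published summary values, a one-way ANOVA F statistic can be recomputed from group means, standard deviations, and group size. The sketch below does this for the harshness ratings in Table 2. It rests on assumptions that the text does not confirm: that each of the eight listeners contributed one rating per level, that the groups are treated as independent, and that the tabulated standard deviations are population values (so the within-group sum of squares is n times the squared SD). Because the inputs are rounded, the result is approximate.

```python
def anova_f_from_summary(means, sds, n_per_group):
    """One-way ANOVA F from per-group means and population SDs (equal n)."""
    k = len(means)
    n_total = k * n_per_group
    grand_mean = sum(means) / k  # valid because all groups have the same n
    ss_between = n_per_group * sum((m - grand_mean) ** 2 for m in means)
    ss_within = n_per_group * sum(sd ** 2 for sd in sds)  # assumes population SDs
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Harshness column of Table 2 (levels 1-6), eight listeners per level.
harsh_means = [0.88, 0.63, 0.50, 0.50, 0.38, 0.13]
harsh_sds = [0.33, 0.48, 0.50, 0.50, 0.48, 0.33]
f_stat, df1, df2 = anova_f_from_summary(harsh_means, harsh_sds, 8)
```

Under these assumptions the F statistic has 5 and 42 degrees of freedom and comes out at about 2.2, which falls below the 0.05 critical value of roughly 2.4 and is therefore in line with the non-significant main effect reported for harshness.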
Bar displays of the breathy, harsh, and hoarse ratings at different fundamental frequency levels are shown in Figure 3.
Figure 3. Breathy, harsh, and hoarse ratings at different fundamental frequency levels for synthetic /ae/ tokens.
A series of Pearson Product Moment correlations were computed to show the strength and direction of the relationship among the manipulated F0 and the hoarseness, harshness and breathiness judgments. A correlation matrix showing the interrelationships among F0 and hoarseness, harshness and breathiness judgments is shown in Table 3.
Table 3. Pearson Product Moment correlations among voice quality judgments and F0
Parameter | Hoarseness | Harshness | Breathiness
F0 | -0.285 | -0.439 | 0.454
The trends obtained from the statistical analyses of the effects of changes in fundamental frequency on judgments of voice quality are:
b. Relationship between pitch period perturbation (jitter) and vocal quality
This group of vowel stimuli consisted of eight samples, each having a different period perturbation (jitter) level, as shown in Table 4, while the other parameters were maintained at normal levels (F0 = 125 Hz, shimmer = 1 %, glottal noise energy = 50 dB, amplitude of voicing = 60 dB, spectral tilt = 0 dB, and formant flutter = 0 %). The mean hoarseness, harshness and breathiness ratings from the eight listeners across the different jitter levels are shown in Table 4.
Table 4. Means and Standard Deviations of voice quality ratings at different jitter levels
Jitter Level | Hoarseness | Harshness | Breathiness
1 (0.00 %) | 0.25 (0.43) | 0.25 (0.43) | 0.13 (0.33)
2 (0.35 %) | 0.50 (0.50) | 0.63 (0.48) | 0.25 (0.43)
3 (0.51 %) | 1.13 (0.33) | 1.25 (0.43) | 0.25 (0.43)
4 (0.80 %) | 1.38 (0.48) | 1.50 (0.50) | 0.25 (0.43)
5 (1.02 %) | 2.00 (0.50) | 2.13 (0.60) | 0.38 (0.48)
6 (1.35 %) | 2.38 (0.48) | 2.25 (0.43) | 0.38 (0.48)
7 (1.79 %) | 2.63 (0.48) | 2.75 (0.43) | 0.50 (0.50)
8 (2.06 %) | 2.88 (0.33) | 3.00 (0.00) | 0.50 (0.50)
A one-way ANOVA was used to test whether voice quality ratings differed significantly across the eight synthesized jitter levels. For hoarse voice quality, significant differences were present among the eight jitter levels (p<=0.05), and a Tukey post hoc test of all pairwise differences in means indicated significant differences among most of the pairs. Significant differences in harshness were also present among the eight jitter levels (p<=0.05), and a Tukey post hoc test of all pairwise differences in means indicated significant differences among most of the pairs. However, for breathy voice quality, no significant differences were present among the eight jitter levels at the 0.05 level, and a Tukey post hoc test of all pairwise differences in means indicated no significant difference.
Bar displays of the hoarse, harsh, and breathy ratings at different jitter levels are shown in Figure 4.
Figure 4. Hoarse, harsh, and breathy ratings at different jitter levels for synthetic /ae/ tokens.
A series of Pearson Product Moment correlations were computed to show the strength and direction of the relationship among the manipulated jitter and the hoarseness, harshness and breathiness judgments. A correlation matrix showing the interrelationships among jitter and hoarseness, harshness and breathiness judgments is shown in Table 5.
Table 5. Pearson Product Moment correlations among voice quality judgments and jitter
Parameter | Hoarseness | Harshness | Breathiness
Jitter | 0.892 | 0.894 | 0.254
The conclusions that may be drawn from the analyses reported in this section are:
c. Relationship between amplitude perturbation (shimmer) and vocal quality
This group of /ae/ vowel stimuli consisted of eight samples, each having a different amplitude perturbation (shimmer) level, as shown in Table 6, while the other parameters were maintained at normal levels (F0 = 125 Hz, jitter = 0.3 %, glottal noise energy = 50 dB, amplitude of voicing = 60 dB, spectral tilt = 0 dB, and formant flutter = 0 %). The mean hoarseness, harshness and breathiness ratings from the eight listeners across the different shimmer levels are shown in Table 6.
Table 6. Means and Standard Deviations of voice quality ratings at different shimmer levels
Shimmer Level | Hoarseness | Harshness | Breathiness
1 (0.51 %) | 0.13 (0.33) | 0.13 (0.33) | 0.13 (0.33)
2 (0.77 %) | 0.25 (0.43) | 0.13 (0.33) | 0.13 (0.33)
3 (1.36 %) | 0.88 (0.33) | 0.50 (0.50) | 0.25 (0.43)
4 (1.92 %) | 1.75 (0.66) | 0.50 (0.50) | 0.25 (0.43)
5 (2.67 %) | 2.13 (0.60) | 0.63 (0.48) | 0.38 (0.48)
6 (3.28 %) | 2.75 (0.43) | 0.63 (0.48) | 0.63 (0.48)
7 (3.93 %) | 2.63 (0.48) | 0.63 (0.48) | 0.50 (0.50)
8 (4.84 %) | 3.00 (0.00) | 1.13 (0.60) | 0.63 (0.48)
A one-way ANOVA was used to determine whether voice quality ratings differed significantly across the eight synthesized shimmer levels. For hoarse voice quality, significant perceptual differences were present among the eight levels (p<=0.05), and a Tukey post hoc test of all pairwise differences in means indicated significant differences among most of the pairs. Harsh voice quality was significantly affected by the eight shimmer levels (p<=0.05), and a Tukey post hoc test of all pairwise differences in means indicated a significant difference between level 1 (shimmer = 0.51 %) and level 7 (shimmer = 3.93 %), and between level 1 (shimmer = 0.51 %) and level 8 (shimmer = 4.84 %). For breathy voice quality, no significant differences were apparent among the eight shimmer levels at the 0.05 level, and a Tukey post hoc test of all pairwise differences in means indicated no significant difference.
Bar displays of the hoarseness, harshness and breathiness ratings at different shimmer levels are shown in Figure 5.
Figure 5. Hoarseness, harshness and breathiness ratings at different shimmer levels for synthetic /ae/ tokens.
A series of Pearson Product Moment correlations were computed to show the strength and direction of the relationship among the manipulated shimmer and the hoarseness, harshness and breathiness judgments. A correlation matrix showing the interrelationships among shimmer and hoarseness, harshness and breathiness judgments is shown in Table 7.
Table 7. Pearson Product Moment correlations among voice quality judgments and shimmer
Parameter | Hoarseness | Harshness | Breathiness
Shimmer | 0.895 | 0.489 | 0.377
The major findings from the analysis in this section are summarized as follows:
d. Relationship between glottal noise energy (NNE) and vocal quality
This group of synthetic /ae/ stimuli consisted of eight samples, each having a different level of glottal noise energy (NNE), shown in Table 8, while other acoustic parameters were maintained at normal levels (F0 = 125 Hz, jitter = 0.3 %, shimmer = 1 %, spectral tilt = 0 dB, and formant flutter = 0 %). The mean hoarseness, harshness and breathiness ratings of eight listeners for the different NNE levels are shown in Table 8.
Table 8. Means and Standard Deviations of voice quality ratings at different NNE levels
NNE Level | Hoarseness | Harshness | Breathiness
1 (-23.63 dB) | 0.13 (0.33) | 0.00 (0.00) | 0.00 (0.00)
2 (-22.25 dB) | 0.25 (0.43) | 0.25 (0.43) | 0.25 (0.43)
3 (-20.25 dB) | 0.75 (0.43) | 0.13 (0.33) | 1.13 (0.60)
4 (-17.35 dB) | 1.38 (0.48) | 0.38 (0.48) | 1.63 (0.48)
5 (-13.96 dB) | 1.88 (0.33) | 0.25 (0.43) | 2.38 (0.48)
6 (-10.64 dB) | 2.25 (0.43) | 0.50 (0.50) | 2.38 (0.48)
7 (-7.33 dB) | 2.50 (0.50) | 0.75 (0.43) | 2.75 (0.43)
8 (-4.82 dB) | 2.88 (0.33) | 0.75 (0.43) | 3.00 (0.00)
A one-way ANOVA was used to determine whether hoarseness, harshness and breathiness ratings differed significantly across the eight NNE levels. Hoarseness and breathiness ratings were significantly different among the eight NNE levels (p<=0.05), and a Tukey post hoc test of all pairwise differences in means indicated significant differences among most of the pairs. Harshness ratings were significantly different among the eight NNE levels (p<=0.05), and a Tukey post hoc test of all pairwise differences in means indicated significant differences between level 1 (NNE = -23.63 dB) and level 7 (NNE = -7.33 dB), and between level 1 (NNE = -23.63 dB) and level 8 (NNE = -4.82 dB).
Bar displays of the hoarseness, harshness and breathiness ratings at different NNE levels are shown in Figure 6.
Figure 6. Hoarseness, harshness and breathiness ratings at different NNE levels for synthetic /ae/ tokens.
A series of Pearson Product Moment correlations were computed to show the strength and direction of the relationship among the manipulated NNE and the hoarseness, harshness and breathiness judgments. A correlation matrix showing the interrelationships among NNE and hoarseness, harshness and breathiness judgments is shown in Table 9.
Table 9. Pearson Product Moment correlations among voice quality judgments and NNE
Parameter | Hoarseness | Harshness | Breathiness
NNE | 0.913 | 0.493 | 0.906
The findings from the analyses in this section are summarized as follows:
e. Relationship between spectral tilt (ST) and vocal quality
This group of synthetic /ae/ stimuli consisted of three samples, each having a different spectral tilt (ST) level, ranging from 0 to 6 dB with a step size of 3 dB. Other acoustic parameters were maintained at normal levels (F0 = 125 Hz, jitter = 0.3 %, shimmer = 1 %, glottal noise = 50 dB, amplitude of voicing = 60 dB, and formant flutter = 0 %). The mean hoarseness, harshness and breathiness ratings from eight listeners at three spectral tilt levels are shown in Table 10.
Table 10. Means and Standard Deviations of voice quality ratings at different ST levels
Spectral Tilt Level | Hoarseness | Harshness | Breathiness
1 (0 dB) | 0.25 (0.43) | 0.13 (0.33) | 0.38 (0.48)
2 (3 dB) | 1.00 (0.00) | 0.38 (0.48) | 1.63 (0.48)
3 (6 dB) | 2.00 (0.50) | 0.63 (0.48) | 2.38 (0.48)
A one-way ANOVA was used to test whether voice quality ratings differed significantly across the simulated spectral tilt levels. For hoarse and breathy voice quality, significant differences were apparent among the three spectral tilt levels (p<=0.05), and a Tukey post hoc test of all pairwise differences in means indicated significant differences among all pairs. For harsh voice quality, no significant differences existed among the three spectral tilt levels at the 0.05 level, and a Tukey post hoc test of all pairwise differences in means indicated no significant difference.
Bar displays of the hoarseness, harshness and breathiness ratings at different spectral tilt levels are shown in Figure 7.

A series of Pearson Product Moment correlations were computed to show the strength and direction of the relationship among the manipulated spectral tilt and the hoarseness, harshness and breathiness judgments. A correlation matrix showing the interrelationships among spectral tilt and hoarseness, harshness and breathiness judgments is shown in Table 11.
Table 11. Pearson Product Moment correlations among voice quality judgments and ST
Parameter | Hoarseness | Harshness | Breathiness
Spectral Tilt | 0.880 | 0.422 | 0.854
The major findings in this section are:
f. Relationship between formant flutter (FF) and vocal quality
This group of vowel stimuli consisted of three samples, each having a different formant flutter (FF) level, ranging from 5 to 15 % with a step size of 5 %, while the other acoustic parameters were maintained at normal levels (F0 = 125 Hz, jitter = 0.3 %, shimmer = 1 %, glottal noise = 50 dB, amplitude of voicing = 60 dB, and spectral tilt = 0 dB). The mean hoarseness, harshness and breathiness ratings from the eight listeners at the different formant flutter levels are shown in Table 12.
Table 12. Means and Standard Deviations of voice quality ratings at different FF levels
Formant Flutter Level | Hoarseness | Harshness | Breathiness
1 (5 %) | 0.38 (0.48) | 0.50 (0.50) | 0.25 (0.43)
2 (10 %) | 0.50 (0.50) | 0.75 (0.66) | 0.50 (0.50)
3 (15 %) | 0.88 (0.78) | 1.13 (0.60) | 0.38 (0.48)
A one-way ANOVA was used to test whether the perceptual ratings of the voice qualities differed significantly across the three formant flutter levels. For hoarse, harsh and breathy voice quality ratings, no significant differences existed among the three formant flutter levels at the 0.05 level, and a Tukey post hoc test of all pairwise differences in means indicated no significant differences.
Scatter plots of hoarseness, harshness and breathiness ratings at different formant flutter levels are shown in Figure 8.

A series of Pearson Product Moment correlations were computed to show the strength and direction of the relationship between the manipulated formant flutter and the hoarseness, harshness and breathiness judgments. A correlation matrix showing the interrelationships among formant flutter and hoarseness, harshness and breathiness judgments is shown in Table 13.
Table 13. Pearson Product Moment correlations among voice quality judgments and FF
Parameter | Hoarseness | Harshness | Breathiness
Formant Flutter | 0.319 | 0.396 | 0.105
The major findings in this section may be summarized as follows:
Formant flutter does not appear to significantly influence the perception of hoarse, harsh, and breathy voice quality during production of these simulated /ae/ vowels.
g. Correlation between acoustic parameters and perceptual judgments
A series of Pearson Product Moment correlations were computed to show the strength and direction of the relationship between each of the acoustical parameters manipulated in this experiment and perceptual judgments of hoarseness, harshness and breathiness. A correlation matrix showing the interrelationships among all of the acoustical parameters (F0, jitter, shimmer, NNE, spectral tilt, and formant flutter) and perceptual judgments (hoarseness, harshness and breathiness) is shown in Table 14.
Table 14. Pearson Product Moment correlations among all of the acoustic parameters measured and perceptual judgments
Parameter | Hoarseness | Harshness | Breathiness
F0 | -0.285 | -0.439 | 0.454
Jitter | 0.892 | 0.894 | 0.254
Shimmer | 0.895 | 0.489 | 0.377
NNE | 0.913 | 0.493 | 0.906
Spectral Tilt | 0.880 | 0.422 | 0.854
Formant Flutter | 0.319 | 0.396 | 0.105
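The Pearson Product Moment correlation used throughout this section can be computed as in the sketch below. It is applied here to the eight NNE levels and the corresponding mean breathiness ratings from Table 8. Because it uses the eight tabulated means rather than the individual listener ratings, it is not expected to reproduce the coefficients reported above (which were presumably computed on individual ratings); it only illustrates the calculation and the direction of the relationship.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# NNE levels (dB) and mean breathiness ratings transcribed from Table 8.
nne_db = [-23.63, -22.25, -20.25, -17.35, -13.96, -10.64, -7.33, -4.82]
breathy_mean = [0.00, 0.25, 1.13, 1.63, 2.38, 2.38, 2.75, 3.00]

r_nne_breathy = pearson_r(nne_db, breathy_mean)
```

Correlating group means removes the listener-to-listener scatter, so coefficients obtained this way will generally be larger in magnitude than those computed from individual ratings.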
4. Conclusion
Regarding the interpretation of acoustic parameters, our tentative conclusions from this study are:
Tiger DRS Inc. 1998