About this Article
Written by: KeAloha Over
Written on: September 1st, 2000
Tags: communication, computer science
Thumbnail by: Teber/SXC
About the Author
KeAloha Over was a USC engineering student in 2000.

Talking to Your Computer (Volume I, Issue I)
Speech recognition on personal computers may soon become a common application for those who want relief from inordinate amounts of typing or data entry. Perfect speech recognition is difficult to achieve, however, because speech varies so much from person to person. Developing these capabilities requires an understanding of human speech variance and of the methods computers use to recognize speech.

Introduction

Imagine that you have been typing all day long and your hands are aching. You still have more work to do, but you feel like you can't type another word. Wouldn't it be great if you could simply talk into a microphone and the computer would do all of the typing for you? With the help of speech recognition software, it is possible for your very own desktop computer to type exactly what you tell it to type. Engineers are currently working on making speech recognition available for widespread, everyday use (Fig. 1). To understand the engineering behind speech recognition technology we need to look at the background of human speech variance, why speech recognition is difficult for a computer, and how a computer actually recognizes speech. Finally, we will see how speech recognition can be used in our everyday lives.
Teber/SXC
Figure 1: Speech recognition may soon become a common application for household computers.

Background on Human Speech Variance

Human speech varies quite a bit. Speech differs from location to location, and even the speech of a single person can vary from time to time. Phil Woodland, a member of the Cambridge University Engineering Department, divides the variability of human speech into these four categories:
  • Intra-speaker
  • Inter-speaker
  • Speaking style
  • Acoustic channel
Intra-speaker characteristics are the factors within a single person that make his or her speech differ from its usual form. Emotions can play a large role in shifting speech away from its normal tones. A happy person's voice sounds very different from a sad person's. Someone who is angry will tend to speak louder and yell more than a person who is calm. The setting a person is in also matters: speech in a business meeting is more carefully articulated than speech in casual conversation [1].
Inter-speaker characteristics are the differences between one speaker and another that affect speech. An accent, for example, is an inter-speaker characteristic. A word can sound very different when spoken in different accents. A person from England would pronounce "car" very differently from someone from Boston. The person from England may say "kar" with a slight accent, while the person from Boston would say "kaa" without the "r" sound. A second example is the local dialect that people from a certain area use; consider how many people from Texas say "y'all" [1].
Speaking style is the way a person speaks. Is the person reading from a book or speaking spontaneously? This factor can greatly change the way a voice sounds. A person may sound more monotonous when reading from a book than when speaking conversationally [1].
An acoustic channel is the medium through which speech is carried. For example, when talking on a phone, the voice of the person on the other end of the line may sound significantly different from his or her real voice. The same thing happens if there is background noise while a person is talking [1].

Why Speech Recognition is Hard for Computers

The main reason speech recognition is difficult for computers is that our speech varies so much. Some people speak rapidly, some have low voices, and others have an accent. Deciphering one another's speech is a routine and rather easy task for people, but it is much more difficult for computers because they cannot readily account for the variations in a person's speech.
It is not possible for a computer to deal with all of these variations, so most speech recognition systems limit the way a person can speak. One limit may be to have the person pause between each word so that two words do not run together. Another limit may be to not allow the person to speak spontaneously, because there can be too many variations to deal with. For example, a person speaking spontaneously may draw out the length of a word, saying "I waaaaaaaaant" instead of "I want" while deciding on something [1].
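To see why a pause between words helps, consider the following minimal Python sketch. It is an illustration only, not the method used by any particular system: it assumes the audio is a list of amplitude values, measures the average energy in short frames, and treats a long enough run of quiet frames as a word boundary. The function name `split_on_pauses` and the values for `frame_len`, `threshold`, and `min_pause_frames` are assumptions chosen for the example.

```python
def split_on_pauses(samples, frame_len=4, threshold=0.1, min_pause_frames=2):
    """Return (start, end) sample index pairs for stretches of speech
    that are separated by sufficiently long pauses."""
    # Average energy of each frame (hypothetical, very short frames).
    energies = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energies.append(sum(x * x for x in frame) / len(frame))

    segments = []
    start = None   # frame index where the current word began
    quiet = 0      # consecutive quiet frames seen since the last loud frame
    for idx, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = idx
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_pause_frames:
                end = idx - quiet + 1  # first quiet frame after the word
                segments.append((start * frame_len, end * frame_len))
                start = None
                quiet = 0

    # Close a word that is still open when the audio ends.
    if start is not None:
        end = len(energies) - quiet
        segments.append((start * frame_len, min(end * frame_len, len(samples))))
    return segments


# Two "words" with a clear pause between them are found separately.
paused = [0.0] * 8 + [1.0] * 8 + [0.0] * 8 + [1.0] * 8 + [0.0] * 8
print(split_on_pauses(paused))

# The same two "words" with only a brief gap are merged into one stretch.
rushed = [1.0] * 8 + [0.0] * 4 + [1.0] * 8
print(split_on_pauses(rushed))
```

In the second case the gap is shorter than the required pause, so the sketch cannot tell where one word ends and the next begins, which is the situation the pause requirement is meant to avoid.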
Another problem that a computer has to deal with is the fact that many words or phrases sound similar. For example, Professor Gerald Gazdar of the University of Sussex at Brighton points out that if a person says "an ice cream," a computer could also interpret it as "and nice cream" or "and nice scream" because all three phrases sound very similar [2]. A computer has a hard time telling what the person is actually trying to say because, unlike a human listener, it cannot infer context. We can use the setting we are in or the topic of the conversation to help us decipher what someone is saying. A computer cannot do this because it is not able to think independently.
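The ambiguity can be illustrated with a small Python sketch. It assumes a toy pronunciation dictionary with simplified sound symbols (both are hypothetical and chosen only for this example) and lists every way a stream of sounds with no word boundaries can be divided into words. It does not model slurred or dropped sounds, so the alternatives it finds are close to, but not identical with, the phrases quoted above.

```python
# Toy pronunciation dictionary (hypothetical; sound symbols are simplified).
LEXICON = {
    "a": ("AH",),
    "an": ("AH", "N"),
    "I": ("AY",),
    "ice": ("AY", "S"),
    "nice": ("N", "AY", "S"),
    "cream": ("K", "R", "IY", "M"),
    "scream": ("S", "K", "R", "IY", "M"),
}


def segmentations(sounds):
    """Return every way to split a sequence of sounds into dictionary words."""
    sounds = tuple(sounds)
    if not sounds:
        return [[]]  # one way to split nothing: no words
    results = []
    for word, pronunciation in LEXICON.items():
        # Try each word whose pronunciation matches the start of the stream.
        if sounds[:len(pronunciation)] == pronunciation:
            for rest in segmentations(sounds[len(pronunciation):]):
                results.append([word] + rest)
    return results


# The sounds of "an ice cream" with the word boundaries removed.
heard = "AH N AY S K R IY M".split()
for words in segmentations(heard):
    print(" ".join(words))
```

The sketch finds more than one valid reading for the same sounds, and nothing in the sounds alone says which one the speaker meant. Choosing among them requires the kind of context a human listener uses.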