Speech recognition on personal computers may soon become a common tool for those who want relief from inordinate amounts of typing or data entry. Perfect speech recognition is difficult to achieve, however, because speech varies from person to person. Developing these capabilities requires an understanding of human speech variance and of the methods computers use to recognize speech.
Introduction
Imagine that you have been typing all day long and your hands are aching. You still have more work to do, but you feel like you can’t type another word. Wouldn’t it be great if you could simply talk into a microphone and the computer would do all of the typing for you? With the help of speech recognition software, it is possible for your very own desktop computer to type exactly what you tell it to type. Engineers are currently working on making speech recognition available for widespread, everyday use (Fig. 1). To understand the engineering behind speech recognition technology we need to look at the background of human speech variance, why speech recognition is difficult for a computer, and how a computer actually recognizes speech. Finally, we will see how speech recognition can be used in our everyday lives.
Photo credit: Teber/SXC
Background on Human Speech Variance
Human speech varies quite a bit. Speech differs from location to location, and even the speech of a single person can vary from time to time. Phil Woodland, a member of the Cambridge University Engineering Department, divides the variability of human speech into these four categories:
- Intra-speaker
- Inter-speaker
- Speaking style
- Acoustic channel
Intra-speaker characteristics are internal factors that make a person speak differently from his or her normal speech. Emotions can play a large role in shifting speech away from its normal tone. A happy person’s voice sounds very different from a sad person’s. Someone who is angry will tend to speak louder and yell more than a person who is calm. The environment a person is in also matters: a more articulated form of speech is used in a business meeting than in casual conversation [1].
Inter-speaker characteristics are external factors between speakers that affect speech. For example, an accent is an inter-speaker characteristic. A word can sound very different when subjected to different accents. A person from England would pronounce “car” very differently than someone from Boston. The person from England may say “kar” with a slight accent while the person from Boston would say “kaa” without the “r” sound. A second example is a local dialect that people from a certain area use; consider how many people from Texas say “y’all” [1].
Speaking style is defined as the way a person speaks. Is the person reading from a book or speaking spontaneously? This factor can greatly change the way a voice sounds. A person may sound more monotonous when reading from a book than if he or she were speaking conversationally [1].
An acoustic channel is the medium through which speech is carried. For example, when talking on a phone, the voice of the person on the other end of the line may sound significantly different from his or her real voice. Background noise while a person is talking has a similar effect [1].
Why Speech Recognition is Hard for Computers
The main reason that speech recognition is difficult for computers is how much our speech varies. Some people speak rapidly, some have low voices, and others may have an accent. Deciphering one another’s speech is a routine and rather easy task for people, but it is much more difficult for computers because they cannot easily account for the variations in a person’s speech.
It is not possible for a computer to deal with all of these variations, so most speech recognition systems limit the way a person can speak. One limit may be to have the person pause between words so that two words do not run together. Another may be to disallow spontaneous speech, because it can contain too many variations to deal with. One such variation is drawing out the length of a word, as in saying “I waaaaaaaaant” instead of “I want” while deciding on something [1].
Another problem that a computer has to deal with is the fact that many words or phrases sound similar. For example, Professor Gerald Gazdar of the University of Sussex at Brighton points out that if a person says “an ice cream,” it could also be interpreted as “and nice cream” or “and nice scream” by a computer because all three phrases sound very similar [2]. A computer has a hard time telling what the person is actually trying to say because it cannot infer context as a human listener can. We can use the setting we are in or the topic of the conversation to help us decipher what someone is saying. A computer cannot do this because it is not able to think independently.
How Speech Recognition Works
To recognize speech, a computer needs information about our language. This information comes in three parts [1]:
- Hidden Markov Models (HMM)
- A pronunciation dictionary
- A language model
HMMs are acoustic models of individual speech sounds [1]. Every sound that we make, for example “th,” is made from even smaller sounds, which we will call states. The states can be put together to form different sounds. An HMM decides whether the uttered sound was a “th” by going through a series of states and then checking whether the states from beginning to end make up the sound “th” [1]. There is an exact set of states in an exact order that makes up the sound “th.”
An HMM will not just go from one state to any other random state because there are only a certain number of sounds in our language. Each state only branches off to other states that match up to other possible sounds. Letting states branch off to any other state would create sounds we wouldn’t understand such as “xq.” It is similar to baking a cake. There are different states in the baking process, such as mixing or baking the cake in the oven. There is a logical progression to each state, and you will only have a cake if you progress through each state in order.
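The idea that each state may only branch to certain other states can be sketched in a few lines of Python. This is a deliberately simplified illustration, not a full HMM: it assumes the sequence of states has already been identified, whereas a real recognizer must also score how well the audio matches each state. The state names (`th_1`, `th_2`, `th_3`) and all probabilities are invented for the example.

```python
# Hypothetical left-to-right model for the sound "th".
# Each state lists the states it may branch to, with a made-up probability.
# Any transition that is not listed is treated as impossible.
TH_TRANSITIONS = {
    "start": {"th_1": 1.0},
    "th_1": {"th_1": 0.6, "th_2": 0.4},   # stay in th_1 or move on
    "th_2": {"th_2": 0.5, "th_3": 0.5},
    "th_3": {"th_3": 0.7, "end": 0.3},
}


def path_probability(transitions, path):
    """Multiply the transition probabilities along a path of states."""
    prob = 1.0
    for current, nxt in zip(path, path[1:]):
        # A missing entry means the branch is not allowed, so probability 0.
        prob *= transitions.get(current, {}).get(nxt, 0.0)
    return prob


def matches_sound(transitions, states):
    """Return True if the states run from start to end using only allowed branches."""
    path = ["start", *states, "end"]
    return path_probability(transitions, path) > 0.0


if __name__ == "__main__":
    print(matches_sound(TH_TRANSITIONS, ["th_1", "th_2", "th_3"]))  # states in order
    print(matches_sound(TH_TRANSITIONS, ["th_1", "th_3"]))          # skips a state
```

As with the cake, skipping a step or taking the steps out of order produces no valid path, so the sequence is rejected.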
The next piece of information the computer needs is a pronunciation dictionary that has all of the words the computer can recognize represented by the HMMs that make up those words [1]. For example, the word “the” would be made up of the HMMs “th” and “ax.” When the computer detects the “th” sound, it looks in the pronunciation dictionary to see if there is a single word made up by the “th” sound.
The computer finds that there are no words made up by the single “th” sound, so it waits for another sound to complete the word. The next sound that comes in is “ax,” so the computer goes back into the pronunciation dictionary and sees that “th” and “ax” make the word “the.” Think of this as a person looking up a word in a normal dictionary by sound instead of by alphabetical order. The final thing the computer needs is a language model, which gives the computer the probability of different word sequences [1].
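As a rough illustration of the dictionary lookup just described, the sketch below stores each word under the sequence of sounds that makes it up and checks the dictionary again each time a new sound arrives. The dictionary contents and the function name `recognize_word` are assumptions made for this example; a real pronunciation dictionary is far larger and must also handle words that share a beginning.

```python
# Hypothetical pronunciation dictionary: sequence of sounds -> word.
# The sound labels follow the examples in the text ("th", "ax", "ih", "s").
PRONUNCIATIONS = {
    ("th", "ax"): "the",
    ("th", "ih", "s"): "this",
}


def recognize_word(sounds, dictionary=PRONUNCIATIONS):
    """Take sounds one at a time and return the first word they spell.

    Returns a (word, sounds_used) pair; word is None if no entry matched.
    """
    heard = []
    for sound in sounds:
        heard.append(sound)
        # Look up everything heard so far, exactly as the text describes.
        word = dictionary.get(tuple(heard))
        if word is not None:
            return word, len(heard)
    return None, len(heard)


if __name__ == "__main__":
    print(recognize_word(["th"]))        # no word yet, keep waiting
    print(recognize_word(["th", "ax"]))  # now the sounds spell a word
```

After the single sound “th” the lookup finds nothing, so the function keeps listening; once “ax” arrives, the pair matches the entry for “the.”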
The language model most frequently used is called an N-gram model. The N-gram model calculates the probability of the next word by using the knowledge of the previous N-1 words [1]. An example that shows how the same sounds can make different sequences can be seen in the earlier example from Professor Gazdar. The phrases “an ice cream,” “and nice cream,” and “and nice scream” will each have a different probability of occurring. Then, if the next word detected is “cone,” the computer will try “cone” with the three previous phrases and recognize that “an ice cream cone” has the highest probability of occurring.
Once the computer has all of this data, it is ready to start decoding speech. The first thing the computer does is match up all of the sounds with the acoustic models already stored. If someone said, “this is an ice cream cone,” all of the different sounds, “th,” “ih,” “s,” and so on, would be detected by the computer. It then goes into the pronunciation dictionary and finds which of the assembled sounds form words that are in the dictionary. Three phrases would be constructed after checking the pronunciation dictionary: “this is an ice cream cone,” “this is and nice cream cone,” and “this is and nice scream cone.” Finally, the computer takes all of the words that were found in the dictionary and uses the language model to see which words actually form a sequence and which sequence has the highest probability of occurring [1]. The sequence that would have the highest probability of occurring is “this is an ice cream cone.”
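The final scoring step can be sketched with the simplest useful case, N = 2 (a bigram model), in which each word's probability depends only on the word before it. Every probability below is invented for illustration, and the names `BIGRAM_PROB`, `UNSEEN_PROB`, and `most_likely` are hypothetical; a real language model would estimate its values from a large body of text.

```python
import math

# Hypothetical bigram probabilities P(word | previous word). Invented numbers.
BIGRAM_PROB = {
    ("this", "is"): 0.2,
    ("is", "an"): 0.05,
    ("is", "and"): 0.001,
    ("an", "ice"): 0.01,
    ("and", "nice"): 0.002,
    ("ice", "cream"): 0.3,
    ("nice", "cream"): 0.001,
    ("nice", "scream"): 0.0005,
    ("cream", "cone"): 0.05,
}
# Small fallback probability for word pairs the model has never seen.
UNSEEN_PROB = 1e-6

CANDIDATES = [
    "this is an ice cream cone",
    "this is and nice cream cone",
    "this is and nice scream cone",
]


def sequence_log_prob(words):
    """Sum the log-probabilities of each word given the one before it."""
    total = 0.0
    for prev, word in zip(words, words[1:]):
        total += math.log(BIGRAM_PROB.get((prev, word), UNSEEN_PROB))
    return total


def most_likely(candidates):
    """Pick the candidate phrase with the highest bigram score."""
    return max(candidates, key=lambda phrase: sequence_log_prob(phrase.split()))


if __name__ == "__main__":
    for phrase in CANDIDATES:
        print(f"{sequence_log_prob(phrase.split()):8.2f}  {phrase}")
    print("best:", most_likely(CANDIDATES))
```

Logarithms are summed rather than multiplying raw probabilities so that long sentences do not shrink the score toward zero. With these made-up numbers, “this is an ice cream cone” scores highest, matching the outcome described above.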
Everyday Applications
One application of speech recognition in wide use right now is dictation. Dictation allows users to enter data without having to type. All the user does is speak into a microphone and the words appear on the screen as if they had been typed. This is useful for individuals with carpal tunnel syndrome. These individuals can still be productive with a computer without having to endure the pain of typing. Dictation is also very useful for the visually impaired, because speaking into a microphone may be much easier than typing on a keyboard.
An interesting application for speech recognition is using it over a telephone system. This is where the acoustic channel discussed earlier can come into play. Nobuo Hataoka, Toshiyuki Odaka, and Akio Amano, all of the Central Research Lab of Hitachi Ltd., created an automated telephone operator system. The system they implemented for testing could recognize about 200 words. The system could recognize a phrase such as, “Please call Mr. Sato of Intelligent Systems Department” [3].
This is much more convenient than the normal automated operator systems that make a person push a button at each option menu. If this system were extended to recognize a larger number of words, it could be very useful. There would be 24-hour telephone operator service without the use of any human operators.
Speech recognition capabilities might also be built into devices you use daily so you could actually tell them what you want them to do. You could have a VCR that would be voice-programmable [4]. There would be no more need to fiddle with all the buttons on the remote control. The same thing could be done for a television set. You would never again have to search for the remote control.
References
- [1] P. Woodland. “Speech recognition”. Speech and Language Engineering – State of the Art 499, IEEE Colloquium on (1998): 2/1-2/5. On-line. 19 Nov 1998.
- [2] Gerald Gazdar. “Speech recognition. Lecture 7: Hidden Markov models.” University of Sussex. 25 March 1999. <http://www.informatics.susx.ac.uk/research/groups/nlp/gazdar/teach/nlp/nlpnode47.html#SECTION00027300000000000000>.
- [3] N. Hataoka, T. Odaka, and A. Amano. “Speech recognition system for automatic telephone operator based on CSS architecture” in Interactive Voice Technology for Telecommunications Applications (September 1994), pp. 77-80.
- [4] Muhammad Nawaz. “Real-World Speech Recognition Applications.” Suite101. Internet: http://www.suite101.com/article.cfm/artificial_intelligence/8613 [7 July 1998].