The Inner Workings of Speech Recognition

Seth Capistron was a junior Computer Science major at the time of publication. He enjoys golf, going to concerts, and spending time with friends.

With research focusing on different ways for people to interact with computers, speech recognition is emerging as a very important technology. Whether it is a voice-controlled navigation system in a car or a voice-controlled system over the phone, speech recognition is bound to play a larger role in society. Various theories of human speech, together with techniques from the computational processing of language, are currently being applied to this emerging technology.

Introduction

When most people think of interacting with a computer, they think of a mouse and a keyboard. However, researchers are constantly thinking up new ways for users to interact with computers. For example, imagine being able to think about what you want your computer to do and then having your computer actually do it. Some of these incredible user interfaces are already in use today. Fifteen years ago, if you had told someone that you would be able to get in your car, talk to a computer, and have it give you detailed directions, most people would not have believed you. This type of interaction with computers was made possible by the engineering behind speech recognition.

Speech recognition allows users to speak to a computer (see Fig. 1) and have their words interpreted as instructions or text. Although speech recognition has been around since virtually the beginning of the computing era, only recently has its quality become reliable enough to justify wide-scale use. It is only through the developments made by engineers that this technology has gone from science fiction to a tool of the present.

Alpha/Moto Q9h

Figure 1: Speech recognition software permits users to interact with technology, including computers.

Inherent Difficulties

Since speech is commonplace to human interaction, it may at first be hard to see why a computer would have difficulty interpreting it. The skill of being able to interpret what someone says with amazing accuracy is often taken for granted. Another skill that humans undervalue is the ability to understand sloppy speech.

When we talk casually with our friends or coworkers, our pronunciation often loses some of its precision. Words or phrases get strung together, or syllables get dropped [1]. This does not present a problem for humans because we are used to hearing it on a daily basis. If someone runs the two words “did you” together, we don’t think twice about what that person was trying to say. The same, however, cannot be said for a computer trying to interpret someone’s speech. When a phrase such as “did you” gets strung together, it is reasonable to assume the computer would see it as one word. Yet this problem can be corrected rather easily by requiring the user to better enunciate their words.

Another issue in speech recognition is the existence of multiple users. Every day humans interact with dozens of different speakers. Whether it is the anchorperson on the morning news, the DJ on the radio, or the boss at the office, people find it relatively easy to understand speech from multiple sources. Yet sometimes it is not as easy. People can have trouble understanding a speaker with an unfamiliar accent. Even though the person may be speaking English, it is still difficult to understand what they are saying because they pronounce words much differently than we are used to hearing. Computers experience similar problems, but it has been found that they are much more sensitive than humans. Researchers have shown that speaker-dependent speech recognition systems will have three to five times fewer errors than speaker-independent systems [2]. This shows just how sensitive most speech recognition systems are to who is speaking (see Fig. 2).

Alpha/Moto Q9h

Figure 2: Some phones have speech recognition software built-in.

Another aspect speech recognition systems are very sensitive to is background noise. As an example, imagine you are in your car talking to your voice-controlled navigation system. A passenger sitting next to you can easily determine what you are saying from the many noises that they actually hear. They are able to filter out the kids in the back seat, the honking of horns, and the various noises of the road. This, however, isn’t a given for the speech recognition feature in your car’s navigation system. Early on in the development of speech recognition systems, environmental noise had to be kept to a bare minimum to avoid confusion. While engineering has vastly reduced the sensitivity of these systems, background noise is still an issue that prevents speech recognition systems from being deployed in certain situations. For example, it is hard to imagine every single person in a crowded office space using speech recognition to interact with his or her computer.

The final set of difficulties with speech recognition systems is vocabulary and grammar. The vocabulary of a speech recognition system is the range of words the system recognizes, while the grammar is the order of the words that it recognizes [2]. While many people say they do not have very large vocabularies, they are usually underestimating themselves. The sheer number of words that humans can hear and instantly recognize is amazing. It is also amazing that we can understand everything someone says, even though they may be speaking quickly or using complex sentence structures.

This is the same thing that engineers have been working very hard to replicate. Even the most advanced speech recognition systems in the world have limited vocabulary and grammar. However, the trick to creating a speech recognition system that can both recognize a user’s input and be fast enough to keep up with the user’s speech is to limit the range of possible inputs.

Limiting Methods

One tactic to improve the accuracy of speech recognition systems is to limit the possible input from the user. Building a speech recognition system that can accept any spoken input at any time is much more difficult than building a system that uses intelligence to limit the range of possible inputs. For many years engineers have worked to increase the intelligence of speech recognition systems by building in the same kind of logic that humans apply without noticing.

Word Transition Probabilities involve figuring out what word is most likely to come after a given word [1]. One method is to derive a matrix that contains words that are likely to occur after a given word as well as the probability that each word will occur. The speech recognition system can use these matrices to limit the search space of possible inputs. Rather than being forced to search the entire vocabulary to decide what the user has said, it can just search the matrix of words that will come after the word before it. This idea can also be extended out from just words to categories. Words can be categorized and the probabilities that a category will appear after another category are stored in a matrix. This also serves to limit the search space of the vocabulary.

Using the example of a voice-controlled navigation system, we can see how word transition probabilities would be implemented in the real world. For example, if a user wanted to change their destination, they might say, “Change destination.” Once the computer has recognized the word “change”, it can limit the search space for the next word. The computer knows that the next word it hears will not be a street or city name. Nor will the next word be the name of a restaurant. The computer knows that after someone says “change” they will probably say something like “display”, “route”, or “destination”. By limiting the number of probable inputs, the system is able to process words fast enough to keep up with the input from the driver. The next, more popular method is to limit the grammar allowed.
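The word transition idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table `TRANSITIONS`, its words, its probabilities, and the function names are all hypothetical, and the acoustic scores stand in for whatever the recognizer's pattern matcher would produce.

```python
# Hypothetical transition table: P(next word | previous word).
# The vocabulary and the probabilities are invented for illustration.
TRANSITIONS = {
    "change": {"destination": 0.6, "route": 0.3, "display": 0.1},
    "show": {"map": 0.7, "route": 0.3},
}


def candidates(previous_word, table=TRANSITIONS):
    """Return the words allowed after previous_word, most probable first."""
    row = table.get(previous_word, {})
    return sorted(row, key=row.get, reverse=True)


def pick_word(previous_word, acoustic_scores, table=TRANSITIONS):
    """Choose the next word by searching only the allowed successors.

    acoustic_scores maps a word to how well it matched the audio (assumed
    to be supplied by some other part of the recognizer). Words that are
    not in the row for previous_word are never considered.
    """
    best_word, best_score = None, 0.0
    for word, prior in table.get(previous_word, {}).items():
        score = prior * acoustic_scores.get(word, 0.0)
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```

For instance, after “change” the sketch searches three words instead of the whole vocabulary, so a similar-sounding word that is not a valid successor (such as "nation" in place of "destination") is ignored even if its acoustic score is high.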

A speech recognition program with a strict grammar will only recognize well-formed sentences, while a program with loose grammar will recognize a wider variety. The problem with a wider-variety grammar is that the accuracy of the system tends to suffer. The principle behind grammar limiting is basically the same as word transition probabilities, except the rules for narrowing the search space are different. Grammar limiting uses a complex set of rules rooted in the English language to decide which words are possible, which are not possible, and which words are most likely to appear. This method is more often used in systems that are designed to recognize full spoken language, as opposed to specific commands from a user. Regardless of the system, however, there are a few basic components to the speech recognition system.
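As a rough illustration of a strict grammar, the sketch below uses a toy command grammar rather than the complex English-based rules described above. The categories, rules, and function names are hypothetical and chosen only to show how a grammar narrows the set of words that can come next.

```python
# Hypothetical word categories and sentence rules for a toy command grammar.
CATEGORIES = {
    "ACTION": {"change", "show", "cancel"},
    "TARGET": {"destination", "route", "display"},
    "NUMBER": {"one", "two", "three"},
}
RULES = [
    ("ACTION", "TARGET"),
    ("ACTION", "TARGET", "NUMBER"),
]


def _matches(words, rule):
    """True if each word belongs to the category at the same position."""
    return all(word in CATEGORIES[cat] for word, cat in zip(words, rule))


def allowed_next(prefix):
    """Words the grammar permits after the words recognized so far."""
    words = set()
    for rule in RULES:
        if len(rule) > len(prefix) and _matches(prefix, rule):
            words |= CATEGORIES[rule[len(prefix)]]
    return words


def is_well_formed(sentence):
    """True if the whole sentence fits one of the rules exactly."""
    words = sentence.lower().split()
    return any(len(rule) == len(words) and _matches(words, rule) for rule in RULES)
```

A looser grammar would correspond to adding more rules or larger categories, which widens what `allowed_next` returns and gives the recognizer more chances to pick the wrong word.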

Components

While the implementation of speech recognition systems can be dramatically different, the components needed are essentially the same. The first item, the speech capture device, is simply whatever is used to transform spoken word into a digital signal. This usually includes a microphone and an analog-to-digital converter, which is used to transform the sound waves into the raw 0’s and 1’s that will be processed by the computer. The next component needed is a digital signal processing module, or DSP. A DSP performs endpoint detection in order to find where words or utterances begin and end [2]. It is also used to separate speech from non-speech.
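Endpoint detection can be illustrated with a simple energy threshold. This is a minimal sketch, not how any particular DSP module works: it assumes the samples are floating-point values between -1 and 1, and the frame size, the threshold, and the function names are hypothetical choices.

```python
def frame_energies(samples, frame_size):
    """Average squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_size]) / frame_size
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]


def find_endpoints(samples, frame_size=4, threshold=0.01):
    """Return (start, end) sample indices of the detected utterance.

    A frame counts as speech when its energy exceeds the threshold.
    The utterance runs from the first such frame to the end of the
    last one. Returns None if no frame is loud enough.
    """
    energies = frame_energies(samples, frame_size)
    loud = [i for i, energy in enumerate(energies) if energy > threshold]
    if not loud:
        return None
    return loud[0] * frame_size, (loud[-1] + 1) * frame_size
```

A fixed threshold like this would be easily fooled by the background noise discussed earlier, which is one reason the sketch should be read as an illustration of the idea only.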

The third component, the preprocessed signal storage unit, simply holds the already processed sound in a buffer until the pattern-matching algorithm is ready to look at it [2]. The fourth component, which also connects to the pattern-matching algorithm, is the bank of reference speech patterns. This is what the pattern-matching algorithm will compare the received speech to in order to find a match.

The fifth and last component is the pattern-matching algorithm itself. This is the actual logic that takes the processed sound and decides which word was spoken. While many methods can be used for the pattern-matching algorithm, two are most popular. The first of the two is template matching [2].

Template Matching

Template matching is a relatively simple process that can be understood as follows. Imagine you have two line graphs: one is the input from a user and the second is the graph of a known word. In an ideal world, you would simply compare the graph from the user input to the graph of the known word. If the two graphs are the same, you have found the word the user submitted; if they are different then keep looking. We, however, do not live in a perfect world. If the speaker says the word faster or slower than our model, we will be unable to detect the match. Furthermore, if the speaker places an emphasis on a different part of the word, we will be unable to detect the match. The solution is to use a method called dynamic time warping.

If you imagine our inputted line graph as a series of data points, then you can see how it would be possible to shift these data points to the left and right. By shifting the points as needed, we can adjust for any timing difference between our input and our model. Shifting the data points will also help us to make our inputted graph more closely resemble the model graph. Then, we will take the difference of the inputted data points and the model data points. Averaging these differences will create a rank for the model word. This rank essentially tells us how close our inputted word is to our model word. By repeating this process on other model words and comparing their rank, we can see which word the input is most likely to be. While this method is simple and works well with small vocabularies, it doesn’t always prove to be accurate. For most modern speech recognition programs, hidden Markov models are used.
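The matching procedure can be sketched with the standard dynamic-programming formulation of dynamic time warping, which is one common way to realize the shifting described above. Two simplifications are assumed: each graph is a plain list of numbers, and the rank is the total accumulated difference along the best alignment rather than an average. The function names and the example templates are hypothetical.

```python
def dtw_distance(a, b):
    """Smallest total point-to-point difference over all time alignments."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best total difference aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diff = abs(a[i - 1] - b[j - 1])
            # A point may be stretched (repeated) in either sequence.
            cost[i][j] = diff + min(
                cost[i - 1][j],      # a advances, b[j-1] is reused
                cost[i][j - 1],      # b advances, a[i-1] is reused
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


def best_match(signal, templates):
    """Return the template word whose graph is closest to the signal."""
    return min(templates, key=lambda word: dtw_distance(signal, templates[word]))
```

With templates such as `{"yes": [0, 1, 2, 1, 0], "no": [2, 2, 0, 0, 2]}`, an input that traces the "yes" shape at half speed still has a distance of zero to the "yes" template, which is exactly the timing tolerance that a direct point-by-point comparison lacks.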

Hidden Markov Models

Hidden Markov models are state machines that compute the probabilities of producing sounds from state to state [2]. For example, let’s say that we have a Markov model with three states. Each state has transitions to the other states and is also able to output a value. A state can output multiple values, and there are probabilities associated with how often each state will output a given value. As a result, it is possible for the same output to be created by completely different paths through the state machine. This is similar to how there are many slightly different ways to say the same word, but in the end the output is still the same. It is this parallel that engineers have taken advantage of to make hidden Markov models work well for speech recognition.

The inputted speech will be matched up with HMMs and each HMM will be assigned a probability that it could have output the given speech. The HMM with the highest probability will be the HMM that represents the word that was given as input. A slightly different solution using HMMs is this: rather than computing the probability a given HMM would output the speech, compute the probability that a given series of state transitions would have produced the speech. Whichever method is used, the idea is still the same. The main advantage of HMMs is that even though there can be slight variations in the way a user says a word, the algorithm can still find the right word. Despite the use of highly advanced algorithms and limiting techniques, speech recognition is still far from being perfect [3, 4].
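The first approach, scoring each HMM by the probability that it could have output the speech, can be sketched with the forward algorithm. Everything concrete here is an assumption made for illustration: the two-state models, their probabilities, and the single-letter symbols standing in for quantized sounds are hypothetical, as are the names `forward_probability` and `recognize`.

```python
def forward_probability(observations, start, trans, emit):
    """P(observations | model), summed over every path through the states."""
    states = range(len(start))
    # alpha[s]: probability of the observations so far, ending in state s.
    alpha = [start[s] * emit[s].get(observations[0], 0.0) for s in states]
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in states) * emit[s].get(symbol, 0.0)
            for s in states
        ]
    return sum(alpha)


# Hypothetical word models: (start probabilities, transitions, emissions).
LEFT_TO_RIGHT = [[0.5, 0.5], [0.0, 1.0]]
MODELS = {
    "yes": ([1.0, 0.0], LEFT_TO_RIGHT,
            [{"y": 0.9, "s": 0.1}, {"y": 0.1, "s": 0.9}]),
    "no": ([1.0, 0.0], LEFT_TO_RIGHT,
           [{"n": 0.9, "o": 0.1}, {"n": 0.1, "o": 0.9}]),
}


def recognize(observations, models=MODELS):
    """Return the word whose model most probably produced the observations."""
    return max(models, key=lambda w: forward_probability(observations, *models[w]))
```

Replacing the `sum` over previous states with `max` scores only the single best series of state transitions, which corresponds to the second approach mentioned above and is usually called the Viterbi algorithm.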

Current Applications

The dream of being able to talk to our computers and tell them exactly what to do is far from being a reality. There are many problems that result from the fact that the technology we have today is simply not advanced enough. There are other problems, however, that seem to be unavoidable. Sometimes it is impossible to limit background noise, such as in an office where many other people may be talking to their computers. Other situations include noisy, outdoor work environments where it may be incredibly hard for even another person to understand what someone is saying. Finally, people find it more tiring and cumbersome to dictate a long essay than to simply type it. Yet, despite all these problems, there are many cases where speech recognition is very advantageous.

We have all, at one time or another, called a large company and been greeted by an automated system, to which we responded by voice. These systems are able to understand so many different users because they have a very limited vocabulary. A large amount of research is also going into creating voice-controlled computers and appliances for people with disabilities. Mobile applications are another area where speech is sometimes easier for users than a typed response. Speech recognition is just one way engineers are working to change the way we interact with technology on a daily basis [3, 4].

References

- [1] S.R. Young, A.G. Hauptmann, et al. “High Level Knowledge Sources in Usable Speech Recognition Systems.” Communications of the ACM, vol. 32, no. 2, pp. 183-194, 1989.
- [2] R.D. Peacocke and D.H. Graf. “An Introduction to Speech and Speaker Recognition.” Computer, vol. 23, no. 8, pp. 26-33, 1990.
- [3] B.H. Juang and L.R. Rabiner. “Hidden Markov Models for Speech Recognition.” Technometrics, vol. 33, no. 3, pp. 251-271, 1991.
- [4] D. O’Shaughnessy. “Interacting With Computers by Voice: Automatic Speech Recognition and Synthesis.” Proceedings of the IEEE, vol. 91, no. 9, pp. 1272-1300, 2003.