What’s wrong with speech recognition?

Although speech recognition keeps getting better, it still makes mistakes that annoy us, and far more of them than a human listener would make in a similar situation. We have been trying to improve it for decades. So what’s wrong with it?

Scientists are constantly testing and trying to improve speech recognition in adverse conditions: when the ambient noise is high (as in a busy restaurant or at a cocktail party), when the speech is heavily accented (for instance, when a non-native speaker is talking), when the speaker is under stress (like a fighter pilot during flight), or in the presence of severe distortions (such as a bad telephone connection or a highly reverberant room). In all these situations human listeners do pretty well, even if not perfectly, while speech recognition systems degrade badly. And sometimes even the best recognizer, in the best possible conditions, mistakes one word for another in a quite surprising manner. No matter what we do, computer speech recognition is still far from human performance.

One of the common notions, especially in the commercial and practitioner’s world, is that if we had more speech data to learn from, and faster computers, the difference between machine and human performance would go away. But that does not seem to be the case. Using substantial amounts of data (thousands of hours of speech) in conjunction with faster and faster computers produces only modest improvements, which seem to be getting closer and closer to the flat zone of diminishing returns. So, what’s wrong with speech recognition?

Recently, at the International Computer Science Institute in Berkeley, under the auspices of IARPA and AFRL, we started a project called OUCH (as in Outing Unfortunate Characteristics of Hidden Markov Models) with the goal of taking a stab at what’s wrong with current speech recognition and why, no matter what we do, we can’t get as close to human performance as we would like. Nelson Morgan is leading the project and working with Steven Wegmann and Jordan Cohen.

One of the issues has to do with the underlying model. All current speech recognizers are based on a statistical/mathematical model (known as the Hidden Markov Model, or HMM) that was introduced in the mid-1970s. In order to be workable and to avoid unbearable complexity, the model makes strong assumptions that do not hold in practice. For instance, one of the strongest assumptions is that consecutive spectral snapshots of speech (we call them “frames”) are statistically independent of each other. We know very well that this is not true, but without pretending that it is, the math of HMMs would become too complex to handle, and we would never have enough data to estimate their parameters effectively.

In science we often make simplifying assumptions just to be able to handle things, even if we know that the assumptions are wrong, hoping that the inaccuracies they cause are small enough to be neglected. This is the case with the Newtonian physics of masses and gravitation, which works for small-scale objects but turns out to be wrong when we talk about masses the size of planets, stars, and galaxies. While we can predict the motion of a tennis ball with fairly high accuracy, we will never be able to predict and explain events at the galactic level using Newton’s equations. No matter what, we would need a different model to account for that scale. Similarly, while we are able to recognize speech with decent accuracy using HMMs in good acoustic conditions, if we want to reach human-comparable performance in adverse situations, we will probably need a different model. Which model, we don’t know yet, but we do know that HMMs as we know them may be inadequate, especially for human-like accuracy. Indeed, HMMs have served us well until now, but we need something different to be able to move ahead.
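For readers who like to see things in code, here is a minimal sketch of where the independence assumption sits inside an HMM. Everything in it is a toy assumption made up for illustration: two hidden states, two possible frame symbols, the probability tables, and the function names. Real recognizers use far larger models with continuous acoustic features. In the standard formulation the assumption is, more precisely, that each frame depends only on the hidden state at that instant, so frames are independent of one another once the states are given. That is what lets the likelihood be computed one frame at a time.

```python
from itertools import product

# Toy two-state HMM over a two-symbol "frame" alphabet.
# All numbers are made up for illustration.
INITIAL = [0.6, 0.4]                     # P(first state)
TRANSITION = [[0.7, 0.3], [0.2, 0.8]]    # P(next state | current state)
EMISSION = [[0.9, 0.1], [0.3, 0.7]]      # P(frame symbol | state)


def path_probability(states, frames):
    """Joint probability of one state path and the observed frames.

    Each frame contributes a single factor that depends only on the
    state at that time step, never on the neighbouring frames. This is
    the independence assumption.
    """
    prob = INITIAL[states[0]] * EMISSION[states[0]][frames[0]]
    for t in range(1, len(frames)):
        prob *= TRANSITION[states[t - 1]][states[t]]
        prob *= EMISSION[states[t]][frames[t]]
    return prob


def brute_force_likelihood(frames):
    """Sum the joint probability over every possible state path."""
    n_states = len(INITIAL)
    return sum(
        path_probability(path, frames)
        for path in product(range(n_states), repeat=len(frames))
    )


def forward_likelihood(frames):
    """Same likelihood, computed frame by frame (forward algorithm)."""
    n_states = len(INITIAL)
    alpha = [INITIAL[s] * EMISSION[s][frames[0]] for s in range(n_states)]
    for frame in frames[1:]:
        alpha = [
            sum(alpha[prev] * TRANSITION[prev][s] for prev in range(n_states))
            * EMISSION[s][frame]
            for s in range(n_states)
        ]
    return sum(alpha)


if __name__ == "__main__":
    frames = [0, 0, 1, 1, 0]
    print(forward_likelihood(frames))
    print(brute_force_likelihood(frames))
```

The two functions should agree up to rounding. The brute-force sum grows exponentially with the number of frames, while the forward pass grows only linearly, and that shortcut exists only because each frame is tied to nothing but its own state. Drop the assumption and the shortcut is gone, which is the complexity problem described above.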
We hope that this research will shed some light on the next generation of computers that understand speech.


~ by Roberto Pieraccini on July 3, 2012.

