Demystifying Speech Recognition … but not too much

•August 8, 2012 • 1 Comment

I really appreciate it when people try to give a simplified view of technology with the goal of letting the general public understand what’s under the hood, and how complex it often is to make things work properly. That is the goal I had in mind when I embarked on the project of writing The Voice in the Machine. However, I believe we should not simplify too much, to the point of creating the perception that, after all, the problem is really simple and anyone can solve it. That oversimplification, suggesting that “… after all it’s easy to do it, …and why companies and researchers spend so many cycles in trying to solve problems that anyone with decent programming skills can approach…”, deceives the general public and can produce false expectations. A couple of days ago I stumbled upon a white paper entitled “Demystifying speech recognition” portraying speech recognition as a straightforward process based on a transformation from audio to phonemes (the basic speech sounds), phonemes to words, and words to phrases, with the “audio-to-phoneme” step described as a simple table lookup. Unfortunately, speech recognition does not work like that. Or at least high-performance, state-of-the-art speech recognition does not work that way. Not that it cannot be explained in a simple way, but there are a few important differences from what was described in that white paper.

First, the idea of starting with a transformation of audio into phonemes, attractive for various reasons, is quite old and does not work. Many people have tried it since the early experiments in the 1960s without much success, for reasons which I will explain later. Even today there are some commercial recognizers which, for good reason, use a phonetic recognizer as a front end. Without going into details, those recognizers are mostly used offline for extracting analytics from large amounts of human speech and are not intended for human-machine interaction, and there are some reasons why a phonetic recognizer would be preferable for that. However, any serious experimenter would tell you that using a phonetic recognizer as the front end of an interactive system, in other words a system where the ultimate goal is to get the words, or the concepts behind the words, with high accuracy, would show degraded performance when compared to a “traditional” modern speech recognizer that goes directly from audio to words without using phonemes as an intermediate step.

The point, which I call “the illusion of phonetic segmentation” in my book, is that in a pattern recognition problem with a highly variable and noisy input (like speech), making decisions on intermediate elements (like phonemes) as a step towards higher-level targets (like words) introduces errors that greatly affect the overall accuracy (which is measured not on the phonemes but on the words). And even if we had perfect phonetic recognition (which we don’t … and by the way, a phoneme is an abstract linguistic invention, while words are more concrete phenomena; see my previous post), the “phonetic” variation of word pronunciation (as in poteito vs. potato, or the word “you” in “you all” pronounced as “y’all”) would introduce further errors. So a winning strategy in pattern recognition in general, and in speech recognition in particular, is not to make any decision until you are forced to, in other words until you get to your target output (words, concepts, or actions).

A friend of mine used to say “A good general is a lazy general”, meaning that when you have to make an important decision, the longer you delay it to gather more data, the better your chance of eventually making a good one. The same concept applies to speech recognition. In fact, modern state-of-the-art speech recognizers (still based on ideas developed in the 1970s) do not make any decision until they have gathered enough evidence that the decision, typically on which words were spoken, is, so to speak, the best possible decision given the input data and the knowledge the system has. And this is not done using simple frame-to-phoneme mapping tables, but using sophisticated statistical models of phonetic units (called Hidden Markov Models, or HMMs) that are estimated over thousands of hours of training speech. Yet one of the open questions today is whether we have reached the end of life of these models and whether we need to look for something better, as I discussed in a previous post. Now, can we explain Hidden Markov Models in a simple way to make laypeople understand what they are, and help demystify speech recognition without describing it in the wrong way? Yes, we can, and I’ll try to do that in one of my future posts. But the point here is that, as I have said before, speech recognition has reached a stage where it can be successfully deployed in many applications, thanks also to the work of thousands of people who developed and improved the sophisticated algorithms and math behind it. However, to move to the next stage, we need to continue to work on it. The problem is neither simple nor solved.
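To make the “lazy general” idea a bit more concrete, here is a minimal toy sketch in Python. It is not how a production recognizer is built: the two states, the two discrete symbols, and every probability below are invented for illustration, whereas real systems model continuous acoustic frames with much larger models. The sketch contrasts an impatient decoder, which picks the most likely state for each frame in isolation, with a Viterbi search, which postpones the decision until the whole input has been seen.

```python
import math

# Hypothetical toy model: two hidden states and two observable symbols.
# All numbers are made up for illustration only.
STATES = ["A", "B"]
INIT = {"A": 0.5, "B": 0.5}
TRANS = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}}
EMIT = {"A": {"x": 0.6, "y": 0.4}, "B": {"x": 0.4, "y": 0.6}}


def greedy_decode(observations):
    """Impatient decoder: decide each frame on its own, ignoring context."""
    return [max(STATES, key=lambda s: EMIT[s][obs]) for obs in observations]


def viterbi_decode(observations):
    """Lazy decoder: keep every hypothesis alive and decide only at the end."""
    # best[s] holds the log-probability of the best path that ends in state s.
    best = {s: math.log(INIT[s]) + math.log(EMIT[s][observations[0]])
            for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_best, pointers = {}, {}
        for s in STATES:
            # Best predecessor of s, taking the transition cost into account.
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            new_best[s] = (best[prev] + math.log(TRANS[prev][s])
                           + math.log(EMIT[s][obs]))
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Only now, with all the evidence in, commit to a single path.
    path = [max(STATES, key=lambda s: best[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


observations = ["x", "x", "y", "x", "x"]
print("greedy :", greedy_decode(observations))
print("viterbi:", viterbi_decode(observations))
```

With these made-up numbers, the frame-by-frame decoder flips to state B on the single “y” frame, while the delayed decision stays in state A throughout, because one ambiguous frame is not enough evidence to pay for two unlikely transitions.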

The hard job of getting meanings

•July 29, 2012 • 3 Comments

If I had to choose one of the areas of human-machine natural communication where we haven’t been able to make any significant strides during the past decades, I would choose “general” language understanding. Don’t get me wrong. Language understanding per se has made huge steps ahead. IBM Watson’s victory over human Jeopardy! champions is a testimony to that. However, Watson required the work of an army of the brightest minds in natural language processing, semantics, and computer science for several years. How replicable that is in any other domain without hiring 40 world-class scientists is questionable. And that’s the issue I am talking about. We can say that speech recognition (that is, going from “speech” to “words”) is a relatively simple technology. If you have the right tools, all you need is a lot of transcribed speech and you get something that allows you to put together an application that works in most situations. Of course, as I said in a previous post, machines are still far from human performance, especially in the presence of noise, reverberation, and other amenities where our million-year-old carbon-based technology still excels. But, at least in principle, commercial recognizers do not need 40 PhDs to be put into operation. And you do not even need to understand the intricacies of spoken language. The technology evolved towards machines that can learn from data. And that’s what we have today. It is not perfect, and many scientists are working on making it better. But I am sure the next generation of speech recognition systems will also be usable in any domain without having to understand the intricacies of speech production and perception. Unfortunately, that is not true for language understanding (that is, going from “words” to “meanings”). If you want to put something together, something like Siri, for instance, you have to understand grammars, parsing, semantic attachments, ontologies, part-of-speech tagging, and many other things that I am not going to mention here. Let’s see why.

The ultimate product of speech recognition is the words contained in utterances. You don’t need a PhD in linguistics, or to be a computer scientist, to understand what words are. Anyone with a good mastery of a language (and almost everyone masters their own mother tongue) can listen to an utterance and more or less identify the words in it. There is also general agreement about the notion of words. Words are words are words. And we all learn words from childhood. So the output of a speech recognizer is quite well defined. As a consequence, it is relatively easy to come up with a lot (today in the hundreds of millions) of examples of utterances and their transcriptions into words. And if you master the art of machine learning, you can build, once and for all, a machine that learns from all of those transcribed utterances. The same machine will also be able to guess words it has never seen before, since it will also learn the word constituents, the phonemes, and will be able to put models of phonemes together to guess any possible word, existing or not. And once that machine is built, you have a general speech recognizer.

Instead, the ultimate product of language understanding is meaning. What is meaning? Linguists and semanticists have been arguing for decades about how to represent meaning, because a representation of meaning is not evident to us, or at least not as evident as words are to every speaker of a language (as long as the language “has” words). So a representation of meaning, one of the many available in the literature, has to be defined and imposed in order to start building a machine able to extract meaning from spoken utterances or text. While we can easily transcribe large numbers of utterances into words and use them to create machine-learned speech recognizers, associating meanings with a large number of utterances, or large amounts of text, is far more laborious and far more error-prone than transcription. Moreover, associating meanings with utterances or text can be done only if you have a rather good understanding of linguistics and of the chosen meaning representation. And, besides that, meaning representations depend on domains. Think, for instance, of the meaning of the word “bank” in a financial domain, as opposed to the meaning of the same word in aeronautics or geography.

This is why, in my opinion, we do not yet have general language understanding that can learn from data in any domain and in any situation. The point I want to make is that in artificial intelligence (I am using a general notion of AI here) we have been quite successful at building machines that can learn how to transform input signals into a representation which is somehow naturally evident, like words, but we have never been as successful when the representation is hidden, or artificially imposed by theories, like meanings. Is there a way out of this impasse, towards building machines that truly understand language in a general way and are not as domain-specific as the ones we have today? I hope so.

Post Scriptum:  Humans express meanings naturally not by using a sophisticated symbolic representation, but by rephrasing the original sentence into another. For instance, if you ask someone what the following sentence means “I am going to the market to buy potatoes”, he or she may say something like “this means that the person who is speaking has the intention to go to a place where they typically sell fresh produce to acquire edible tubers.” This process can go on ad infinitum by substituting every word of the new sentence with its definition, for instance taken from a dictionary. So, if we do that for another round of substitutions, we may get something like “…has the determination to move in a course towards a physical environment where they give up property to others in exchange of value such as unaltered agricultural products to come into possession of fit to be eaten fleshy buds growing underground.”

Apples and Oranges

•July 17, 2012 • 1 Comment

There is a lot of talk about the performance of Apple’s Siri. An article that appeared in the New York Times a few days ago brutally criticized Siri’s performance, and others compare it with Google Voice Search. As a professional in the field who has followed Google Voice Search closely and knows well the people who work on it (many of them former colleagues in previous lives and respected scientists in the community), I know and trust that it is definitely a state-of-the-art (and beyond) piece of technology. As far as I know it is based on the latest and greatest mathematical models, it uses a lot (and I mean “a lot”) of data, and I trust my Google friends are squeezing this data to the last bit of information in order to get the best possible performance. I have to admit I know very little about Siri. It is general knowledge that it uses Nuance’s speech recognizer, but beyond that we know very little. While the scientists from Google are often present at the most prestigious conferences in the field, and often present results to the community, I haven’t yet met anyone from Siri at one of these conferences, nor have I seen a scientific paper on it. I may have missed that, but I think it is Apple’s policy not to divulge what is cooking inside their labs. So, besides the speech recognizer, I don’t know the details of how Siri works, although I have some ideas about it. But the point I want to make here is that comparing Siri with Google Voice Search is a little bit like comparing Apples and Oranges (pun intended). Let me explain why.

When you speak into Google Voice Search, what comes out is text. Nothing more than text that represents a more or less accurate word-by-word transcription of what you just said. The text is fed to the traditional Google search, which returns a traditional Google list of results, or “links”. Siri, instead, tries to do something more. Text is an intermediate product, provided by the speech recognizer as a transcription of what you just said. But Siri then uses the text to “understand” what you just said and provide an answer. So, for instance, if you speak into Google and say “set up an appointment with John Doe for next Tuesday at three”, Google Voice Search will show you links to Web pages about setting up appointments, pages about Mr. John Doe, and pages having to do with Tuesdays and the number three. Siri, instead, if it works correctly, will pop up your calendar for next Tuesday and eventually set up the requested appointment. How do we compare them? We should probably compare them on something common to both, something that both do: for instance, the word-by-word transcription of what you said. Speech researchers have been doing this type of evaluation for decades using a measure called word error rate (or WER), which is computed by aligning the correct sequence of words for every test utterance with the corresponding sequence of words returned by the speech recognizer, and counting the words that were substituted, spuriously inserted, or deleted, relative to the number of words in the correct sequence. Typically the WER is computed on a test set of a few thousand utterances, randomly selected in order not to introduce any bias. And the very same utterances should be fed to both systems in order for the test to be fair. That is not an easy task for someone who is not an insider on the two systems at hand, but with the solution of some technical problems, it could be done more or less accurately.
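For readers who like to see things spelled out, the WER computation described above can be sketched in a few lines of Python. This is a minimal sketch, not the scoring tool used by any particular lab: the function names are made up here, the alignment is the standard minimum edit distance over words, and real evaluations usually apply more careful text normalization than the simple lowercasing done below.

```python
def word_errors(reference, hypothesis):
    """Return (errors, reference_length) for one utterance.

    errors is the minimum number of word substitutions, insertions and
    deletions needed to turn the reference into the hypothesis.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # previous[j] = edit distance between the reference words seen so far
    # and the first j words of the hypothesis.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing in output
            insertion = current[j - 1] + 1   # spurious word in output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1], len(ref)


def word_error_rate(pairs):
    """WER over a test set given as (reference, hypothesis) pairs."""
    total_errors = total_words = 0
    for reference, hypothesis in pairs:
        errors, words = word_errors(reference, hypothesis)
        total_errors += errors
        total_words += words
    if total_words == 0:
        raise ValueError("the test set contains no reference words")
    return total_errors / total_words


test_set = [
    ("set up an appointment with John Doe", "set up appointment with Jon Doe"),
    ("for next Tuesday at three", "for next Tuesday at three"),
]
print(f"WER = {word_error_rate(test_set):.1%}")
```

In this invented test set the first utterance has one deleted word and one substituted word out of seven, and the second is recognized perfectly, so the errors are pooled over all twelve reference words rather than averaged per utterance.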

Now, Google Voice Search stops there. There is nothing beyond the textual transcription for Google (although I notice more and more direct answers provided by Google to direct questions, a sign they are doing some sort of language understanding). In Siri, as we saw, there is more. But when we get to the realm of meaning, and not just words, things start to become complicated, and there are no easy-to-implement standard tests for gauging the accuracy of a language understanding system. Scientists have tried many times in the past, but it was always a hard task and, worse, it always depended very much on the application, and was not easily scalable and portable to different domains. And the problem is, if we can’t measure things, we cannot improve them easily. Now, I am sure that Apple scientists have their own way to measure the average accuracy of Siri’s answers (I would have one if I were one of them), but I know it’s not an easy task. In conclusion, we can’t really compare one with the other, since they do different things, but we can certainly say that one makes its users happier than the other.

Singing computers

•July 14, 2012 • 1 Comment

Building a computer that speaks with the same naturalness and intelligibility as humans is not a much easier task than building a computer that understands speech. In fact, it took decades to reach the quality of modern speech synthesizers, and yet the real human voice remains superior. Still today, whenever possible, automated spoken dialog systems on the phone deploy prompts recorded by voice talents carefully coached by voice user interface designers, rather than speech synthesized by a computer. Having said that, it is true that speech synthesis and text-to-speech technology has come a long way from its first attempts, and it is used in many commercial applications, including the navigator in your car.

One thing that may not be obvious is that making a computer sing is in a way easier than making it speak (whereas for humans, speaking is easier than singing). The reason is that speech synthesizers, in order to produce natural-sounding speech, have to give it the right intonation and rhythm, which depend on many factors, such as the structure of the sentences, their general meaning, the specific message to convey to the listener, the context of the whole text, and so forth. Generating the right intonation and rhythm automatically from written text alone is not an easy feat. We have developed algorithms to do that, but our algorithms, though much better than what we had decades ago, are not perfect yet. In a singing voice, by contrast, the intonation and rhythm are prescribed exactly by a score; it is enough to follow the music. So a synthetic singing voice may end up sounding more natural than a corresponding speaking voice.

In one of the most dramatic scenes of the classic 1968 sci-fi movie “2001: A Space Odyssey,” HAL 9000, the villain computer, sings a tune while it is being deactivated by astronaut David Bowman. The tune is Daisy Bell:

Apparently Stanley Kubrick and Arthur C. Clarke visited the most prestigious technology research centers of the time in order to get ideas for the movie before they started production. Certainly they visited Bell Labs and heard the first singing computer, created in 1961 by computer music pioneer Max Mathews. And certainly they heard that computer perform one of its preferred songs: Daisy Bell.

Put that there!

•July 5, 2012 • Leave a Comment

One of the first multimodal interaction systems, dubbed Put-that-there, was built at the MIT Architecture Machine lab in the late 1970s by Chris Schmandt, who is now the director of the Speech and Mobility Group at the MIT Media Lab. Here is a demo from 1979, where you see the integration of speech and gesture recognition to allow users to create and move objects on a screen. That was more than 30 years ago, and it worked pretty well … except … well, watch the video until the end…

What’s wrong with speech recognition?

•July 3, 2012 • 3 Comments

Although speech recognition is getting better and better, it keeps making mistakes that often annoy us. Much more than humans would do in similar situations. And we have been trying to make it better for decades. What’s wrong with it?

Scientists are constantly testing and trying to improve speech recognition in adverse conditions, such as when the ambient noise is quite high (like in a busy restaurant or at a cocktail party), or with highly accented speech (for instance when a foreigner is talking), or when the speaker is under stress (like a fighter pilot during flight), or in the presence of severe distortions (such as with a bad telephone connection or a highly reverberant room). While in all these situations human listeners can do pretty well, even if not perfectly, speech recognition systems degrade pretty badly. And sometimes even the best recognizer in the best possible conditions takes one word for another in a quite surprising manner. No matter what we do, computer speech recognition is still far from human performance.

One of the common notions, especially in the commercial and practitioner’s world, is that if we had more speech data to learn from, and faster computers, the difference between machine and human performance would go away. But that seems not to be the case. Using substantial amounts of data (thousands of hours of speech) in conjunction with faster and faster computers produces only modest improvements that seem to get closer and closer to the flat zone of diminishing returns. So, what’s wrong with speech recognition?

Recently at the International Computer Science Institute in Berkeley, under the auspices of IARPA and AFRL, we started a project called OUCH (as in Outing Unfortunate Characteristics of Hidden Markov Models) with the goal of taking a stab at what’s wrong with current speech recognition and why, no matter what we do, we can’t get as close to human performance as we would like. Nelson Morgan is leading the project and working with Steven Wegmann and Jordan Cohen.

One of the issues has to do with the underlying model. All current speech recognizers are based on a statistical/mathematical model (known as the Hidden Markov Model, or HMM) that was introduced in the mid 1970s. In order to be able to work with it and avoid unbearable complexity, the model makes strong assumptions that do not hold in practice. For instance, one of the strongest assumptions is that consecutive spectral snapshots of speech (we call them “frames”) are statistically independent of each other. We know very well that this is not true, but without pretending it were, the math of HMMs would become too complex to handle, and we would never have enough data to estimate their parameters effectively. In science we often make simplifying assumptions just to be able to handle things, even if we know that the assumptions are wrong, hoping that the inaccuracies they cause are small enough to be neglected. This is the case with Newtonian physics of masses and gravitation, which works for small-scale objects but turns out to be wrong when we talk about masses the size of planets, stars, and galaxies. While we can predict with fairly high accuracy the motion of a tennis ball, we will never be able to predict and explain events at the galactic level using Newton’s equations. No matter what, we would need a different model to account for that scale. Similarly, while we are able to recognize speech with decent accuracy using HMMs in good acoustic conditions, if we want to reach human-comparable performance in adverse situations, we will probably need to use a different model. Which model we don’t know yet, but we definitely know that HMMs as we know them may be inadequate, especially for human-like accuracy. Indeed, HMMs have served us well until now, but we need something different to be able to move ahead.
We hope that this research will shed some light on the next generation of computers that understand speech.
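For the mathematically inclined, the frame-independence assumption can be written down compactly. What follows is the usual textbook form of an HMM, not something specific to the OUCH project, and the symbols are notation introduced here: x_1 … x_T are the T frames of an utterance and q_1 … q_T are the hidden states the model passes through.

```latex
% Joint probability of frames and hidden states under an HMM
P(x_1,\dots,x_T,\; q_1,\dots,q_T)
  = P(q_1)\,P(x_1 \mid q_1)\,\prod_{t=2}^{T} P(q_t \mid q_{t-1})\,P(x_t \mid q_t)

% The assumption at stake: once the current state is known,
% earlier frames tell us nothing more about the current frame
P(x_t \mid q_t,\; x_1,\dots,x_{t-1}) = P(x_t \mid q_t)
```

Stated this way, the frames are assumed independent of each other once the hidden state is given. Real speech changes smoothly from one frame to the next, so the second line is exactly the kind of convenient fiction discussed above.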

An ante-litteram vision of Siri

•June 26, 2012 • 2 Comments

Some of you may remember Apple’s Knowledge Navigator video posted here. It was released in 1987 as a vision of a future tight integration of advanced technologies on a flat tablet computer. The computer in this entertaining vision clip sports a fancy Web-like interface, a realistic avatar, a touch screen with gesture recognition, teleconferencing, speech recognition, and natural language understanding, all of that from 5 to 15 years before they were actually available in some shape or form. I remember watching this during my first years at Bell Labs, when the recognition of connected digits was still one of the main challenges, a vocabulary of 1,000 words was considered a large one (a million-word vocabulary does not scare anyone in speech recognition today), and all we had for training speech recognizers was a few thousand utterances (today we talk about hundreds of millions of utterances).