Singing computers

•July 14, 2012 • 1 Comment

Building a computer that speaks with the naturalness and intelligibility of a human is not a much easier task than building a computer that understands speech. In fact, it took decades to reach the quality of modern speech synthesizers, and the superiority of the real human voice remains unbeatable. Even today, whenever possible, automated spoken dialog systems on the phone use prompts recorded by voice talents carefully coached by voice user interface designers, rather than speech synthesized by a computer. Having said that, it is true that speech synthesis and text-to-speech technology has come a long way from its first attempts, and it is used in many commercial applications, including the navigator in your car.

One thing that may not be obvious is that making a computer sing is, in a way, easier than making it speak (for humans it is the other way around: speaking is easier than singing). The reason is that speech synthesizers, in order to produce natural-sounding speech, have to give it the right intonation and rhythm, which depend on many factors, such as the structure of the sentences, their general meaning, the specific message to convey to the listener, the context of the whole text, and so forth. Generating the right intonation and rhythm automatically, from written text alone, is not an easy feat. We have developed algorithms to do that, but our algorithms, though much better than what we had decades ago, are not perfect yet. In a singing voice, by contrast, the intonation and rhythm are prescribed exactly by the score; it is enough to follow the music. So a synthetic singing voice may end up sounding more natural than the corresponding, non-singing, speaking voice.
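To make the point concrete, here is a small illustrative sketch, in Python, of why a singing voice is easier to drive than a speaking voice: each syllable's pitch and duration can be read straight off the score, so no intonation or rhythm model is needed at all. The score format and the note assignments below are hypothetical, invented for illustration; only the note-to-frequency formula (equal temperament, A4 = 440 Hz) is standard.

```python
# Sketch: in singing synthesis, the pitch and rhythm targets come
# directly from the score, not from analyzing the text.

NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def note_to_hz(name: str, octave: int) -> float:
    """Convert a note name and octave to frequency in Hz (A4 = 440 Hz)."""
    midi = 12 * (octave + 1) + NOTE_OFFSETS[name]
    return 440.0 * 2 ** ((midi - 69) / 12)

# A hypothetical score fragment: each syllable carries an exact pitch
# and duration (in beats) -- the "intonation problem" disappears.
score = [("Dai-", "D", 5, 1.5), ("-sy", "B", 4, 1.5), ("Dai-", "G", 4, 1.5)]

# Target contour handed to the synthesizer: (syllable, Hz, beats).
targets = [(syl, round(note_to_hz(n, o), 1), beats)
           for syl, n, o, beats in score]
```

A text-to-speech front end, by contrast, would have to infer a pitch contour like `targets` from the words alone, which is exactly the hard part described above.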

In one of the most dramatic scenes of the classic 1968 sci-fi movie “2001: A Space Odyssey,” HAL 9000, the villain computer, sings a tune while it is being deactivated by astronaut David Bowman. The tune is “Daisy Bell”:

Apparently Stanley Kubrick and Arthur C. Clarke visited the most prestigious technology research centers of the time to get ideas for the movie before they started production. Certainly they visited Bell Labs and heard the first singing computer, created in 1961 by computer music pioneer Max Mathews. And certainly they heard that computer perform one of its favorite songs: “Daisy Bell.”

Put that there!

•July 5, 2012 • Leave a Comment

One of the first multimodal interaction systems, dubbed Put-That-There, was built at the MIT Architecture Machine lab in the late 1970s by Chris Schmandt, who is now the director of the Speech and Mobility Group at the MIT Media Lab. Here is a demo from 1979, where you can see the integration of speech and gesture recognition to allow users to create and move objects on a screen. That was more than 30 years ago, and it worked pretty well … except … well, watch the video until the end…

What’s wrong with speech recognition?

•July 3, 2012 • 3 Comments

Although speech recognition is getting better and better, it keeps making mistakes that often annoy us, many more than a human would make in similar situations. And we have been trying to make it better for decades. What's wrong with it?

Scientists are constantly testing and trying to improve speech recognition in adverse conditions, such as when the ambient noise is quite high (like in a busy restaurant or at a cocktail party), with highly accented speech (for instance, when a foreigner is talking), when the speaker is under stress (like a fighter pilot during flight), or in the presence of severe distortions (such as a bad telephone connection or a highly reverberant room). While in all these situations human listeners do pretty well, even if not perfectly, speech recognition systems degrade quite badly. And sometimes even the best recognizer, in the best possible conditions, takes one word for another in a quite surprising manner. No matter what we do, computer speech recognition is still far from human performance.

One of the common notions, especially in the commercial and practitioners' world, is that if we had more speech data to learn from, and faster computers, the difference between machine and human performance would go away. But that seems not to be the case. Using substantial amounts of data (thousands of hours of speech) in conjunction with faster and faster computers produces only modest improvements that get closer and closer to the flat zone of diminishing returns. So, what's wrong with speech recognition?

Recently at the International Computer Science Institute in Berkeley, under the auspices of IARPA and AFRL, we started a project called OUCH (as in Outing Unfortunate Characteristics of Hidden Markov Models) with the goal of taking a stab at what's wrong with current speech recognition and why, no matter what we do, we can't get as close to human performance as we would like. Nelson Morgan is leading the project and working with Steven Wegmann and Jordan Cohen.

One of the issues has to do with the underlying model. All current speech recognizers are based on a statistical/mathematical model, the hidden Markov model (HMM), which was introduced in the mid 1970s. In order to be workable and avoid unbearable complexity, the model makes strong assumptions that are not verified in practice. For instance, one of the strongest assumptions is that consecutive spectral snapshots of speech (we call them “frames”) are statistically independent of each other. We know very well that this is not true, but without pretending it is, the math of HMMs would become too complex to handle, and we would never have enough data to estimate their parameters effectively.

In science we often make simplifying assumptions just to make a problem tractable, even though we know the assumptions are wrong, hoping that the inaccuracies they cause are small enough to be neglected. This is the case with Newtonian physics of masses and gravitation, which works for small-scale objects but turns out to be wrong when we talk about masses the size of planets, stars, and galaxies. While we can predict with fairly high accuracy the motion of a tennis ball, we will never be able to predict and explain events at the galactic level using Newton's equations; we need a different model to account for that scale. Similarly, while we are able to recognize speech with decent accuracy using HMMs in good acoustic conditions, if we want to reach human-comparable performance in adverse situations, we will probably need a different model. Which model, we don't know yet, but we do know that HMMs as we know them may be inadequate, especially for human-like accuracy. HMMs have served us well until now, but we need something different to be able to move ahead.
We hope that this research will shed some light on the next generation of computers that understand speech.
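To see exactly where the frame-independence assumption enters, here is a toy sketch of my own (not OUCH project code; single one-dimensional Gaussian emissions stand in for the Gaussian-mixture densities over cepstral features that real recognizers use). Given a state sequence, the likelihood of the whole utterance factorizes into a product of per-frame emission likelihoods, a simple sum in the log domain, precisely because each frame is assumed to depend only on its own state and not on its neighbors.

```python
import math

def log_gaussian(x: float, mean: float, var: float) -> float:
    """Log density of a 1-D Gaussian at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_likelihood(frames, states, emissions):
    """log P(frames | states) under the frame-independence assumption:
    the joint likelihood is just the sum of per-frame scores."""
    return sum(log_gaussian(x, *emissions[s]) for x, s in zip(frames, states))

# Toy acoustic model: state -> (mean, variance) of its emission density.
emissions = {"s1": (0.0, 1.0), "s2": (3.0, 1.0)}

frames = [0.1, -0.2, 2.9, 3.1]        # a short "utterance"
states = ["s1", "s1", "s2", "s2"]     # an aligned state sequence

ll = log_likelihood(frames, states, emissions)
```

In reality consecutive frames are strongly correlated (they come from the same slowly moving vocal tract), so this factorization systematically misstates the evidence, which is one of the characteristics the OUCH project set out to quantify.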

An ante-litteram vision of Siri

•June 26, 2012 • 2 Comments

Some of you may remember Apple's Knowledge Navigator video posted here. It was released in 1987 as a vision of a future tight integration of advanced technologies on a flat tablet computer. The computer in this entertaining vision clip sports a fancy Web-like interface, a realistic avatar, a touch screen with gesture recognition, teleconferencing, speech recognition, and natural language understanding, all of that 5 to 15 years before they were actually available in some shape or form. I remember watching it during my first years at Bell Labs, when the recognition of connected digits was still one of the main challenges, a vocabulary of 1,000 words was considered large (a million-word vocabulary does not scare anyone in speech recognition today), and all we had for training a speech recognizer was a few thousand utterances (today we talk about hundreds of millions of utterances).

The mythical 10 years

•June 18, 2012 • 1 Comment

Speech recognition is one of those technologies that have been around for a while but have never become mature enough to be considered established and part of everyday life like, say, digital cameras, retina displays, and Bluetooth. However, for a few years now speech recognition technology has been “sort of” working, so some of us started building applications and products around it; Siri and Google voice search are now the most popular evidence of that. But in fact, although it allows building useful applications, speech recognition by computers is still far from the human ability to deal with highly noisy, highly distorted, or highly accented speech. Think of our ability to understand speech at a cocktail party: speech recognition by computers is light years away from that. Speech recognition is still fragile and brittle. Because of that, it has always been “almost there … but not quite.” There is always a sense that computers' speech recognition capabilities will be close to those of humans in, well, 5 to 10 years from now. And that statement has been true every year for the past 50+ years.

Roger K. Moore, a long-timer in speech research, a professor at the University of Sheffield, UK, and a longtime friend, has been conducting surveys of senior and young speech scientists to determine when they think speech recognition will be, so to speak, a solved problem. The results from three surveys, conducted in 1997, 2003, and 2009, are reported in this paper. As an example of the survey results, when asked when they think “it will be possible to hold a telephone conversation with an automatic chat-line system for more than 10 minutes without realizing it isn't human,” the median answer in all three surveys was … well … the year 2050 … meaning we are slowly getting closer to that date. “Never” is the median answer to the question of when speech recognition experts think “there will be no more need for speech research” (we speech researchers have some job security, indeed). And when do speech researchers themselves think that “speech recognition will be commonly available at home”? Well, the answer is mostly “… about 10 years from now …”, and that answer was the same in 1997, 2003, and 2009. That is proof of the moving 10-year horizon of pervasive speech adoption. One of the funniest questions of the survey is: in which year do you think the following statement will be true: “A leading cause of time away from work is being hoarse from talking all the time, and people buy keyboards as an alternative to speaking.” If you want to know what speech scientists think about that, read the paper.

However, the situation is not that grim. Some interesting applications of speech recognition are out there, many people are trying to make the technology better, and we all still believe in it. Otherwise we wouldn't be writing and reading blogs like this. More to come, in the next posts, to convince you that speech recognition is still hot. Stay tuned!

Can machines really think?

•June 18, 2012 • Leave a Comment

Can machines really think? This question has been haunting machine intelligence experts and philosophers since long before today's Siri speech understanding and IBM Watson's Jeopardy! question-answering challenge. Even though today we hardly wonder anymore whether machines can think or not, the question was quite popular in the heyday of artificial intelligence, the Cold War, and the pioneers of computer science. The following video includes clips from a 1950 interview replayed in the 1992 PBS documentary The Machine That Changed the World. In the interview, MIT professor and scientific advisor to the White House Jerome Wiesner says (it is 1950, and people smoke pipes in public) that machines will actually be thinking in a matter of “four or five years.” Around the same time, more than 60 years ago, the “father of machine perception,” Oliver Selfridge of Lincoln Labs, has no doubt that “machines can and will think,” even though he is sure that his daughter will never marry a computer. And in a 1960 Paramount News feature dubbed “Electronic brain translates Russian to English” (electronic brain? eek…), an early machine translation engineer states, without the slightest shade of doubt in his voice, that if “their experiments” go well, they will be able to translate the entire output of the Soviet Union in just a few hours of computer time a week. The video even includes a brief appearance by Claude Shannon, the father of information theory. Enjoy the video:

The Voice in the Machine is Alive

•June 16, 2012 • Leave a Comment

Hello there. A couple of months after the publication of my book, The Voice in the Machine: Building Computers that Understand Speech, I decided to start a new blog on the same theme. Having the fortune of working at ICSI, the International Computer Science Institute, one of the few independent advanced research institutes where computer speech, language, and AI research is still vivid and unfettered (together with many other disciplines, such as networking, security, neuroscience, bio-informatics, and computer architecture), and computers that understand speech being one of my lifetime loves, I will use this blog to collect my notes and points of view on the evolution of the science and technology of what I like to call “talking machines.” Having the additional fortune of having worked on both sides of the research-industry chasm for more than three decades, I will do my best to put into words my presumably unbiased point of view on the related science, technology, and business. Talk to you soon … stay tuned.