Demystifying Speech Recognition … but not too much
I really appreciate it when people try to give a simplified view of technology with the goal of letting the general public understand what is under the hood, and how complex it often is to make things work properly. That was the goal I had in mind when I embarked on the project of writing The Voice in the Machine. However, I believe we should not simplify too much, to the point of creating the perception that, after all, the problem is really simple and anyone can do it. That kind of oversimplification, suggesting that “…after all it’s easy to do, so why do companies and researchers spend so many cycles trying to solve problems that anyone with decent programming skills could tackle?”, deceives the general public and can produce false expectations. A couple of days ago I stumbled upon a white paper entitled “Demystifying speech recognition” portraying speech recognition as a straightforward process based on a transformation from audio to phonemes (the basic speech sounds), phonemes to words, and words to phrases, with the “audio-to-phoneme” step described as a simple table lookup. Unfortunately, speech recognition does not work like that. Or at least high-performance, state-of-the-art speech recognition does not work that way. Not that it cannot be explained in a simple way, but there are a few important differences from what was described in that white paper.
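To make the criticism concrete, here is a deliberately naive sketch of the kind of pipeline that white paper implies. Every name and table entry below is hypothetical, invented only to show what a pure “table lookup” view would look like in code; it is emphatically not how any real recognizer works, which is exactly the point.

```python
# A naive "audio -> phonemes -> words" pipeline, as the white paper suggests.
# All tables and names are hypothetical, for illustration only.

# Hypothetical "acoustic table": one phoneme per (quantized) audio frame.
FRAME_TO_PHONEME = {
    "frame_pattern_001": "p",
    "frame_pattern_002": "ax",
    "frame_pattern_003": "t",
    # ...
}

# Hypothetical pronunciation dictionary (rough ARPAbet-style strings).
PHONEMES_TO_WORD = {
    "p ax t ey t ow": "potato",
    "t ax m ey t ow": "tomato",
    # ...
}

def naive_recognize(frames):
    """Recognize by hard table lookups -- the oversimplified view."""
    # Hard decision on every frame: real frames never match a key exactly.
    phonemes = [FRAME_TO_PHONEME.get(f, "?") for f in frames]
    # Hard decision on the word: any phoneme error breaks the lookup.
    return PHONEMES_TO_WORD.get(" ".join(phonemes), "<unknown>")
```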
First, the idea of starting with a transformation of audio into phonemes, attractive as it is for different reasons, is quite old and does not work. Many people tried it, starting with the early experiments of the 1960s, without much success, for reasons I will explain later. Even recently there have been commercial recognizers which, for good reasons, use a phonetic recognizer as a front end. Without going into details, those recognizers are mostly used offline for extracting analytics from large amounts of human speech, not for human-machine interaction, and there are reasons why a phonetic recognizer may be preferable for that task. However, any serious experimenter will tell you that a phonetic recognizer used as the front end of an interactive system, in other words a system whose ultimate goal is to get the words, or the concepts behind the words, with high accuracy, shows degraded performance when compared to a “traditional” modern speech recognizer that goes directly from audio to words without using phonemes as an intermediate step.
The point, which I call “the illusion of phonetic segmentation” in my book, is that in a pattern recognition problem with a highly variable and noisy input (like speech), making decisions on intermediate elements (like phonemes) as a step toward higher-level targets (like words) introduces errors that greatly affect the overall accuracy (which is measured not on the phonemes but on the words). And even if we had perfect phonetic recognition (which we don’t… and by the way, a phoneme is an abstract linguistic invention, while words are more concrete phenomena… see my previous post), the “phonetic” variation of word pronunciation (as in poteito vs. potato, or the word “you” in “you all” pronounced as “y’all”) would introduce further errors. So a winning strategy in pattern recognition in general, and in speech recognition in particular, is not to take any decision until you are forced to, in other words until you get to your target output (words, concepts, or actions).
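A back-of-the-envelope calculation shows why those hard intermediate decisions are so costly. The numbers below are hypothetical, and they assume, for simplicity, that phoneme errors are independent; real errors are more complicated, but the compounding effect is the same.

```python
# Hypothetical illustration of error compounding: if each of the six
# phonemes in "potato" (p ax t ey t ow) is recognized independently with
# 95% accuracy, and the word is right only when all six are right,
# word accuracy falls well below the per-phoneme figure.

phoneme_accuracy = 0.95   # assumed per-phoneme accuracy
phonemes_in_word = 6      # e.g. p ax t ey t ow

word_accuracy = phoneme_accuracy ** phonemes_in_word
print(f"word accuracy: {word_accuracy:.2f}")   # about 0.74
```

In other words, even with 95% per-phoneme accuracy, a six-phoneme word would come out right only about three times out of four, before pronunciation variation adds its own errors.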
A friend of mine used to say, “A good general is a lazy general,” meaning that when you have to make an important decision, the longer you delay it to gather more data, the better the chance of eventually making a good one. The same concept applies to speech recognition. In fact, modern state-of-the-art speech recognizers (still based on ideas developed in the 1970s) do not make any decision until they have gathered enough evidence that the decision, typically about which words were spoken, is, so to speak, the best possible decision given the input data and the knowledge the system has. And this is not done with simple frame-to-phoneme mapping tables, but with sophisticated statistical models (called Hidden Markov Models, or HMMs) of phonetic units that are estimated over thousands of hours of training speech. Yet one of the open questions today is whether we have reached the end of life of these models and whether we need to look for something better, as I discussed in a previous post. Now, can we explain Hidden Markov Models in a simple way, to make laypeople understand what they are and help demystify speech recognition without describing it in the wrong way? Yes, we can, and I’ll try to do that in one of my future posts. But the point here is that, as I have said before, speech recognition has reached a stage where it can be successfully deployed in many applications, thanks also to the work of thousands of people who developed and improved the sophisticated algorithms and math behind it. However, to move to the next stage, we need to continue to work on it. The problem is neither simple nor solved.
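For readers who want a glimpse of what “delaying the decision” looks like in practice, here is a minimal sketch of Viterbi decoding over a toy HMM. The states, transition probabilities, and “acoustic scores” are invented for illustration; a real recognizer uses HMMs of phonetic units trained on thousands of hours of speech and searches over word sequences. The thing to notice is that the decoder keeps every hypothesis alive at every frame and commits to a single answer only after the last frame, by backtracking.

```python
import math

# Toy left-to-right HMM: all states, transitions and "acoustic scores"
# below are invented for illustration, not a real acoustic model.
states = ["p", "ow", "t"]
log_init = {"p": 0.0, "ow": -1e9, "t": -1e9}          # start in "p"
log_trans = {
    ("p", "p"): math.log(0.5), ("p", "ow"): math.log(0.5),
    ("ow", "ow"): math.log(0.5), ("ow", "t"): math.log(0.5),
    ("t", "t"): math.log(1.0),
}

def log_emit(state, frame):
    # Toy "acoustic score": how well a frame matches a state.
    return math.log(0.8) if frame == state else math.log(0.1)

def viterbi(frames):
    # delta[s] = best log-score of any path ending in state s so far.
    delta = {s: log_init[s] + log_emit(s, frames[0]) for s in states}
    psi = []  # backpointers, one dict per frame after the first
    for frame in frames[1:]:
        new_delta, back = {}, {}
        for s in states:
            best_prev = max(states,
                            key=lambda r: delta[r] + log_trans.get((r, s), -1e9))
            new_delta[s] = (delta[best_prev]
                            + log_trans.get((best_prev, s), -1e9)
                            + log_emit(s, frame))
            back[s] = best_prev
        delta, psi = new_delta, psi + [back]
    # Only now, after the last frame, commit to the single best path.
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for back in reversed(psi):
        path.append(back[path[-1]])
    return list(reversed(path))

print(viterbi(["p", "p", "ow", "t"]))   # -> ['p', 'p', 'ow', 't']
```

The design choice this sketch highlights is exactly the “lazy general” principle: no frame is ever assigned a phoneme in isolation; every candidate path is scored against all the evidence, and the decision is made once, at the end.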
The comment about “not taking any decision until you are forced to” seems to be applicable in other areas as well. See the new book, WAIT – The Art and Science of Delay by Frank Partnoy (http://frankpartnoy.com/wait/), and “Why Procrastination is Good for You” (http://www.smithsonianmag.com/science-nature/Why-Procrastination-is-Good-for-You-162358476.html). University of San Diego professor Frank Partnoy argues that the key to success is waiting for the last possible moment to make a decision.