There is a lot of talk about the performance of Apple’s Siri. An article that appeared in the New York Times a few days ago brutally destroyed Siri from the point of view of its performance, and others compare it with Google Voice Search. As a professional in the field, having followed Google Voice Search closely and knowing well the people who work on it–many of them former colleagues from previous lives and respected scientists in the community–I know and trust that it is definitely a state-of-the-art (and beyond) piece of technology. As far as I know it is based on the latest and greatest mathematical models, it uses a lot (and I mean “a lot”) of data, and I trust that my Google friends are squeezing that data down to the last bit of information in order to get the best possible performance.

I have to admit I know very little about Siri. It is general knowledge that it uses Nuance’s speech recognizer, but beyond that we know very little. While the scientists from Google are often present at the most prestigious conferences in the field, and often present results to the community, I haven’t yet met anyone from Siri at one of these conferences, nor have I seen a scientific paper about it. I may have missed one, but I believe it is Apple’s policy not to divulge what is cooking inside its labs.

So, beyond the speech recognizer, I don’t know the details of how Siri works, although I have some ideas about it. But the point I want to make here is that comparing Siri with Google Voice Search is a little bit like comparing Apples and Oranges (the pun is intended). Let me explain why.

When you speak into Google Voice Search, what comes out is text. Nothing more than text that represents a more or less accurate word-by-word transcription of what you just said. The text is fed to traditional Google search, which returns a traditional Google list of results, or “links”. Siri, instead, tries to do something more. Text is an intermediate product, provided by the speech recognizer as a transcription of what you just said. But Siri then uses the text to “understand” what you said and provide an answer. So, for instance, if you speak into Google and say “set up an appointment with John Doe for next Tuesday at three”, Google Voice Search will show you links to Web pages about setting up appointments, pages about Mr. John Doe, and pages having to do with Tuesdays and the number three. Siri, instead, if it works correctly, will pop up your calendar for next Tuesday and eventually set up the requested appointment. How do we compare them?

We should probably compare them on something that is common to both, something that both do: for instance, the word-by-word transcription of what you said. Speech researchers have been doing this type of evaluation for decades using a measure called word error rate (or WER). It is computed by taking, for every test utterance, the correct sequence of words and the sequence of words returned by the speech recognizer, aligning the two, and counting how many words match and how many were substituted, spuriously inserted, or deleted. Typically the WER is computed on a test set of a few thousand utterances, randomly selected in order not to introduce any bias. And the very same utterances should be fed to both systems for the test to be fair. That is not an easy task for a non-insider on the two systems at hand, but with the solution of some technical problems, it could be done more or less accurately.
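For the curious reader, here is a minimal sketch of how such an alignment-based WER is computed, written in Python. It uses a plain word-level Levenshtein alignment and the standard formula WER = (substitutions + deletions + insertions) / reference words; the example sentences are made up for illustration, and a real evaluation would of course use an established scoring tool rather than this toy function.

```python
# Minimal word error rate (WER) computation via Levenshtein alignment.
# WER = (substitutions + deletions + insertions) / number of reference words.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                d[i][j] = d[i - 1][j - 1]            # match, no cost
            else:
                d[i][j] = 1 + min(d[i - 1][j - 1],   # substitution
                                  d[i - 1][j],       # deletion
                                  d[i][j - 1])       # insertion
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution ("tuesday" -> "wednesday") over 9 reference words.
ref = "set up an appointment with john doe for tuesday"
hyp = "set up an appointment with john doe for wednesday"
print(f"WER: {word_error_rate(ref, hyp):.2%}")   # -> WER: 11.11%
```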

Now, Google Voice Search stops there. There is nothing beyond the textual transcription for Google (although I notice more and more direct answers provided by Google to direct questions, a sign that they are doing some sort of language understanding). In Siri, as we saw, there is more. But when we get to the realm of meaning–and not just words–things start to become complicated, and there are no easy-to-implement standard tests for gauging the accuracy of a language-understanding system. Scientists have tried many times in the past, but it was always a hard task, and worse, the tests always depended heavily on the application and were not easily scalable or portable to different domains. And the problem is, if we cannot measure things, we cannot improve them easily. Now, I am sure that Apple scientists have their own way to measure the average accuracy of Siri’s answers–I would have one if I were one of them–but I know it’s not an easy task. In conclusion, we can’t really compare one with the other, since they do different things, but we can certainly say that one makes its users happier than the other.
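To give a flavor of what an understanding-level measurement could look like–there is no agreed-upon counterpart to WER here, and I have no idea how Apple actually scores Siri–one could annotate each test utterance with an intent and a set of slots and score a system’s predictions against that reference. The sketch below is purely hypothetical; the intent and slot names are invented for illustration.

```python
# Hypothetical sketch of scoring a language-understanding output against a
# reference annotation: exact-match intent accuracy plus micro-averaged slot F1.
# The intent and slot names below are invented for illustration only.

from collections import Counter

def intent_accuracy(refs, hyps):
    """Fraction of utterances whose predicted intent matches the reference."""
    correct = sum(1 for r, h in zip(refs, hyps) if r["intent"] == h["intent"])
    return correct / len(refs)

def slot_f1(refs, hyps):
    """Micro-averaged F1 over (slot_name, slot_value) pairs."""
    tp = fp = fn = 0
    for r, h in zip(refs, hyps):
        ref_slots = Counter(r["slots"].items())
        hyp_slots = Counter(h["slots"].items())
        tp += sum((ref_slots & hyp_slots).values())
        fp += sum((hyp_slots - ref_slots).values())
        fn += sum((ref_slots - hyp_slots).values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# One annotated test utterance: "set up an appointment with John Doe for Tuesday at three"
reference  = [{"intent": "create_appointment",
               "slots": {"contact": "john doe", "day": "tuesday", "time": "3pm"}}]
prediction = [{"intent": "create_appointment",
               "slots": {"contact": "john doe", "day": "tuesday"}}]  # missed the time

print(f"Intent accuracy: {intent_accuracy(reference, prediction):.2f}")  # 1.00
print(f"Slot F1:         {slot_f1(reference, prediction):.2f}")          # 0.80
```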

One response to “Apples and Oranges”

  1. Peter

    Actually, Google is announcing a Siri-like system for the new Android Jelly Bean (along with the standard voice search):

    http://www.android.com/whatsnew/

    It would be great if their NLU testing approaches were reported, since they’re more science/community friendly than Apple.

