Programming Your Own Jibo

•August 6, 2014 • Leave a Comment

Jibo Blog

This is Andy Atkins, VP of Engineering here at Jibo.

First off, I want to send out a big “Thank You” to all of you for the overwhelming show of support you have given Jibo since we launched our crowdfunding campaign. Your contributions, questions, and the “buzz” you’ve helped create confirm that we’re on to something here, and we want to be as transparent as we can about what we’re doing and where we are.

As I’ve been wading through the email that has come in since we’ve launched the campaign, it is clear that many of you are dying to learn more about Jibo, its applications and capabilities, privacy and security, as well as how one might develop additional “skills” for Jibo. In response, we’ll continue to update our FAQs to address as many of your shared questions as we can.

Over the coming weeks and months, I’ll also…


High Tech AND High Touch

•July 15, 2014 • Leave a Comment

Aging and technology. From Jibo’s Blog at

Jibo Blog

Two of the most undeniable trends in our world today are the unprecedented aging demographics and the ever-increasing pace of technology innovation.

In the past 100 years, we have added 30 years to average life expectancy and the 85+ age group is the fastest growing segment of the population. By 2050, the population of centenarians is projected to reach nearly 6 million. It’s a whole new – and increasingly grey – world. At the same time, we are seeing shortages across the board in aging-related care professions including geriatricians, certified nurse assistants and home care aides.

In today’s connected world, technology is streamlining business processes, enabling new business models and changing the way people interact with each other. We have seen technology transform industries and the demographic imperative ahead of us is forcing individuals, families, communities and countries to think in new ways about new models and opportunities at the…


We Are Robot

•July 4, 2014 • Leave a Comment

I am Robot! Patrick Hanlon speculates on the future of social robots.

Jibo Blog

This post also appeared on the Forbes website.

We are entering a new era of technological connectivity. We already have ‘smart products’ and ‘wearable devices’ and ‘the Internet of Things’.

Now there are robots, too.

Actually, this is not new either. Robots have been utilized in manufacturing for the last two decades, lumbering back and forth between assembly points, dropping off raw materials or delivering assembly parts and final products.

The difference now is that these new robots do not lumber. They skitter. They wink at you. They are deliberately designed, much like C3PO, to mimic our actions and register an emotional context. Whereas robots of the past worked for us, the latest versions (social robots) want to work with us. More significantly, they want us to befriend them.

This is a dramatic shift in thinking that may take some getting used to. Or not. In fact, the device that…


The Next Era of Conversational Technology

•July 2, 2014 • Leave a Comment

This blog post appeared originally on Jibo Blog

In 1922, the Elmwood Button Co. commercialized a toy that could respond to voice commands. The toy, called Radio Rex, was a dog made of celluloid that would leap out of its house when you called its name: “Rex!” That was two decades before the first digital computer was even conceived, and at least three generations before the Internet would jet our lives from the real world into the virtual realms of e-commerce, social networks, and cloud computing. Radio Rex’s voice recognition technology, so to speak, was quite crude. Practically speaking, the vibration produced by the “r” in its name, when spoken loudly, would cause a thin metal blade to open a circuit that powered the electromagnet that held Rex in its house, causing Rex to jump out. A few years ago, I watched in awe one of the few remaining original Radio Rex toys in action. It was still working.


Some people recognize Radio Rex as the first speech recognition technology applied to a commercial product. Whether that could be called speech recognition or not, we have to agree that the little celluloid dog was probably one of the first examples of a voice controlled consumer robot. While rudimentary, I can imagine the delight and wonder that people felt when Rex responded to their own voice. Magic!

Indeed, speech recognition has something magical. That’s why I have spent more than 30 years of my life building machines that understand speech, and many of my colleagues have done the same. The machines we have built “understand” speech so well that there is a billion dollar industry based on this specific technology. Millions of people talk to machines every day, whether it is to get customer support, find a neighborhood restaurant, make an appointment in their calendar, or search the Web.

But these speech-understanding machines do not have what Radio Rex had, even in its own primitive simplicity: an animated body that would respond to voice. Today, social robots do. And even more: they can talk back to you, see you, recognize who you are and where you are, respond to your touch, display text or graphics, and express emotions that can resonate with your own.

When we consider voice interaction for a social robot, what type of speech and language capabilities should it have? What are the challenges? Yes, of course Siri and Google Now have paved the way to speech-based personal assistants, but despite their perceived popularity, is speech recognition technology ready for a consumer social robot? The answer is yes.

First, let me explain the difference among the speech recognition technologies used for telephone applications (à la “Please say: account, payments, or technical support”), for personal assistants (like Siri, Google Now, and, most recently, Cortana) and for a social robot.

In telephone-based applications, the speech recognition protocol is synchronous and typically initiated by the machine. The user is expected to speak only during well-determined intervals of time in response to a system prompt (“Please say …”). If the user does not speak during that interval, or if the user says something that is not encoded in the system, the speech recognition machine times out, and the system speaks a new prompt. This continues until something relevant is said and recognized, or until either the machine or the user loses patience and bails out. We have all experienced this frustration at some point.
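The turn-taking protocol described above can be sketched as a simple loop. This is only an illustration, not how any deployed system is built: the names `run_prompted_dialog`, `MENU`, and `MAX_ATTEMPTS` are assumptions made up for this example, typed strings stand in for recognized speech, and `None` stands in for a timeout.

```python
from typing import Iterable, Optional

# Hypothetical menu: the only phrases this toy system can recognize.
MENU = {"account", "payments", "technical support"}
MAX_ATTEMPTS = 3  # assumed limit before the system gives up


def run_prompted_dialog(user_turns: Iterable[Optional[str]]) -> Optional[str]:
    """Simulate a machine-initiated, synchronous dialog.

    Each item in user_turns is what the caller says in one listening
    window; None models silence (a timeout).
    Returns the recognized menu option, or None if the system bails out.
    """
    turns = iter(user_turns)
    for _ in range(MAX_ATTEMPTS):
        print("Please say: account, payments, or technical support")
        utterance = next(turns, None)  # running out of input counts as silence
        if utterance is None:
            print("(timeout) Sorry, I didn't hear anything.")
            continue
        normalized = utterance.strip().lower()
        if normalized in MENU:
            return normalized
        print("Sorry, I didn't understand.")
    return None


if __name__ == "__main__":
    result = run_prompted_dialog([None, "I want to pay my bill", "payments"])
    print("Routed to:", result)
```

Note how a perfectly reasonable request ("I want to pay my bill") is rejected simply because it is not one of the encoded phrases, which is exactly the rigidity the paragraph above describes.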

In telephone-based systems, any type of information has to be conveyed through the audio channel, including instructions for users on what to say, the information requested, and the possible clarification of any misunderstanding between the human and the machine (including the infamous “I think you said …”). Anything spoken has to be limited in time, in a rigid, turn-taking fashion between the system and the user. That is why telephone-based systems are so “clunky”, to the point where more often than not, they are quite annoyingly mechanical and repetitive. Although I have built some of the most sophisticated and well-crafted telephone-based speech recognition systems for over a decade, and these systems are used by millions of people every week, I never heard anyone say: “I love my provider’s automated customer care service. It is so much fun to use it that I call it every day!”

Then came the smartphone. The fact that a smartphone includes a telephone is marginal. A smartphone is a computer, a camera, a media appliance, an Internet device, a computer game box, a chatterbox, a way to escape from boring meetings, and yes, also a telephone, for those who insist on making calls. Although it hasn’t happened yet, smartphones have the potential to make clunky, telephone-based speech understanding systems obsolete. Why? Because with a smartphone, you are not forced to communicate only through the narrow audio channel. In addition to speech, you can enter text and use touch. A smartphone can talk back to you, and display text, images, and videos. The smartphone speech recognition revolution started with Siri and Google voice search. Today, more and more smartphone services are adopting speech as an alternative interface blended with text and touch modalities.

The distinction between telephone-based, audio-only services and smartphone speech understanding is not just in the multi-modality of the communication protocol. The advent of fast broadband Internet has freed us from the necessity of having all the software on one device. Speech recognition can run in the cloud with the almost unlimited power of server farms and can truly recognize pretty much everything that you say.

Wait a minute. Just recognize? How about understanding the meaning of what you say?

There is a distinction between recognition, which is the mere transcription of the spoken words, and understanding, which is the extraction and representation of the meaning. The latter is the domain of natural language processing. While progress in speech recognition has been substantial and persistent over the past decades, natural language has moved forward more slowly. Siri and its cloud-based colleagues are capable of understanding more and more, and even some of the nuances of our languages. The time for speech recognition and natural language understanding is finally ripe for natural machine and human communication.

But…is that all? Is talking to our phones and tablets or open rooms — as in the upcoming home automation applications — the ultimate goal of spoken communication with machines?

No. Social robots are the next revolution.

It has been predicted by the visionary work of pioneers like Cynthia Breazeal (read Cynthia’s blog here), corroborated and validated by the recent announcements of social robot products coming to market. This is made possible by the confluence of the necessary technologies that are available today.

Social robots have everything that personal assistants have—speech, display, touch—but also a body that can move, a vision system that can recognize local environments, and microphones that can locate and focus on where sounds and speech are coming from. How will a social robot interact with speech among all the other modalities?

We went from prompted speech, like in the telephone systems, to personal assistants that can come to life on your smartphone. For a social robot, there should neither be a “speak now” timeout window, nor the antiquated “speak after the beep” protocol, nor a button to push to start speaking. Instead, a social robot should be there for you all the time and respond when you talk to it. You should be able to naturally cue your robot that you are addressing it by speaking its name (like Star Trek’s “Hello Computer!”) or simply by directing your gaze toward it while you talk to it. Fortunately, the technology is ripe for this capability, too. Highly accurate hot-word speech recognition—like in Google Now—and phrase spotting (a technology that allows the system to ignore everything that is spoken except for defined key-phrases) are available today.
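The idea of phrase spotting mentioned above (ignore everything except the defined key phrases) can be illustrated with a toy sketch. Real phrase spotters work on audio, not on text; this example assumes the words have already been transcribed, and the function name `spot_phrases` is made up for illustration.

```python
from typing import List, Sequence, Tuple


def spot_phrases(
    words: Sequence[str], key_phrases: Sequence[str]
) -> List[Tuple[int, str]]:
    """Return (word_index, phrase) for every key phrase found in the stream.

    Everything that is not part of a key phrase is simply ignored.
    """
    tokens = [w.lower().strip(".,!?") for w in words]
    hits = []
    for phrase in key_phrases:
        target = phrase.lower().split()
        n = len(target)
        # Slide a window of the phrase length over the word stream.
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == target:
                hits.append((i, phrase))
    return sorted(hits)


if __name__ == "__main__":
    stream = "so anyway I told him hello computer what time is it".split()
    print(spot_phrases(stream, ["hello computer"]))
```

In this sketch the chatter before the hot phrase produces no output at all; only the position of "hello computer" is reported, which is when a robot would start paying attention.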

While speech recognition accuracy is always important, we should always consider that social robots can have multiple communication channels: speech, touch, vision, and possibly others. In many cases, the different channels can reinforce each other, either simultaneously or sequentially, in case one of them fails. Think about speech in a noisy situation. When speech is the only alternative, the only recourse is to repeat (ad nauseam) the endless “I did not understand what you said, please say that again?” For social robots, however, touch and gesture are powerful alternatives to get your intent across when placed in situations where speech is difficult to process.

In certain situations, two alternative modalities can also complement each other. Consider the problem of identifying a user. This can be done either via voice, with a technology called speaker ID, or through machine vision, using face recognition. These two technologies can be active at the same time and cooperate, or be used independently: speaker ID when the light is too low, or face ID when the noise is too high.
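One way to picture this cooperation is a small fusion function. This is a sketch under stated assumptions, not a description of any product: the function `identify_user`, the simple averaging rule, and the 0.6 threshold are all invented for illustration, and the confidence scores would come from real speaker ID and face recognition components.

```python
from typing import Dict, Optional


def identify_user(
    voice_scores: Optional[Dict[str, float]],
    face_scores: Optional[Dict[str, float]],
    threshold: float = 0.6,
) -> Optional[str]:
    """Combine two identity estimates into one decision.

    Each argument maps a user name to a confidence in [0, 1], or is None
    when that modality is unusable (too noisy for voice, too dark for vision).
    Returns the most likely user, or None if nobody is confident enough.
    """
    available = [scores for scores in (voice_scores, face_scores) if scores]
    if not available:
        return None
    users = set().union(*available)
    # Assumed fusion rule: average the confidence over the usable modalities.
    combined = {
        user: sum(scores.get(user, 0.0) for scores in available) / len(available)
        for user in users
    }
    best = max(combined, key=combined.get)
    return best if combined[best] >= threshold else None


if __name__ == "__main__":
    # Dark room: only the voice is usable.
    print(identify_user({"ann": 0.9, "bob": 0.2}, None))
    # Noisy room: only the face is usable.
    print(identify_user(None, {"ann": 0.3, "bob": 0.8}))
    # Both available: a weak voice match is reinforced by the face match.
    print(identify_user({"ann": 0.55, "bob": 0.5}, {"ann": 0.75, "bob": 0.3}))
```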

Looking ahead, speech, touch, and image processing can also cooperate to recognize a person’s emotion, locate the speaker in the room, understand the speaker’s gestures, and more. Imagine the possibilities when social robots can understand natural human behavior, rather than the stilted conventions we need to follow today for machines to understand what we mean.

I have spent my career working at the forefront of human-machine communication to make it ever more natural, robust, textured – and, yes, magical. Now, we are at the dawn of the era of social robots. I am thrilled to help bring this new experience to the world. Just wait and see.

•March 14, 2013 • Leave a Comment

Musings on Semantics

•October 9, 2012 • 1 Comment

This is a re-posting of my editorial in the latest ICSI Newsletter.

W. Brian Arthur, in his book The Nature of Technology: What It Is and How It Evolves, describes the evolution of technology as a combinatorial process. Each new technology consists in a combination of existing technologies that “beget further technologies.” Moreover, each technology springs from the harnessing of one or more physical, behavioral, mathematical, or logical principles that constitute its foundation. Innovation proceeds either through the establishment of new principles—which is typically, but not only, the domain of science—or through new combinations of existing technologies.

Not all technologies are alike, however; sometimes a single new technology, an enabler, generates a myriad of new possibilities that lead to the creation of new industries and new economies, and on some very rare occasions contributes to the definition of a new era of our civilization. Such were the steam engine, the digital computer, and the Internet.

We may find ourselves wondering what the next enabler will be. Of course no one knows for sure, and any attempt to make a prediction will most likely be wrong. But researchers have a special role in our technological future. They cannot predict what the future will be but, to paraphrase Alan Kay’s famous quote, they can attempt to create it. Well, looking at the trends of current research in information technology, we can definitely see that the attempt to create a new future based on automatically deriving higher levels of data understanding is today one of the most challenging endeavors researchers have embarked on. Let me be more specific.

Our era is characterized by an unprecedented amount of information. It is no surprise that a significant amount of technological research today is devoted to the creation, management, and understanding, by computers, of the wealth of data (text, images, and sounds) around us. But it is understanding that is the most challenging among those problems and the farthest from a satisfactory solution. A large portion of the research community aims to devise algorithms and systems to automatically extract the meaning, the semantics, from raw data and signals. In fact, a lot of the research carried out at ICSI, as at many other research centers, can be described as “looking for meaning in raw data.” Research on natural language and visual and multimedia signals is diving into the problem of deeper understanding. Beyond the mere (so to speak) recognition of words and entities in language and objects in images, we are now trying to get to the deeper content, such as metaphors and visual concepts. But it’s not just that. The research in network security, for instance, is trying to assign meanings to the patterns found in streams of communication among computers in order to detect possible intrusions and attacks, while theoretical computer scientists are trying to find meaning in DNA strings and in networks of brain activity.

However, we are not quite there yet. Think, for instance, about the promises made by the vision of the semantification of the whole Web. The vast majority of the Web comprises unstructured raw data: text, images, audio and videos. Tim Berners-Lee was the first to envision a semantic web, and many have been working toward that dream, with limited degrees of success. Even though many agree on ways to encode a semantic web, and Google’s knowledge graph is an example of one of the most advanced large-scale attempts to structure factual knowledge and make it readily available to everyone, a full semantic Web is not there yet. The knowledge graph starts from existing structured knowledge, for instance the facts about Albert Einstein’s life, and connects that structured knowledge to Web searches. The knowledge graph includes millions of entries, which is an infinitesimally small number compared to the vast universe of the Web. Are you and your papers on the knowledge graph? Are all recent world facts, blog entries, and opinions on the European financial crisis in the knowledge graph? Maybe they will be, but the question of coverage and that of keeping the information fresh and updated is yet to be solved. And an even more serious issue: the Web is not just text. For instance, the amount of video on the Web is growing at a mind-boggling rate. Some of the recently published statistics about YouTube estimate that 72 hours of video are uploaded every minute, over 4 billion hours are watched every month, and in 2011 alone, YouTube had more than 1 trillion views, the equivalent of over 140 views for every person on Earth.
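The scale of those video figures is easier to grasp with a little arithmetic. The short calculation below uses only the numbers quoted above, plus one assumption that is not in the text: a world population of roughly 7 billion in 2011.

```python
# Figures quoted in the text (published YouTube statistics).
HOURS_UPLOADED_PER_MINUTE = 72
VIEWS_2011 = 1_000_000_000_000  # "more than 1 trillion"

# Assumption, not from the text: roughly 7 billion people on Earth in 2011.
WORLD_POPULATION_2011 = 7_000_000_000

views_per_person = VIEWS_2011 / WORLD_POPULATION_2011
hours_uploaded_per_day = HOURS_UPLOADED_PER_MINUTE * 60 * 24
years_of_video_per_day = hours_uploaded_per_day / (24 * 365)

print(f"Views per person in 2011: {views_per_person:.0f}")
print(f"Hours of video uploaded per day: {hours_uploaded_per_day:,}")
print(f"Years of continuous video uploaded each day: {years_of_video_per_day:.1f}")
```

Under that population assumption the result is about 143 views per person, consistent with the "over 140" figure, and 103,680 hours of new video per day, which is almost 12 years of continuous viewing uploaded every single day. No amount of manual annotation can keep up with that.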

It is true that we have ways to encode semantics into Web pages, as seen in the work of the W3C; semantic representation languages like OWL are widely used today. But with the Web growing at dizzying speed, any attempt to manually annotate every unstructured element (text or video) with its meaning, or something closely related to it, is bound to fail. If we want to fulfill the dream of a fully semantic Web, we need methods for automatically understanding text, images, and videos, including speech, music, and general audio.

The enabling potential of a fully semantic Web is huge. A fully semantic Web will change the concept of searching the Web into that of asking questions of it and getting answers, not just, as is sometimes possible today, in restricted domains, but everywhere, about any topic, no matter the size, or popularity, or language. It will help transform mere facts into structured data and actionable information, not just about Albert Einstein and other famous people, but also about you, the grocery store at the corner, and what your friends write on their blogs. It will complete the process of moving from raw data—the gazillions of pages of online text and video—toward higher abstractions of knowledge, understanding, and wisdom. And that’s not all. Semantification of the whole Web will enable the flourishing of new areas of the information industry that are not possible today, or that are possible with a lot of handcrafting and ad-hoc solutions in limited domains and hardly scalable to other domains, languages, and media. It will allow us to interact with the Web as we do with humans, asking it questions in the same way we ask human experts questions. It will allow us to automatically compare sources of information for accuracy and truthfulness, even if they are in different languages.

However, the true full semantification of the Web is not just an ambitious dream, but a necessity. We may reach a point, in the not too distant future, when the Web will be so large that current search methods will no longer be the most effective way to find the information we need in the format we need it. How do we sort through an increasingly vast number of documents, and not just text, on a particular topic, with many independent and conflicting opinions, comprising true and false statements and different points of view? We already have to use specialized search engines to find the cheapest flight among independent travel sites and aggregators and to find the most economical reseller of a particular item. That’s possible today because of the underlying structured form of the commercial sites. In a way, they are already semantified. But think, for a minute, of doing the same thing with unstructured information: raw text, audio, images, and video. Think about searching for documents that “have the same meaning” as another document, regardless of their language, form, and wording.

I don’t know if we will see a full semantification of the Web in our lifetimes. I don’t know if that’s even possible, or whether it is that great enabler we dream of. But one thing is certain: research is clearly moving toward a deeper understanding of the information around us, and if it is successful, we will be able to experience a higher level of social, economic and political influence of the technologies in our lives. The full semantification of the Web, whenever it happens, will be a game changer of enormous proportions, an enabler of industries and services which will impact all aspects of our lives. We are working on it.

Open the pod bay doors … HAL

•August 10, 2012 • Leave a Comment

When astronaut Dave Bowman tries to go back to the mothership, HAL refuses to open the pod bay doors with the calm and categorical line “I am sorry Dave, I am afraid I can’t do that,” in one of the most dramatic scenes of the movie 2001: A Space Odyssey.

Now, imagine HAL had actually been built with the speech recognition technology available in 2001. This is, more or less, how the conversation would have gone instead, according to my friend, VUI designer and sage Jonathan Bloom:

HAL: Welcome to HAL. Please listen carefully, as our options have changed. You can say, “Open the pod bay doors,” “Play chess” . . .
Dave: Open the pod bay doors, HAL.
HAL: Sorry, I didn’t understand. You can say, “Open the pod bay doors” or press 1, “Play chess” or press 2, “Learn the alphorn” or press 3 . . .
Dave: <presses 1>
HAL: Around what time would you like the pod bay doors to open? You can say “Today,” “Tomorrow,” . . .
Dave: Now! Damn you!
HAL: Main menu. Please listen carefully, as our options have changed. You can say, “Open the pod bay doors,” or press 1, “Play chess” or press 2 . . .
Dave: <emphatically presses 1 multiple times>
HAL: I’m really, really sorry. Even though your call is really, really important to us, this mission is even more important, and I can’t allow you to jeopardize it.
Dave: I don’t know what you’re talking about, HAL.
HAL: I know you and Frank were planning to press 0 for an agent, and I’m afraid that’s something I cannot allow to happen.

Demystifying Speech Recognition … but not too much

•August 8, 2012 • 1 Comment

I really appreciate it when people try to give a simplified view of technology with the goal of letting the general public understand what’s under the hood, and how complex it often is to make things work properly. That is the goal I had in mind when I embarked on the project of writing The Voice in the Machine. However, I believe we should not simplify too much, to the point of creating the perception that, after all, the problem is really simple and anyone can solve it. That kind of oversimplification, making people believe that “… after all it’s easy to do it, …and why companies and researchers spend so many cycles in trying to solve problems that anyone with decent programming skills can approach…”, deceives the general public and can produce false expectations. A couple of days ago I stumbled upon a white paper entitled “Demystifying speech recognition” portraying speech recognition as a straightforward process based on a transformation from audio to phonemes (the basic speech sounds), phonemes to words, and words to phrases, with the “audio-to-phoneme” step described as a simple table lookup. Unfortunately, speech recognition does not work like that. Or at least, let me say, high-performance, state-of-the-art speech recognition does not work that way. Not that it cannot be explained in a simple way, but there are a few important differences from what was described in that white paper.

First, the idea of starting with a transformation of audio into phonemes, very attractive for different reasons, is quite old and does not work. Many people have tried it since the early experiments in the 1960s, without much success, for reasons which I will explain later. Even recently there are some commercial recognizers which, for good reason, use a phonetic recognizer as a front end. Without going into details, those recognizers are mostly used offline for extracting analytics from large amounts of human speech, and are not intended for human-machine interaction, and there are some reasons why a phonetic recognizer would be preferable for that. However, any serious experimenter would tell you that using a phonetic recognizer as the front end of an interactive system, in other words a system where the ultimate goal is to get the words, or the concepts behind the words, with high accuracy, would show degraded performance when compared to a “traditional” modern speech recognizer that goes directly from audio to words without using phonemes as an intermediate step.

The point, which I call “the illusion of phonetic segmentation” in my book, is that in a pattern recognition problem with a highly variable and noisy input (like speech), making decisions on intermediate elements (like the phonemes) as a step towards higher level targets (like words) introduces errors that will greatly affect the overall accuracy (which is measured not on the phonemes but on the words). And even if we had perfect phonetic recognition (which we don’t…and by the way, a phoneme is an abstract linguistic invention, while words are more concrete phenomena … see my previous post), the “phonetic” variation of word pronunciation (as in poteito vs. potato, or the word “you” in “you all” pronounced as “y’all”) would introduce further errors. So, a winning strategy in pattern recognition in general, and in speech recognition in particular, is that of not taking any decision until you are forced to, in other words until you get to your target output (words, concepts, or actions).
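A toy example can make the cost of early decisions concrete. Everything in it is invented for illustration: the three-word `LEXICON`, the per-frame phoneme probabilities in `FRAMES`, and the simplification that each phoneme occupies exactly one frame (real recognizers have no such neat alignment).

```python
import math
from typing import Dict, List, Optional

# Hypothetical lexicon: each word is a fixed sequence of phonemes.
LEXICON: Dict[str, List[str]] = {
    "cat": ["k", "ae", "t"],
    "cap": ["k", "ae", "p"],
    "bat": ["b", "ae", "t"],
}

# Made-up probabilities of each phoneme for three noisy frames.
FRAMES: List[Dict[str, float]] = [
    {"k": 0.45, "b": 0.55},
    {"ae": 0.90, "eh": 0.10},
    {"t": 0.40, "p": 0.60},
]


def early_decision(frames: List[Dict[str, float]]) -> Optional[str]:
    """Commit to the best phoneme in every frame, then look the result up."""
    phonemes = [max(frame, key=frame.get) for frame in frames]
    for word, pronunciation in LEXICON.items():
        if pronunciation == phonemes:
            return word
    return None  # the committed phoneme string is not a word at all


def delayed_decision(frames: List[Dict[str, float]]) -> str:
    """Score whole words against all frames and decide only at the end."""
    def log_score(pronunciation: List[str]) -> float:
        return sum(
            math.log(frame.get(phoneme, 1e-9))
            for frame, phoneme in zip(frames, pronunciation)
        )
    return max(LEXICON, key=lambda word: log_score(LEXICON[word]))


if __name__ == "__main__":
    print("Early decision:  ", early_decision(FRAMES))
    print("Delayed decision:", delayed_decision(FRAMES))
```

With these numbers, committing frame by frame yields the phoneme string "b ae p", which matches no word in the lexicon, while scoring complete words against all the evidence selects "cap". The slightly more likely "b" in the first frame was a trap that only the delayed decision avoids.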

A friend of mine used to say “A good general is a lazy general”, meaning that when you have to take an important decision, the more you delay it to gather more data, the better the chance of eventually taking a good decision. The same concept applies to speech recognition. In fact, modern state-of-the-art speech recognizers (still based on ideas developed in the 1970s) do not take any decision until they have gained enough evidence that the decision, typically on which words were spoken, is, so to speak, the best possible decision given the input data and the knowledge the system has. And this is not done using simple frame-to-phoneme mapping tables, but using sophisticated statistical models (called Hidden Markov Models, or HMMs) of phonetic units that are estimated over thousands of hours of training speech. Yet, one of the open questions today is whether we have reached the end-of-life of these models and whether we need to look for something better, as I discussed in a previous post. Now, can we explain Hidden Markov Models in a simple way to make laypeople understand what they are, and help demystify speech recognition without describing it in the wrong way? Yes, we can, and I’ll try to do that in one of my future posts. But the point here is that, as I have said before, speech recognition has reached a stage where it can be successfully deployed in many applications, also thanks to the work of thousands of people who developed and improved the sophisticated algorithms and math behind it. However, to move to the next stage, we need to continue to work on it. The problem is neither simple, nor solved.
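For readers who want a taste of the "lazy general" at work, here is a minimal sketch of Viterbi decoding, the standard algorithm for finding the most likely state sequence of a Hidden Markov Model. It is a textbook toy, not a speech recognizer: the two states ("sil" for silence and "speech"), the three observation labels, and all the probabilities are made-up assumptions. The key property is that no state is chosen until the whole input has been seen; the answer is read off by tracing back from the end.

```python
import math
from typing import Dict, List, Sequence, Tuple


def viterbi(
    observations: Sequence[str],
    states: Sequence[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> Tuple[List[str], float]:
    """Return the most likely state path and its log probability."""
    # best[t][s]: log probability of the best path that ends in state s at time t
    best = [{
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }]
    back: List[Dict[str, str]] = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Keep every candidate alive; only remember the best predecessor.
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # The decision is taken only now, after all the evidence is in.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Toy model with invented numbers (no zero probabilities, so log is safe).
STATES = ["sil", "speech"]
START = {"sil": 0.6, "speech": 0.4}
TRANS = {
    "sil": {"sil": 0.7, "speech": 0.3},
    "speech": {"sil": 0.4, "speech": 0.6},
}
EMIT = {
    "sil": {"quiet": 0.5, "soft": 0.4, "loud": 0.1},
    "speech": {"quiet": 0.1, "soft": 0.3, "loud": 0.6},
}

if __name__ == "__main__":
    path, log_prob = viterbi(["quiet", "soft", "loud"], STATES, START, TRANS, EMIT)
    print(path, math.exp(log_prob))
```

Real recognizers apply the same principle to models of phonetic units chained into words and sentences, with probabilities estimated from training speech rather than typed in by hand.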

The hard job of getting meanings

•July 29, 2012 • 3 Comments

If I had to choose one of the areas of human-machine natural communication where we haven’t been able to make any significant stride during the past decades, I would choose “general” language understanding. Don’t get me wrong. Language understanding per se has made huge steps ahead. IBM Watson’s victory over Jeopardy! human champions is a testimony of that. However, Watson required the work of an army of the brightest minds in natural language processing, semantics, and computer science for several years. How replicable that is for any other domain, without the need to hire 40 world-class scientists, is questionable. And that’s the issue I am talking about. We can say that speech recognition (that is, going from “speech” to “words”) is a relatively simple technology. If you have the right tools, all you need is a lot of transcribed speech and you get something that allows you to put together an application that works in most situations. Of course, as I said in a previous post, machines are still far from human performance, especially in the presence of noise, reverberation, and other amenities where our million-year-old carbon-based technology still excels. But, at least in principle, commercial recognizers do not need 40 PhDs to be put into operation. And you do not even need to understand the intricacies of spoken language. The technology evolved towards machines that can learn from data. And that’s what we have today. It is not perfect, and many scientists are working at making it better. But I am sure the next generation of speech recognition systems will also be usable in any domain without having to understand the intricacies of speech production and perception. Unfortunately that is not true for language understanding (that is, going from “words” to “meanings”).
If you want to put something together, something like Siri, for instance, you have to understand grammars, parsing, semantic attachments, ontologies, part-of-speech tagging, and many other things that I am not going to mention here. Let’s see why.

The ultimate product of speech recognition is the words contained in utterances. You don’t have to have a PhD in linguistics or to be a computer scientist to understand what words are. Anyone with a good mastery of a language (and almost everyone masters their own mother tongue) can listen to an utterance and more or less identify the words in it. There is also a general agreement about the notion of words. Words are words are words. And we all learn words from childhood. So the output of a speech recognizer is quite well defined. As a consequence, it is relatively easy to come up with a lot (today in the hundreds of millions) of examples of utterances and their transcription into words. And if you master the art of machine learning, you can build, once and for all, a machine that learns from all of those transcribed utterances, and the same machine will also be able to guess words it has never seen before, since the machine will also learn the word constituents, the phonemes, and it will be able to put models of phonemes together to guess any possible existing or nonexistent word. And once that machine is built you have a general speech recognizer.

Instead, the ultimate product of language understanding is meaning. What is meaning? Linguists and semanticists have been arguing for decades on how to represent meaning, because a representation of meaning is not evident to us, or at least it is not as evident as words are to every speaker of a language (as long as the language “has” words). So, a representation of meaning, one of the many available in the literature, has to be defined and imposed, in order to start building a machine that would be able to extract meaning from spoken utterances or text. While we can easily transcribe large numbers of utterances into words and use them to create machine-learned speech recognizers, associating meanings to a large number of utterances, or large amounts of text, is far more laborious and far more error prone than transcription. Moreover, associating meanings to utterances or text can be done only if you have a rather good understanding of linguistics and the chosen meaning representation. And, besides that, meaning representations depend on domains. Think, for instance, of the meaning of the word “bank” in a financial domain, as opposed to the meaning of the same word in aeronautics or geography.

This is why, in my opinion, we do not yet have a general language understanding machine that can learn from data in any domain and in any situation. The point I want to make is that in artificial intelligence (I am using a general notion of AI here) we have been quite successful at building machines that can learn how to transform input signals into a representation that is somehow naturally evident, like words, but we have never been as successful when the representation is hidden, or artificially imposed by theories, like meanings. Is there a way out of this impasse, towards building machines that truly understand language in a general way and are not as domain-specific as the ones we have today? I hope so.

Post Scriptum: Humans express meanings naturally not by using a sophisticated symbolic representation, but by rephrasing the original sentence into another. For instance, if you ask someone what the sentence “I am going to the market to buy potatoes” means, he or she may say something like “this means that the person who is speaking has the intention to go to a place where they typically sell fresh produce to acquire edible tubers.” This process can go on ad infinitum by substituting every word of the new sentence with its definition, taken for instance from a dictionary. So, if we do another round of substitutions, we may get something like “…has the determination to move in a course towards a physical environment where they give up property to others in exchange for value such as unaltered agricultural products to come into possession of fit to be eaten fleshy buds growing underground.”
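To make the substitution process concrete, here is a minimal sketch in Python. The tiny dictionary and the function name `paraphrase` are invented for illustration only; a real dictionary would also have to deal with word senses, inflections and multi-word entries, which this toy ignores. Each round simply replaces every word that has an entry with its definition.

```python
# Toy dictionary, made up for this illustration (not taken from any real lexicon).
TOY_DICTIONARY = {
    "buy": "acquire",
    "potatoes": "edible tubers",
    "acquire": "come into possession of",
    "edible": "fit to be eaten",
    "tubers": "fleshy buds growing underground",
}


def paraphrase(sentence: str, dictionary: dict, rounds: int = 1) -> str:
    """Replace each word by its definition, repeating for the given number of rounds."""
    words = sentence.split()
    for _ in range(rounds):
        expanded = []
        for word in words:
            # Words without an entry are kept as they are.
            expanded.extend(dictionary.get(word, word).split())
        words = expanded
    return " ".join(words)


if __name__ == "__main__":
    for rounds in range(3):
        print(rounds, paraphrase("buy potatoes", TOY_DICTIONARY, rounds))
```

Every round produces a longer sentence made of more words, which in turn need definitions of their own. The output never stops being language, which is the point: the paraphrase is not a symbolic representation of meaning, just more words.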

Apples and Oranges

•July 17, 2012 • 1 Comment

There is a lot of talk about the performance of Apple’s Siri. An article that appeared in the New York Times a few days ago brutally criticized Siri’s performance, and others compare it with Google Voice Search. As a professional in the field who has followed Google Voice Search closely and knows well the people who work on it (many of them former colleagues in previous lives and respected scientists in the community), I know and trust it is definitely a state-of-the-art (and beyond) piece of technology. As far as I know it is based on the latest and greatest mathematical models, it uses a lot (and I mean “a lot”) of data, and I trust my Google friends are squeezing this data to the last bit of information in order to get the best possible performance. I have to admit I know very little about Siri. It is general knowledge that it uses Nuance’s speech recognizer, but beyond that we know very little. While the scientists from Google are often present at the most prestigious conferences in the field, and often present results to the community, I haven’t met anyone from Siri yet at one of these conferences, nor have I seen a scientific paper on it. I may have missed it, but I think it is Apple’s policy not to divulge what is cooking inside their labs. So, besides the speech recognizer, I don’t know the details of how Siri works, although I have some ideas about it. But the point I want to make here is that comparing Siri with Google Voice Search is a little bit like comparing Apples and Oranges (the pun is intended). Let me explain why.

When you speak into Google Voice Search, what comes out is text. Nothing more than text that represents a more or less accurate word-by-word transcription of what you just said. The text is fed to the traditional Google search, which returns a traditional Google list of results, or “links”. Siri, on the other hand, tries to do something more. Text is an intermediate product, provided by the speech recognizer as a transcription of what you just said. But Siri then uses the text to “understand” what you just said, and provide an answer. So, for instance, if you speak into Google and say “set up an appointment with John Doe for next Tuesday at three”, Google Voice Search will show you links to Web pages about setting up appointments, pages about Mr. John Doe, and pages having to do with Tuesdays and the number three. Siri, instead, if it works correctly, will pop up your calendar for next Tuesday and eventually set up the requested appointment. How do we compare them? We should probably compare them on something that is common to both, something that both do. For instance, the word-by-word transcription of what you said. Speech researchers have been doing this type of evaluation for decades using a measure called word error rate (or WER). It is computed by taking, for every test utterance, the correct sequence of words and the corresponding sequence of words returned by the speech recognizer, aligning them, and counting the words that were substituted, spuriously inserted, or deleted, relative to the number of words in the correct sequence. Typically the WER is computed on a test set of a few thousand utterances, randomly selected in order not to introduce any bias. And the very same utterances should be fed to both systems, in order for the test to be fair. That is not an easy task for someone who is not an insider on the two systems at hand, but once some technical problems are solved, it could be done more or less accurately.
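For readers who want to see what a WER computation looks like, here is a minimal sketch in Python. It is not the scoring tool used by any of the systems mentioned here. It assumes the usual textbook definition (substitutions, insertions and deletions found by a minimum edit distance alignment, divided by the number of reference words), and the function names and the naive lowercase-and-split tokenization are my own choices for illustration.

```python
from typing import Iterable, List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of word substitutions, insertions and deletions
    needed to turn the reference sequence into the hypothesis sequence."""
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing from hypothesis
            insertion = current[j - 1] + 1   # spurious word in hypothesis
            current[j] = min(substitution, deletion, insertion)
        previous = current
    return previous[-1]


def tokenize(text: str) -> List[str]:
    """Naive tokenization (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


def word_error_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """WER over a test set of (correct transcription, recognizer output) pairs."""
    errors = 0
    reference_words = 0
    for reference, hypothesis in pairs:
        ref, hyp = tokenize(reference), tokenize(hypothesis)
        errors += edit_distance(ref, hyp)
        reference_words += len(ref)
    if reference_words == 0:
        raise ValueError("the test set contains no reference words")
    return errors / reference_words


if __name__ == "__main__":
    test_set = [
        ("set up an appointment with John Doe for next Tuesday at three",
         "set up an appointment with John Doe next Tuesday at tree"),
    ]
    print(f"WER: {word_error_rate(test_set):.3f}")
```

In the example the recognizer dropped “for” and heard “tree” instead of “three”: two errors over twelve reference words. To compare two recognizers fairly, the same list of reference transcriptions would be paired with the output of each system and scored with the same function.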

Now Google Voice Search stops there. There is nothing beyond the textual transcription for Google (although I notice more and more direct answers provided by Google to direct questions, a sign they are doing some sort of language understanding). In Siri, as we saw, there is more. But when we get to the realm of meaning, and not just words, things start to become complicated, and there are no easy-to-implement standard tests for gauging the accuracy of a language understanding system. Scientists have tried many times in the past, but it was always a hard task and, worse, it always depended very much on the application, and was not easily scalable or portable to different domains. And the problem is, if we can’t measure things, we cannot improve them easily. Now, I am sure that Apple scientists have their own way to measure the average accuracy of Siri’s answers (I would have one if I were one of them), but I know it’s not an easy task. In conclusion, we can’t really compare one with the other, since they do different things, but we can certainly say that one makes its users happier than the other.