Programming Your Own Jibo

August 6, 2014

Jibo Blog

This is Andy Atkins, VP of Engineering here at Jibo.

First off, I want to send out a big “Thank You” to all of you for the overwhelming show of support you have given Jibo since we launched our crowdfunding campaign. Your contributions, questions, and the “buzz” you’ve helped create confirm that we’re on to something here, and we want to be as transparent as we can about what we’re doing and where we are.

As I’ve been wading through the email that has come in since we launched the campaign, it is clear that many of you are dying to learn more about Jibo: its applications and capabilities, privacy and security, and how one might develop additional “skills” for Jibo. In response, we’ll continue to update our FAQs to address as many of your shared questions as we can.

Over the coming weeks and months, I’ll also…

View original post 1,187 more words

High Tech AND High Touch

July 15, 2014

Aging and technology. From the Jibo Blog.

Jibo Blog

Two of the most undeniable trends in our world today are the unprecedented aging demographics and the ever-increasing pace of technology innovation.

In the past 100 years, we have added 30 years to average life expectancy and the 85+ age group is the fastest growing segment of the population. By 2050, the population of centenarians is projected to reach nearly 6 million. It’s a whole new – and increasingly grey – world. At the same time, we are seeing shortages across the board in aging-related care professions including geriatricians, certified nurse assistants and home care aides.

In today’s connected world, technology is streamlining business processes, enabling new business models, and changing the way people interact with each other. We have seen technology transform industries, and the demographic imperative ahead of us is forcing individuals, families, communities, and countries to think in new ways about new models and opportunities at the…

View original post 644 more words

We Are Robot

July 4, 2014

I am Robot! Patrick Hanlon speculates on the future of social robots.

Jibo Blog

This blog post also appeared on the Forbes website.

We are entering a new era of technological connectivity. We already have ‘smart products’ and ‘wearable devices’ and ‘the Internet of Things’.

Now there are robots, too.

Actually, this is not new either. Robots have been utilized in manufacturing for the last two decades, lumbering back and forth between assembly points, dropping off raw materials or delivering assembly parts and final products.

The difference now is that these new robots do not lumber. They skitter. They wink at you. They are deliberately designed, much like C3PO, to mimic our actions and register an emotional context. Whereas robots of the past worked for us, the latest versions (social robots) want to work with us. More significantly, they want us to befriend them.

This is a dramatic shift in thinking that may take some getting used to. Or not. In fact, the device that…

View original post 1,209 more words

The Next Era of Conversational Technology

July 2, 2014

This blog post originally appeared on the Jibo Blog.

In 1922, the Elmwood Button Co. commercialized a toy that could respond to voice commands. The toy, called Radio Rex, was a dog made of celluloid that would leap out of its house when you called its name: “Rex!” That was two decades before the first digital computer was even conceived, and at least three generations before the Internet would jet our lives from the real world into the virtual realms of e-commerce, social networks, and cloud computing. Radio Rex’s voice recognition technology, so to speak, was quite crude. Practically speaking, the vibration produced by the “r” in its name, when spoken loudly, would cause a thin metal blade to open a circuit that powered the electromagnet that held Rex in its house, causing Rex to jump out. A few years ago, I watched in awe one of the few remaining original Radio Rex toys in action. It was still working.


Some people recognize Radio Rex as the first speech recognition technology applied to a commercial product. Whether that could be called speech recognition or not, we have to agree that the little celluloid dog was probably one of the first examples of a voice controlled consumer robot. While rudimentary, I can imagine the delight and wonder that people felt when Rex responded to their own voice. Magic!

Indeed, speech recognition has something magical about it. That’s why I have spent more than 30 years of my life building machines that understand speech, and many of my colleagues have done the same. The machines we have built “understand” speech so well that there is a billion-dollar industry based on this specific technology. Millions of people talk to machines every day, whether it is to get customer support, find a neighborhood restaurant, make an appointment in their calendar, or search the Web.

But these speech-understanding machines do not have what Radio Rex had, even in its own primitive simplicity: an animated body that would respond to voice. Today, social robots do. And even more: they can talk back to you, see you, recognize who you are and where you are, respond to your touch, display text or graphics, and express emotions that can resonate with your own.

When we consider voice interaction for a social robot, what type of speech and language capabilities should it have? What are the challenges? Yes, of course Siri and Google Now have paved the way for speech-based personal assistants, but despite their perceived popularity, is speech recognition technology ready for a consumer social robot? The answer is yes.

First, let me explain the differences among the speech recognition technologies used for telephone applications (à la “Please say: account, payments, or technical support”), for personal assistants (like Siri, Google Now, and most recently Cortana), and for a social robot.

In telephone-based applications, the speech recognition protocol is synchronous and typically initiated by the machine. The user is expected to speak only during well-determined intervals of time in response to a system prompt (“Please say …”). If the user does not speak during that interval, or if the user says something that is not encoded in the system, the speech recognition machine times out, and the system speaks a new prompt. This continues until something relevant is said and recognized, or until either the machine or the user loses patience and bails out. We have all experienced this frustration at some point.
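The rigid, synchronous turn-taking described above is easy to caricature in a few lines of code. Here is a minimal sketch (hypothetical, not any real IVR platform), where the recognizer only accepts phrases pre-encoded in the menu and the system re-prompts on every timeout or misrecognition:

```python
# Hypothetical sketch of one machine-initiated, synchronous IVR exchange.
MENU = {"account", "payments", "technical support"}

def ivr_turn(utterances, max_retries=3):
    """`utterances` stands in for the recognizer: each item is what the
    caller said inside the response window, or None on a timeout."""
    attempts = iter(utterances)
    for _ in range(max_retries):
        print("Please say: account, payments, or technical support.")
        heard = next(attempts, None)
        if heard in MENU:                      # only pre-encoded phrases count
            return heard
        print("Sorry, I didn't understand.")   # timeout or out-of-grammar speech
    return None                                # patience exhausted: bail out
```

For example, `ivr_turn(["umm", None, "payments"])` burns two retries on out-of-grammar speech and a timeout before finally succeeding, which is exactly the clunkiness callers complain about.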

In telephone-based systems, any type of information has to be conveyed through the audio channel, including instructions for users on what to say, the information requested, and the possible clarification of any misunderstanding between the human and the machine (including the infamous “I think you said …”). Anything spoken has to be limited in time, in a rigid, turn-taking fashion between the system and the user. That is why telephone-based systems are so “clunky”, to the point where, more often than not, they are quite annoyingly mechanical and repetitive. Although I built some of the most sophisticated and well-crafted telephone-based speech recognition systems for over a decade, and these systems are used by millions of people every week, I have never heard anyone say: “I love my provider’s automated customer care service. It is so much fun to use that I call it every day!”

Then came the smartphone. The fact that a smartphone includes a telephone is marginal. A smartphone is a computer, a camera, a media appliance, an Internet device, a computer game box, a chatterbox, a way to escape from boring meetings, and yes, also a telephone, for those who insist on making calls. Although it hasn’t happened yet, smartphones have the potential to make clunky, telephone-based speech understanding systems obsolete. Why? Because with a smartphone, you are not forced to communicate only through the narrow audio channel. In addition to speech, you can enter text and use touch. A smartphone can talk back to you, and display text, images, and videos. The smartphone speech recognition revolution started with Siri and Google voice search. Today, more and more smartphone services are adopting speech as an alternative interface blended with text and touch modalities.

The distinction between telephone-based, audio-only services and smartphone speech understanding is not just in the multi-modality of the communication protocol. The advent of fast broadband Internet has freed us from the necessity of having all the software on one device. Speech recognition can run in the cloud with the almost unlimited power of server farms and can truly recognize pretty much everything that you say.

Wait a minute. Just recognize? How about understanding the meaning of what you say?

There is a distinction between recognition, which is the mere transcription of the spoken words, and understanding, which is the extraction and representation of the meaning. The latter is the domain of natural language processing. While progress in speech recognition has been substantial and persistent over the past decades, natural language has moved forward more slowly. Siri and its cloud-based colleagues are capable of understanding more and more, and even some of the nuances of our languages. The time is finally ripe for speech recognition and natural language understanding to enable natural human-machine communication.
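The recognition/understanding distinction can be made concrete with a toy example: recognition stops at the transcript, while understanding maps the words to a structured intent. The intents and patterns below are invented for illustration and bear no relation to how Siri or any real system works:

```python
import re

def recognize(audio_transcript: str) -> str:
    """'Recognition' stops here: a plain transcription of the words."""
    return audio_transcript

# Invented intent patterns, purely for illustration.
INTENT_PATTERNS = {
    "set_alarm": re.compile(r"\b(wake me|set an? alarm)\b.*?(\d{1,2})\b"),
    "find_food": re.compile(r"\b(restaurant|somewhere to eat)\b"),
}

def understand(words: str) -> dict:
    """'Understanding' extracts a structured meaning from the words."""
    text = words.lower()
    for intent, pattern in INTENT_PATTERNS.items():
        match = pattern.search(text)
        if match:
            return {"intent": intent, "slots": match.groups()}
    return {"intent": "unknown", "slots": ()}
```

A transcript like “wake me at 7” is just a string to `recognize`, but `understand` turns it into an actionable intent with a time slot, which is the part that has historically lagged behind.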

But…is that all? Is talking to our phones and tablets or open rooms — as in the upcoming home automation applications — the ultimate goal of spoken communication with machines?

No. Social robots are the next revolution.

This revolution was predicted by the visionary work of pioneers like Cynthia Breazeal (read Cynthia’s blog here), and it has been corroborated and validated by the recent announcements of social robot products coming to market. It is made possible by the confluence of the necessary technologies that are available today.

Social robots have everything that personal assistants have—speech, display, touch—but also a body that can move, a vision system that can recognize local environments, and microphones that can locate and focus on where sounds and speech are coming from. How will a social robot interact with speech among all the other modalities?

We went from prompted speech, like in the telephone systems, to personal assistants that can come to life on your smartphone. For a social robot, there should neither be a “speak now” timeout window, nor the antiquated “speak after the beep” protocol, nor a button to push to start speaking. Instead, a social robot should be there for you all the time and respond when you talk to it. You should be able to naturally cue your robot that you are addressing it by speaking its name (like Star Trek’s “Hello Computer!”) or simply by directing your gaze toward it while you talk to it. Fortunately, the technology is ripe for this capability, too. Highly accurate hot-word speech recognition—like in Google Now—and phrase spotting (a technology that allows the system to ignore everything that is spoken except for defined key-phrases) are available today.
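Phrase spotting, in particular, is simple to illustrate: the system scans a running transcript and keeps only defined key-phrases, discarding everything else. A minimal sketch, with invented hot-words:

```python
# Minimal phrase-spotting sketch: keep only defined key-phrases from a
# continuous transcript and ignore all other speech. The hot-words below
# are invented examples.
KEY_PHRASES = ("hey robot", "open the door", "play some music")

def spot_phrases(transcript: str) -> list:
    """Return the key-phrases found, in order of appearance."""
    text = transcript.lower()
    hits = []
    for phrase in KEY_PHRASES:
        start = 0
        while (i := text.find(phrase, start)) != -1:
            hits.append((i, phrase))          # remember where it occurred
            start = i + len(phrase)
    return [phrase for _, phrase in sorted(hits)]
```

Everything outside the key-phrase list is simply never surfaced, which is what lets an always-listening device stay quiet until it is actually addressed. (A real system spots phrases in the audio itself rather than in a finished transcript, but the filtering idea is the same.)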

While speech recognition accuracy is always important, we should consider that social robots can have multiple communication channels: speech, touch, vision, and possibly others. In many cases, the different channels can reinforce each other, either simultaneously or sequentially, in case one of them fails. Think about speech in a noisy situation. When speech is the only channel, the only recourse is to repeat, ad nauseam, the endless “I did not understand what you said, please say that again?” For social robots, however, touch and gesture are powerful alternatives for getting your intent across in situations where speech is difficult to process.

In certain situations, two alternative modalities can also complement each other. Consider the problem of identifying a user. This can be done either via voice, with a technology called speaker ID, or through machine vision, using face recognition. These two technologies can be active at the same time and cooperate, or be used independently: speaker ID when the light is too low, or face ID when the noise is too high.
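As a sketch of how such cooperation might look, here is a hypothetical modality selector: it trusts face ID when the room is noisy, speaker ID when it is dark, and averages the two confidences otherwise. All thresholds, scores, and names are invented for illustration:

```python
# Hypothetical fusion of speaker ID and face ID based on ambient conditions.
# Thresholds and the averaging rule are invented for illustration.
def identify_user(speaker_scores, face_scores, noise_db, lux):
    """Each *_scores maps user -> confidence in [0, 1]."""
    noisy = noise_db > 70      # speech channel unreliable
    dark = lux < 10            # vision channel unreliable
    if noisy and not dark:
        fused = face_scores                    # fall back to vision
    elif dark and not noisy:
        fused = speaker_scores                 # fall back to voice
    else:
        # Both channels usable: average the confidences per user.
        users = set(speaker_scores) | set(face_scores)
        fused = {u: (speaker_scores.get(u, 0) + face_scores.get(u, 0)) / 2
                 for u in users}
    return max(fused, key=fused.get) if fused else None
```

The point is not the particular rule but that the robot always has some channel to fall back on, where an audio-only system would simply fail.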

Looking ahead, speech, touch, and image processing can also cooperate to recognize a person’s emotion, locate the speaker in the room, understand the speaker’s gestures, and more. Imagine the possibilities when social robots can understand natural human behavior, rather than the stilted conventions we need to follow today for machines to understand what we mean.

I have spent my career working at the forefront of human-machine communication to make it ever more natural, robust, textured – and, yes, magical. Now, we are at the dawn of the era of social robots. I am thrilled to help bring this new experience to the world. Just wait and see.

March 14, 2013


Google (GOOG) has acquired a startup from the University of Toronto’s computer science department. The “ground-breaking” startup called DNNresearch Inc was founded by University professor Geoffrey Hinton and two of his graduate students in 2012. Google was interested in the company’s research on deep neural networks, which will assist the company in improving its speech and image recognition software. Professor Hinton will now split his time between his work at the University and continuing his research with Google. Financial terms of the deal were not disclosed. The University of Toronto’s press release follows below.

View original post 356 more words

Musings on Semantics

October 9, 2012

This is a re-posting of my editorial in the latest ICSI Newsletter.

W. Brian Arthur, in his book The Nature of Technology: What It Is and How It Evolves, describes the evolution of technology as a combinatorial process. Each new technology consists of a combination of existing technologies that “beget further technologies.” Moreover, each technology springs from the harnessing of one or more physical, behavioral, mathematical, or logical principles that constitute its foundation. Innovation proceeds either through the establishment of new principles—which is typically, but not only, the domain of science—or through new combinations of existing technologies.

Not all technologies are alike, however; sometimes a single new technology, an enabler, generates a myriad of new possibilities that lead to the creation of new industries and new economies, and in some very rare occasions contribute to the definition of a new era of our civilization. Such were the steam engine, the digital computer, and the Internet.

We may find ourselves wondering what the next enabler will be. Of course no one knows for sure, and any attempt to make a prediction will most likely be wrong. But researchers have a special role in our technological future. They cannot predict what the future will be but, to paraphrase Alan Kay’s famous quote, they can attempt to create it. Well, looking at the trends of current research in information technology, we can definitely see that the attempt to create a new future based on automatically deriving higher levels of data understanding is today one of the most challenging endeavors researchers have embarked on. Let me be more specific.

Our era is characterized by an unprecedented amount of information. It is no surprise that a significant amount of technological research today is devoted to the creation, management, and understanding, by computers, of the wealth of data (text, images, and sounds) around us. But it is understanding that is the most challenging of these problems and the farthest from a satisfactory solution. A large portion of the research community aims to devise algorithms and systems that automatically extract the meaning, the semantics, from raw data and signals. In fact, a lot of the research carried out at ICSI, as at many other research centers, can be described as “looking for meaning in raw data.” Research on natural language and on visual and multimedia signals is diving into the problem of deeper understanding. Beyond the mere (so to speak) recognition of words and entities in language and objects in images, we are now trying to get at deeper content, such as metaphors and visual concepts. But it’s not just that. Research in network security, for instance, is trying to assign meanings to the patterns found in streams of communication among computers in order to detect possible intrusions and attacks, while theoretical computer scientists are trying to find meaning in DNA strings and in networks of brain activity.

However, we are not quite there yet. Think, for instance, about the promises made by the vision of the semantification of the whole Web. The vast majority of the Web comprises unstructured raw data: text, images, audio, and video. Tim Berners-Lee was the first to envision a semantic Web, and many have been working toward that dream, with limited degrees of success. Even though many agree on ways to encode a semantic Web, and Google’s knowledge graph is one of the most advanced large-scale attempts to structure factual knowledge and make it readily available to everyone, a full semantic Web is not there yet. The knowledge graph starts from existing structured knowledge, for instance the facts about Albert Einstein’s life, and connects that structured knowledge to Web searches. The knowledge graph includes millions of entries, which is an infinitesimally small number compared to the vast universe of the Web. Are you and your papers on the knowledge graph? Are all recent world facts, blog entries, and opinions on the European financial crisis in the knowledge graph? Maybe they will be, but the questions of coverage and of keeping the information fresh and updated are yet to be solved. And there is an even more serious issue: the Web is not just text. For instance, the amount of video on the Web is growing at a mind-boggling rate. Some recently published statistics about YouTube estimate that 72 hours of video are uploaded every minute, over 4 billion hours are watched every month, and in 2011 alone, YouTube had more than 1 trillion views, the equivalent of over 140 views for every person on Earth.

It is true that we have ways to encode semantics into Web pages, as seen in the work of the W3C; semantic representation languages like OWL are widely used today. But with the Web growing at dizzying speed, any attempt to manually annotate every unstructured element—text or video—with its meaning, or something closely related to it, is bound to fail. If we want to fulfill the dream of a fully semantic Web, we need methods for automatically understanding text, images, and videos, including speech, music, and general audio.
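To make “encoding semantics” concrete: structured knowledge of this kind boils down to subject-predicate-object triples that a machine can query by pattern rather than by keyword. Here is a toy, in-memory sketch (the facts and vocabulary are invented examples, and no real RDF/OWL tooling is involved):

```python
# Toy triple store illustrating structured, semantic queries versus
# keyword search. The facts and predicate names are invented examples.
TRIPLES = [
    ("Albert_Einstein", "bornIn", "Ulm"),
    ("Albert_Einstein", "field", "Physics"),
    ("Ulm", "locatedIn", "Germany"),
]

def query(s=None, p=None, o=None):
    """Match a triple pattern; None acts as a wildcard."""
    return [(ts, tp, to) for ts, tp, to in TRIPLES
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# "Where was Einstein born?" becomes a structured pattern, not a text match:
birthplaces = [o for _, _, o in query(s="Albert_Einstein", p="bornIn")]
```

A keyword search can only find pages containing the word “born”; the triple pattern answers the question directly, and that difference is exactly what the semantic Web is after.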

The enabling potential of a fully semantic Web is huge. A fully semantic Web will change the concept of searching the Web into that of asking questions of it and getting answers, not just, as is sometimes possible today, in restricted domains, but everywhere, about any topic, no matter the size, or popularity, or language. It will help transform mere facts into structured data and actionable information, not just about Albert Einstein and other famous people, but also about you, the grocery store at the corner, and what your friends write on their blogs. It will complete the process of moving from raw data—the gazillions of pages of online text and video—toward higher abstractions of knowledge, understanding, and wisdom. And that’s not all. Semantification of the whole Web will enable the flourishing of new areas of the information industry that are not possible today, or that are possible with a lot of handcrafting and ad-hoc solutions in limited domains and hardly scalable to other domains, languages, and media. It will allow us to interact with the Web as we do with humans, asking it questions in the same way we ask human experts questions. It will allow us to automatically compare sources of information for accuracy and truthfulness, even if they are in different languages.

However, the true, full semantification of the Web is not just an ambitious dream, but a necessity. We may reach a point, in the not too distant future, when the Web will be so large that current search methods will no longer be the most effective way to find the information we need in the format we need it. How do we sort through an increasingly vast number of documents, and not just text, on a particular topic, with many independent and conflicting opinions, comprising true and false statements and different points of view? We already have to use specialized search engines to find the cheapest flight among independent travel sites and aggregators, and to find the most economical reseller of a particular item. That is possible today because of the underlying structured form of the commercial sites. In a way, they are already semantified. But think, for a minute, of doing the same thing with unstructured information: raw text, audio, images, and video. Think about searching for documents that “have the same meaning” as another document, regardless of their language, form, and wording.

I don’t know if we will see a full semantification of the Web in our lifetimes. I don’t know if that is even possible, or whether it is the great enabler we dream of. But one thing is certain: research is clearly moving toward a deeper understanding of the information around us, and if it succeeds, these technologies will have a far greater social, economic, and political influence on our lives. The full semantification of the Web, whenever it happens, will be a game changer of enormous proportions, an enabler of industries and services that will impact all aspects of our lives. We are working on it.

Open the pod bay doors … HAL

August 10, 2012

When astronaut Dave Bowman tries to return to the mothership, HAL refuses to open the pod bay door with the calm and categorical famous line, “I’m sorry, Dave. I’m afraid I can’t do that,” in one of the most dramatic scenes of the movie 2001: A Space Odyssey.

Now, imagine HAL had actually been built with the speech recognition technology available in 2001. This is, more or less, how the conversation would have gone instead, according to my friend, VUI designer and sage Jonathan Bloom:

HAL: Welcome to HAL. Please listen carefully, as our options have changed. You can say, “Open the pod bay doors,” “Play chess” . . .
Dave: Open the pod bay doors, HAL.
HAL: Sorry, I didn’t understand. You can say, “Open the pod bay doors” or press 1, “Play chess” or press 2, “Learn the alphorn” or press 3 . . .
Dave: <presses 1>
HAL: Around what time would you like the pod bay doors to open? You can say “Today,” “Tomorrow,” . . .
Dave: Now! Damn you!
HAL: Main menu. Please listen carefully, as our options have changed. You can say, “Open the pod bay doors,” or press 1, “Play chess” or press 2 . . .
Dave: <emphatically presses 1 multiple times>
HAL: I’m really, really sorry. Even though your call is really, really important to us, this mission is even more important, and I can’t allow you to jeopardize it.
Dave: I don’t know what you’re talking about, HAL.
HAL: I know you and Frank were planning to press 0 for an agent, and I’m afraid that’s something I cannot allow to happen.