The hard job of getting meanings

If I had to choose one of the areas of human-machine natural communication where we haven’t been able to make any significant stride during the past decades, I would choose “general” language understanding. Don’t get me wrong. Language understanding per se has made huge steps ahead. IBM Watson‘s victory over Jeopardy! human champions is a testament to that. However, Watson required the work of an army of the brightest minds in natural language processing, semantics, and computer science for several years. How replicable that is for any other domain, without the need of hiring 40 world-class scientists, is questionable. And that’s the issue I am talking about.

We can say that speech recognition (that is, going from “speech” to “words”) is a relatively simple technology. If you have the right tools, all you need is a lot of transcribed speech, and you get something that allows you to put together an application that works in most situations. Of course, as I said in a previous post, machines are still far from human performance, especially in the presence of noise, reverberation, and the other amenities where our million-year-old carbon-based technology still excels. But, at least in principle, commercial recognizers do not need 40 PhDs to be put into operation. And you do not even need to understand the intricacies of spoken language. The technology evolved towards machines that can learn from data. And that’s what we have today. It is not perfect, and many scientists are working on making it better. But I am sure the next generation of speech recognition systems will also be usable in any domain without having to understand the intricacies of speech production and perception.

Unfortunately, that is not true for language understanding (that is, going from “words” to “meanings”). If you want to put something together, something like Siri, for instance, you have to understand grammars, parsing, semantic attachments, ontologies, part-of-speech tagging, and many other things that I am not going to mention here. Let’s see why.

The ultimate product of speech recognition is the words contained in utterances. You don’t have to have a PhD in linguistics or to be a computer scientist to understand what words are. Anyone with a good mastery of a language–and almost everyone masters their own mother tongue–can listen to an utterance and more or less identify the words in it. There is also general agreement about the notion of words. Words are words are words. And we all learn words from childhood. So the output of a speech recognizer is quite well defined. As a consequence, it is relatively easy to come up with a lot (today in the hundreds of millions) of examples of utterances and their transcription into words. And if you master the art of machine learning, you can build, once and for all, a machine that learns from all of those transcribed utterances, and the same machine will also be able to guess words it has never seen before, since it will also learn the word constituents, the phonemes, and it will be able to put models of phonemes together to guess any possible existing or non-existing word. And once that machine is built, you have a general speech recognizer.
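To make that compositional point concrete, here is a minimal Python sketch. Everything in it is an invented stand-in: the two-word lexicon, the toy scoring arithmetic, the fake audio frames. A real recognizer would use learned acoustic models (HMMs or neural networks) and dynamic-programming alignment; this only illustrates the idea that word models are assembled from phoneme models, so unseen words cost nothing extra.

```python
# Toy sketch of composing phoneme models into word models.
# LEXICON and phoneme_score are hypothetical stand-ins, not a real system.

LEXICON = {                      # pronunciation lexicon: word -> phonemes
    "potato": ["p", "ah", "t", "ey", "t", "ow"],
    "market": ["m", "aa", "r", "k", "ah", "t"],
}

def phoneme_score(phoneme, frame):
    """Stand-in acoustic model: score one audio frame against a phoneme."""
    return -abs(sum(frame) - len(phoneme) * 2.0)     # toy arithmetic only

def word_score(word, frames):
    """Score a word by chaining its phoneme models over the audio frames."""
    phonemes = LEXICON[word]
    span = max(1, len(frames) // len(phonemes))      # crude uniform alignment
    return sum(
        phoneme_score(ph, frame)
        for i, ph in enumerate(phonemes)
        for frame in frames[i * span:(i + 1) * span]
    )

# Recognition is then a search for the best-scoring word:
frames = [[0.5, 1.0], [2.1, 0.3], [1.7, 0.9]]        # fake acoustic frames
print(max(LEXICON, key=lambda w: word_score(w, frames)))
```

The point of the sketch is in the lexicon: adding a brand-new word requires only its phoneme spelling, not new training data, which is why the phoneme-level machine generalizes to words it has never heard.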

Instead, the ultimate product of language understanding is meaning. What is meaning? Linguists and semanticists have been arguing for decades on how to represent meaning, because a representation of meaning is not evident to us, or at least it is not as evident as words are to every speaker of a language (as long as the language “has” words). So a representation of meaning, one of the many available in the literature, has to be defined and imposed in order to start building a machine that would be able to extract meaning from spoken utterances or text. While we can easily transcribe large numbers of utterances into words and use them to create machine-learned speech recognizers, associating meanings with a large number of utterances, or with large amounts of text, is way more laborious, and way more error-prone, than transcription. Moreover, associating meanings with utterances or text can be done only if you have a rather good understanding of linguistics and of the chosen meaning representation. And, besides that, meaning representations depend on domains. Think, for instance, of the meaning of the word “bank” in a financial domain, as opposed to the meaning of the same word in aeronautics or geography.
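As a toy illustration of that domain dependence, the sketch below (the glosses and domain names are made up for the example) shows how the very same word maps to entirely different concepts depending on which domain ontology the system builder committed to before any learning started:

```python
# Hypothetical domain ontologies: the word "bank" has a different
# meaning in each one, so the representation must be chosen up front.

DOMAIN_SENSES = {
    "finance":     {"bank": "institution that holds deposits and makes loans"},
    "aeronautics": {"bank": "roll of an aircraft about its longitudinal axis"},
    "geography":   {"bank": "sloping land alongside a river"},
}

def meaning_of(word, domain):
    """Look up a word's meaning inside one chosen domain ontology."""
    try:
        return DOMAIN_SENSES[domain][word]
    except KeyError:
        raise KeyError(f"no sense for {word!r} in the {domain!r} domain")

print(meaning_of("bank", "finance"))      # deposits and loans
print(meaning_of("bank", "aeronautics"))  # aircraft roll
```

Notice what the sketch does not contain: a “general” table that works for every domain at once. That missing table is exactly the hard part.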

This is why, in my opinion, we do not yet have general language understanding that can learn from data in any domain and in any situation. The point I want to make is that in artificial intelligence–I am using a general notion of AI here–we have been quite successful at building machines that can learn how to transform input signals into a representation which is somehow naturally evident, like words, but we have never been as successful when the representation is hidden, or artificially imposed by theories, like meanings. Is there a way out of this impasse, towards building machines that truly understand language in a general way and are not as domain-specific as the ones we have today? I hope so.

Post Scriptum: Humans express meanings naturally not by using a sophisticated symbolic representation, but by rephrasing the original sentence into another. For instance, if you ask someone what the sentence “I am going to the market to buy potatoes” means, he or she may say something like “this means that the person who is speaking has the intention to go to a place where they typically sell fresh produce to acquire edible tubers.” This process can go on ad infinitum by substituting every word of the new sentence with its definition, for instance taken from a dictionary. So, if we do that for another round of substitutions, we may get something like “…has the determination to move in a course towards a physical environment where they give up property to others in exchange for value, such as unaltered agricultural products, to come into possession of fit-to-be-eaten fleshy buds growing underground.”
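For the curious, here is a minimal sketch of that substitution game. The three-entry dictionary is made up for the example; plugging in a real dictionary would make the paraphrase balloon exactly as in the sentences above.

```python
# Toy version of "meaning by rephrasing": each round replaces every
# word that has a dictionary entry with its definition.

TOY_DICTIONARY = {
    "market":   "place where they typically sell fresh produce",
    "buy":      "acquire in exchange for value",
    "potatoes": "edible tubers",
}

def rephrase(sentence):
    """One round of substitution: each defined word becomes its definition."""
    return " ".join(TOY_DICTIONARY.get(w, w) for w in sentence.split())

sentence = "I am going to the market to buy potatoes"
for _ in range(2):   # each round expands the sentence further, ad infinitum
    sentence = rephrase(sentence)
    print(sentence)
```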

~ by Roberto Pieraccini on July 29, 2012.

3 Responses to “The hard job of getting meanings”

  1. A nice analysis of the problem Roberto, but where to look for the solution? I believe that there are two key areas that might be profitable: ‘enactivism’ and ‘metaphor’. Enactivism is the idea (kicked off by Maturana and Varela in their seminal book ‘The Tree of Knowledge’) that meaning is “embodied”, i.e. an organism ‘understands’ its environment by virtue of having a physical embodiment that is able to act within that environment. Likewise, an organism understands the ‘intentions’ of another organism by virtue of being able to enact those intentions itself. Metaphor (as proposed by Feldman in his book ‘From Molecules to Metaphor’) can be seen as a mechanism for interpolating from ‘grounded’ (i.e. physically-based) meaning to more abstract representations. I’ve written/spoken about most of these ideas in various recent papers.

  2. Thanks Roger. Very interesting answer and very interesting potential solutions. I know Jerry Feldman, he works with me here at ICSI, and he is involved in a quite ambitious project, called Metanet, where they are studying metaphors in different languages and from different perspectives (even the one related to the neurosciences).

    http://www.icsi.berkeley.edu/icsi/projects/ai/metanet

  3. […] phoneme is an abstract linguistic invention, while words are more concrete phenomena … see my previous post), the “phonetic” variation of word pronunciation (as in poteito vs. potato, or the […]
