Phonetics and Linguistics
University College London
Computers and robots in science fiction books and films seem to engage effortlessly in conversations with their human masters. They understand what is asked of them and their replies make sense. But does science fiction give us a realistic impression of what it would be like to converse with a machine? What aspects of robot language are not only beyond current technology but also beyond common sense? How close is current technology to the linguistic skills of HAL or C3PO? This essay critically analyses examples drawn from science fiction to expose the real problems of building a conversational machine. Is the gap between fictional systems and current technology so large that we'll never be able to program a computer to understand human language?
Winston thought for a moment, then pulled the speak-write towards him and began dictating in Big Brother’s familiar style: a style at once military and pedantic, and, because of a trick of asking questions and then promptly answering them [..], easy to imitate.
George Orwell, Nineteen Eighty-Four, 1949.
Orwell’s “speak-write” is what we now know as a dictation system, a mechanism for converting spoken words to printed text without the need for a keyboard. The dictation systems of today have much the capability expressed in Orwell's dystopian vision, and in fact are based on a famous paper published in 1983! A modern dictation system is not, however, a specialised hardware device as Orwell imagined, but packaged software that runs on a general-purpose digital computer. Such a program transcribes what you dictate into a sequence of words entered into a word processor; it allows you to add punctuation and set formatting, and even checks your spelling and grammar. Dictation systems constitute, as we shall see, one of the few success stories of speech and language technology.
A modern dictation system has a very large vocabulary and allows you to dictate in short, continuously-spoken phrases; it also understands editing commands so that it is possible to create, edit, print and save a document purely by voice. Such systems usually require an enrolment session where you must first read out a few hundred sentences so that your pronunciation can be modelled. Systems also benefit from a study of your existing documents to analyse your vocabulary usage and writing style. In operation, dictation systems learn from their mistakes and so require you to correct them as you work. The performance of such a system depends in part on how well you learn to modify your speaking style to the expectations of the system. A dictation system prefers consistent articulation of speech, with only small variations in speaking rate or voice quality. Well articulated speech is not necessarily easier for the system to understand, because it is further from the average speaking style it was trained on in the factory. Also performance is strongly affected by how well the words spoken match the system’s internal statistical model of word occurrence; unusual word sequences are more likely to be mis-recognised. A motivated, co-operative user can obtain less than 5% word error on the first attempt at transcription of a sentence, and even after correction can obtain a word transcription speed greater than most people can type.
What are the weaknesses of this technology? Firstly, these systems are optimised for the specific task of converting short phrases into transcribed words. Their specialisations include the use of a document/editing language model, their requirement for on-line correction, and the way in which the dictation process requires visual feedback. Removing any of these will significantly affect performance. Just removing the facility for users to report recognition errors will prevent the system learning from its mistakes and lead to an inability to track changes over time in either the speaker’s voice or the content of documents. However the most significant problem with this technology is that at no time does the system have any understanding of what the text being transcribed is actually about. This lack of awareness of topic or reference often leads to bizarre interpretations of utterances when recognition fails. You might produce a sentence which seems clearly articulated, comprehensible and contextually relevant, but the system types a word string containing the most odd and irrelevant vocabulary. A human secretary would not do this; it is more likely that he or she would ask for clarification or substitute an alternative sentence that does make sense. Neither of these possibilities is currently available to dictation system programs. The lack of an ability to comprehend the text being dictated causes other problems. Whereas a human secretary can be told to “go back to the part where I was talking about previous projects and insert a standard paragraph about the company history”, a dictation system has no model of the content of the document above the level of the words in the text. It cannot locate a paragraph on the basis of what it is about; it cannot make changes to the text based on its meaning.
[Gay Deceiver is a robot vehicle programmed by voice]
".. Hello Gay."
"Analyze latest program execute-coded Gay Deceiver take us home. Report."
Deety sighed. "Typing a program is easier. New program."
"Execute-code new-permanent program. Gay Deceiver, countermarch! At new execute-code, repeat reversed in real time latest sequence inertials transitions translations rotations before last use of program execute-code Gay Deceiver take us home."
"New permanent program accepted."
Robert Heinlein, The Number of the Beast, Fawcett Columbine, 1980.
It is not surprising that in fiction, computers recognise what people say without error. It would be tremendously dull to read about someone battling to get a computer to recognise his voice. Likewise a computer so stupid that it mis-recognised commands would not make a convincing antagonist in a movie plot. However in reality, computers mis-recognise speech all the time. Despite enormous engineering efforts in signal processing and pattern recognition, current speech recognition systems are often ten times worse than humans at recognising what somebody is saying. The advantage of dictation systems is that they are tuned to one person’s voice, one style of speaking, one accent, one person’s language use, one quiet environment. But relax any of those assumptions and recognition gets worse: if the system doesn’t know who is speaking, if the speaker changes his/her style of speaking, if the speaker has a different accent to the one the computer was trained on, if the topic of conversation changes, if the speech is produced under varying background noise conditions. A speech recognition algorithm that has perhaps 5% word error on the read speech of a known speaker in a quiet place suddenly generates 50% word error on spontaneous speech of an unknown speaker over a telephone. It is difficult to say exactly what causes the weakness in current recognition approaches. It is likely to be a combination of a number of assumptions and approximations in the current models. For example, current systems make assumptions about the ways in which the pronunciation of a word can vary which are probably not correct; they also use a probabilistic model of the sounds of speech that appears overly simplistic. My own prejudice is that current systems have too much ‘built-in’ knowledge of the problem: constraints and structures imposed by their designers. Systems would be better if they were freer to develop their own internal representations of the problem.
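Word error figures like those quoted above are conventionally obtained by aligning the recognised words against a reference transcript and counting the substitutions, deletions and insertions needed to turn one into the other. The Python sketch below shows one way this could be computed; the function name and the example sentences are illustrative assumptions, not taken from any particular recognition system.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical dictation and mis-recognition, for illustration only.
    spoken = "recognise speech with a computer"
    typed = "wreck a nice beach with a computer"
    print(f"word error rate: {word_error_rate(spoken, typed):.0%}")
```

Because insertions are counted, the rate can exceed 100% when the recogniser produces many more words than were spoken.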
INTERIOR: REBEL BLOCKADE RUNNER -- SUBHALLWAY.
Artoo stops before the small hatch of an emergency lifepod. He snaps the seal on the main latch and a red warning light begins to flash. The stubby astro-robot works his way into the cramped four-man pod.
C3PO: Hey, you're not permitted in there. It's restricted. You'll be deactivated for sure.
Artoo beeps something to him.
C3PO: Don't call me a mindless philosopher, you overweight glob of grease! Now come out before somebody sees you.
Artoo whistles something at his reluctant friend regarding the mission he is about to perform.
C3PO: Secret mission? What plans? What are you talking about? I'm not getting in there!
Artoo isn't happy with Threepio's stubbornness, and he beeps and twangs angrily.
A new explosion, this time very close, sends dust and debris through the narrow subhallway. Flames lick at Threepio and, after a flurry of electronic swearing from Artoo, the lanky robot jumps into the lifepod.
C3PO: I'm going to regret this.
George Lucas, STAR WARS Episode IV A NEW HOPE, 1976
In the film Star Wars, two of the main characters are robots. R2D2 is a technical maintenance robot that is shaped like a squat cylinder, while C3PO is a ‘protocol droid’ with a humanoid shape. R2D2’s forte is dealing with electronic and mechanical machinery, and although it can understand human speech it can only speak in beeps and whistles. C3PO is designed for working with humans and can both perceive and produce human speech as well as “over six million forms of communication”.
C3PO was voiced by the actor Anthony Daniels with no noticeable electronic manipulation to make his speech more mechanical. Daniels’s C3PO has a style of speaking that makes him seem rather scatty and nervous, but it is fundamentally a human voice. C3PO’s use of spoken human language is, of course, essential to the plot. If C3PO communicated with R2D2 in beeps and whistles, we wouldn’t know about the “secret mission” in the scene above. But to be intelligible to us, the movie audience, C3PO has to communicate with a series of sounds that simulate the sounds humans make with their vocal tracts, and in particular with the sound coding system belonging to English. The more you think about this, the more ridiculous it appears. Here are two robots communicating in a mixture of two languages, one of which is copied from the language used by a species of humans thousands of years distant in another galaxy! Neither robot has a vocal tract nor the limitations of human hearing; why use the complexity and ambiguity of human speech when the information they need to convey to each other would be better encoded, encrypted and sent by wireless network?
My computer rarely communicates with me in speech; in fact it rarely communicates with me in human language at all. The English I see in programs and messages consists of stored character strings that are neither generated nor interpreted by the computer. The nearest my computer gets to communicating with me in English is when it corrects my spelling and suggests changes to my use of English grammar. At least here its output is prompted by a linguistic level of analysis of my interactions. Most computers communicate visually through graphics (windows, cursors, buttons, icons) or through lists and tables (folders, menus). It would not necessarily be more convenient for it to use speech, unless I was visually impaired. The utility of speech is in fact limited to certain types of communication situation, and really only when one of the participants is human.
However, if a computer system chooses to use the speech modality to communicate, then it must use that modality in an explicit emulation of how humans use that modality. If it were to use a different sound code or language, then the human would first have to learn the language of the machine; and if the system used microwaves or ultrasound, the human couldn’t perceive it at all. A robot doesn’t necessarily have a vocal tract, but it must make sounds as if they had been generated by a vocal tract because human speech perception is tuned to perceive sounds generated that way. The particular sounds we use to encode the identity of a word in speech are not arbitrary but a collection of sounds that we can make with our own vocal apparatus. Most languages of the world have sounds like /b/ and /n/ and /i/ because they are easy to produce and relatively distinct. The precise form of the sounds we use in speech is strongly influenced by our specific anatomy and neurophysiology. Our nearest animal cousins, chimpanzees, have a similar vocal tract to us but are unable to control it in a way that allows them to make speech-like vocalisations, even as mimicry. But not only is the acoustic form of each individual sound affected by the form of the human vocal tract, in connected utterances the way in which one sound merges into the next depends upon the dynamics of the articulators in the vocal tract: the rate at which the tongue, jaw, lips and soft palate can move, and their tendency to undershoot their targets. This kind of variation in the acoustic form of speech sounds is not a problem for our perceptual system because we expect that variation to occur. That is we expect that speech will be affected by limitations in the speed and precision of articulations. But a robot without a vocal tract is not subject to these limitations – it has much more freedom in the range and dynamics of sounds. 
The problem is that this freedom makes the speech sound "not human" and actually harder to process. To talk like a human, then, means that our robot will need to have the limitations of a human vocal tract.
You might think of three ways in which a robot could obtain the use of a human vocal tract: it might be provided with a hardware mechanism which copied the shape and operation of a human vocal tract; it might be provided with a software simulation of a human vocal tract; or it might use a model of the acoustics of a human vocal tract. This last way allows the computer to make speech-like sounds but not in a human-like way. All of these have been tried, but none produces connected speech that sounds as if it could have been produced by a human being. The problem is that there are inadequacies in the knowledge we have of vocal tract use and its acoustic properties, as well as computational problems in simulating its aerodynamics. Fortunately there is a fourth approach, which has only recently come to prominence, and that is for the robot to use a real human vocal tract! Surprisingly, the most successful modern systems just use recordings of actual human voices. These recordings are indexed, cut up and glued together to generate new utterances. The use of real recordings ensures that the voice has fairly natural characteristics even though the assembly process adds a somewhat unnatural prosody. This is currently the best way for a computer to obtain use of a human vocal tract to speak. It reminds us that in fact C3PO only speaks through Anthony Daniels’s vocal tract.
The first generation of computers had received their inputs through glorified typewriter keyboards, and had replied through high-speed printers and visual displays. HAL could do this when necessary, but most of his communications with his shipmates was by means of the spoken word. Poole and Bowman could talk to HAL as if he were a human being, and he would reply in the perfect, idiomatic English he had learned during the fleeting weeks of his electronic childhood.
Arthur C Clarke, 2001: A Space Odyssey, 1968
But should a computer sound exactly like a human being anyway? While it may be true that natural human speech is optimal in terms of the ease with which humans can process it, that is not to say that approximations to human speech cannot be made quite intelligible. In particular, models of the acoustics of vocal tracts (so-called formant synthesizers) can produce highly intelligible speech although they often have a machine-like character.
There is a real problem with giving a robot a human-like voice, which is that we then make all kinds of assumptions about its mental capabilities. A machine that sounds like a human leads us to believe that we are listening to a conscious entity that has its own thoughts, desires, intentions, emotions and so on. With HAL this is exactly what we are meant to infer. HAL’s rather flat, cold creepy voice is part of the structure of the film: it leads us into believing that a computer can become mentally unstable and ultimately a killer. This inference about the capabilities of the machine is dangerous because machines will not have the same outlook on the world as humans. They won’t have the same drives, the same understanding. They may not have emotions. It seems pretty unlikely that they will be conscious of their behaviour in exactly the same way that humans seem to be. Anyway, if they were conscious why would they want to speak with the voice of a different species beyond a minimal requirement of intelligibility?
Another peculiar aspect of robot speech is that the changes to the voice and style of speaking used by a machine to convey emotion have also been copied from humans. But the changes in human voices that reflect boredom, fear, tiredness or elation are surely caused by the cognitive and physiological state of the person. Even a machine with emotions (whatever that means) won't have human-like physiological reaction to those emotions, and thus has no reason to simulate the effect of human physiology on its speech unless it is trying to manipulate the emotions of human listeners. Similarly, the use of a harsh and monotonous robot voice quality in SF films is for our benefit, not the robot's. An emotionless robot needn't have a flat pitch or an over-regular rhythm; speech from a software synthesizer needn't sound as if it came through a pipe or suffered amplitude distortion. In an SF film, the use of human emotional cues by machines is just a simple way in which the mental state of the machine can be communicated to the audience. But if the machine actually was unstable, or had evil intentions, there is no reason why this should affect its speech in such a way. The use of human voice attributes is actually more confusing – it leads us to incorrect inferences about the mental state of the machine. Of course we don’t yet know what an authentic voice of a machine would be like. Until then we will continue to apply the same judgements to a machine voice as we do to human voices; and we will continue to make invalid inferences about its intelligence, understanding, motivation and emotional state.
When C3PO or HAL talk, they do so with perfectly judged expressive speech, with fluent articulation and natural prosody. They are not “reading a script” like an automated station announcement system. They appear to understand what they say. Their communication has specific goals. They appear to want those goals to be achieved. However it is only recent research that has studied how underlying concepts can be mapped onto expressive speech. Most contemporary speech output systems simply convert text to speech.
Modern text-to-speech systems are able to produce highly intelligible speech automatically from input text. A typical system will process input text through a number of stages: (i) pre-processing to convert abbreviations, acronyms and numbers into words, (ii) phrasing to split the text up into prosodic phrases, (iii) intonation marking to identify accented words, (iv) pronunciation of words, (v) timing of phonetic elements, and (vi) signal generation. Depending on the signal generation method, the output speech can have a robotic or a human voice quality. However, in either case, the output speech is still recognisably that of a machine since it often has an unnatural rhythm and intonation. Such systems also tend to operate poorly on highly structured material such as tables, lists or formulae. The speech output of text-to-speech systems sounds like the read speech of someone who doesn’t understand the material.
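To make stage (i) concrete, here is a toy Python sketch of text pre-processing. The abbreviation table, the function names and the choice to read numbers digit by digit are all simplifying assumptions made for illustration; a real system would need far larger tables and would have to decide, for example, whether "St." means "Street" or "Saint", which is exactly the kind of decision that needs knowledge beyond the text.

```python
import re

# Hypothetical, deliberately tiny lookup tables.
ABBREVIATIONS = {
    "Dr.": "Doctor",
    "St.": "Street",   # could equally be "Saint"; the table cannot tell
    "approx.": "approximately",
}
DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spell_out_digits(match: re.Match) -> str:
    """Read a run of digits one digit at a time, e.g. '42' -> 'four two'."""
    return " ".join(DIGIT_NAMES[int(digit)] for digit in match.group())


def normalise(text: str) -> str:
    """Expand known abbreviations, then replace digit runs with words."""
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return re.sub(r"[0-9]+", spell_out_digits, " ".join(words))


if __name__ == "__main__":
    print(normalise("Dr. Smith lives at 42 Mill St."))
```

Even this first stage shows the pattern that recurs through the later ones: the rules work on the surface form of each token, with no access to what the sentence is about.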
The difficulty of assigning a suitable rhythm and intonation in text-to-speech conversion is mainly due to the fact that such assignment relies on more knowledge about the text than is present in the text itself. The prosodic properties of a sentence relate to the overall meaning and function of the sentence, and this can’t be deduced by processing each word in the sentence in isolation. The word ‘toe’ in the phrase “the toe nails on the other hand” must carry the main accent, otherwise the phrase doesn’t make sense. A system cannot get the prosody of such a sentence correct without understanding its meaning. Notice that the meaning and function of the sentence are not available to the TTS system because the sentence was not generated by the system itself in order to communicate. As we shall discuss below, the extraction of meaning from text is far from understood. The very information the system needs to generate adequate prosody is denied it when it reads from text.
Both C3PO and HAL have the advantage that when they speak they have ‘made up’ the sentences themselves. They have access to the reasons why each sentence needs to be generated and what it means. They know for example which information is old, which is new, which is important, which contradicts previous knowledge. This allows them to signal through prosody the information structure of the sentence, to put focus on the most important elements. Once again, however, the rules they use about how to signal this structure can only be copied from the rules used by humans to do the same task, otherwise the human listeners would not understand the result.
HAL: I’m sorry Frank. I think you missed it. Queen to bishop 3, bishop takes queen, knight takes bishop, mate.
HAL: We can certainly afford to be out of communication for the short time it will take to replace it.
HAL: I honestly think you ought to sit down calmly, take a stress pill, and think things over.
Arthur C Clarke and Stanley Kubrick, 2001: A Space Odyssey (Screenplay), 1968.
It is easy to forget that computers and robots in SF films are actually conversing in what is, to them, an alien language. Not only can they recognise what people say and generate speech that we can recognise, they can analyse the meaning of sentences, and can generate sentences that convey specific meanings as well as humans. But how the meaning of utterances relates to the actual sequence of words used is far from simple. The syntax and semantics of natural languages have been studied for decades without producing a definitive set of computational rules that map words to meanings or vice versa. Consider the three sentences spoken by HAL, above. Each of these sentences uses the verb take and in each of them take has a different meaning. In the first sentence “knight takes bishop”, take means capture or get possession of. In the second sentence “time it will take”, take refers to a defined period of time over which something is performed. In the third sentence “take a stress pill”, take refers to ingestion. The word take has a large number of other possible meanings. My Webster’s dictionary lists over 50 related senses. Thus knowing that a sentence contains take is not enough; a machine also needs to be able to determine which meaning of take is being used. Word ambiguity is a significant problem in finding the meaning of a sentence. Each ambiguous word in an utterance leads to a multiplicative increase in the number of possible meanings of the utterance as a whole. And furthermore, it is not the case that the machine can choose between them on the basis of sense or likelihood alone. The resolution of ambiguity actually requires knowledge about the world – knowledge of what the words in the utterance relate to.
A commonly quoted example of these two issues is shown by the sentence “time flies like an arrow”, which has been shown to have a hundred or more possible meanings! The difficulties arise because of the many possible individual meanings of the words. The word ‘time’ can be a noun (the thing time), an adjective (a type of fly) or a verb (a command to measure duration). The word ‘flies’ can be a noun (some insects or parts of men’s trousers) or a verb (movement through the air). The word ‘like’ can mean ‘similar-to’ or it can mean ‘fondness for’. So how are we able to decide that the reading of this utterance should be that “the nature of time is that it travels rapidly in a straight line” rather than “there are certain kinds of flying insects that demonstrate a fondness for arrows”, or any one of the many other possible meanings? The knowledge needed to pick the most favoured interpretation is not to be found in the utterance itself. The ability to choose between interpretations comes from an understanding we have about objects in the world; our understanding of time, flying and arrows. But this understanding has to be deeper than a logical statement about the semantics of individual words. Consider the two sentences “I saw the man with the telescope” and “I saw the man with the apple”. Even ignoring the multiple meanings of “saw”, both sentences are ambiguous with respect to whether the phrase "with X" refers to the man or the act of seeing. Was the man carrying a telescope/apple or did I see the man using a telescope/apple? To choose the right interpretation in each case requires more knowledge than what a telescope is or what an apple is; it requires knowledge of what telescopes can be used for and what apples can be used for. The sentences themselves do not help us; we need to know facts about the world that these sentences describe.
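The way word-level ambiguity compounds can be illustrated with a little arithmetic: if each word is assigned a sense independently, the number of candidate readings is the product of the number of senses per word. The Python sketch below uses only the handful of senses mentioned above, so the sense table is an illustrative assumption and its total is much smaller than a full dictionary would give; it also counts raw combinations before any grammatical filtering.

```python
from itertools import product
from math import prod

# Illustrative sense inventory taken from the discussion above, not a real lexicon.
SENSES = {
    "time": ["noun: duration", "adjective: kind of fly", "verb: measure duration"],
    "flies": ["noun: insects", "noun: trouser fastening", "verb: moves through the air"],
    "like": ["similar to", "is fond of"],
    "an": ["article"],
    "arrow": ["noun: projectile"],
}


def count_readings(words: list[str]) -> int:
    """Number of sense combinations: the product of per-word sense counts."""
    return prod(len(SENSES[word]) for word in words)


def enumerate_readings(words: list[str]) -> list[tuple[str, ...]]:
    """Every combination of one sense per word, in word order."""
    return list(product(*(SENSES[word] for word in words)))


if __name__ == "__main__":
    sentence = "time flies like an arrow".split()
    print(count_readings(sentence), "candidate sense combinations")
    for reading in enumerate_readings(sentence)[:3]:
        print(reading)
```

Adding a single extra sense to any one word multiplies the total, which is why realistic dictionaries with dozens of senses per word produce so many candidate readings.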
Giving a computer or robot knowledge about the meaning of words so that they can analyse the meaning of sentences is itself fraught with difficulty. I'll describe two particular classes of problem: grounding and human perspective.
A dictionary defines words in terms of their relations to other words. Thus from a dictionary it is possible to create a kind of web of relationships, which for example relate seeing to eyes to perception to knowledge; or which relate apples to fruit to food. Similarly a corpus of language describes likely combinations of words in sentences, and the collocations of ideas expressed in those sentences might tell you more about the ways in which the words are used. A corpus might tell you that seeing and telescopes collocate more frequently than seeing and apples. Thus there seems some hope for linguistic analysis to lead to a resolution of the ambiguity in "I saw the man with the apple". But in general, such types of linguistic analysis do not add to the knowledge of the system about the world. They simply add to knowledge about the statistics of words. The words themselves remain abstract symbols that do not relate to real objects or events in the real world. This is known as the "symbol-grounding problem", and it comes about because we are trying to learn about language from observation rather than by taking part in linguistic exchange. The only way in which a robot can know what "apple" refers to is if that robot has also developed a concept of an apple and made the link between the spoken name and the concept. Thus it appears that the only way for a machine to use language as humans do would be for the machine to learn language itself, not be given large quantities of digested text.
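The kind of corpus statistic described above can be sketched in a few lines of Python. The six-sentence corpus below is invented purely for illustration, and counting co-occurrence within a sentence is only one of several possible definitions of collocation.

```python
from collections import Counter
from itertools import combinations

# A made-up toy corpus; a real study would use millions of sentences.
CORPUS = [
    "she saw the comet with a telescope",
    "he saw the moon through the telescope",
    "the astronomer saw a planet with her telescope",
    "the boy saw an apple on the table",
    "the man ate the apple",
    "an apple fell from the tree",
]


def cooccurrence_counts(sentences: list[str]) -> Counter:
    """Count, for each unordered word pair, the sentences containing both words."""
    counts: Counter = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.split()))
        counts.update(combinations(words, 2))  # pairs come out alphabetically ordered
    return counts


def pair_count(counts: Counter, first: str, second: str) -> int:
    """Look up a pair regardless of the order the two words are given in."""
    return counts[tuple(sorted((first, second)))]


if __name__ == "__main__":
    counts = cooccurrence_counts(CORPUS)
    print("saw + telescope:", pair_count(counts, "saw", "telescope"))
    print("saw + apple:", pair_count(counts, "saw", "apple"))
```

The counts favour the reading in which the telescope is the instrument of seeing, yet the program has learned nothing about what a telescope is: the strings "telescope" and "apple" could be swapped for arbitrary tokens without changing how it works.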
But even if a robot could learn language, this is not to say that its interpretation of language would match that of a human being. The problem is that so much human language is built from a human perspective of the world. When we use "I see" to mean "I know" we are just reflecting how much of our understanding of the world comes through our visual sense. The words we use for colours, sounds, tastes, smells and textures are affected by our peculiar limitations in seeing, hearing, tasting, smelling and touching. Perhaps even the way in which we catalogue and categorise our experiences is affected by our human limitations of perception and memory. The complexity of the ideas we can comprehend, or the complexity of the sentences we can understand may be limited by the size of working memory. Understanding language may require us to understand the motivations and desires of other human beings. All these things suggest that to really understand human language the way that humans do requires a machine to have a knowledge of the human perspective.
We are a very long way from programming systems of such complexity. However progress has been made on understanding human language within limited domains. For example, there have been numerous attempts at building natural language interfaces to commercial computer database products. Modern databases store information in cross-linked tables and use a query language called SQL to let users locate information (rows, columns, cells, summaries) according to supplied criteria. Since SQL has an artificial syntax, many users find it difficult to learn. A tool that could convert everyday questions into SQL could thus make the information in the database available to untrained users. Thus a user may ask: “What percentage of American customers bought an electrical product?” The problem in making such a translator is that it must relate the meanings of words in the question to the information present in the database. Not only must the translator realise that an American customer means “customer-country=USA”, and that calculating a percentage involves finding both a count and a total, the translator must also know that customers are the kind of object that buy products. On the whole these systems are a collection of special tricks for dealing with particular phrases (“What percentage …”) with techniques for parsing noun phrases using information drawn from the database definition. Users find such systems very valuable, but tellingly adjust the way in which they construct natural language queries to the capabilities of the system. The users rapidly learn to avoid constructions the system is unable to process correctly. Systems can also feed back to the user information about the translation into SQL, either as SQL directly or as a natural language paraphrase. This helps users learn what range of questions they are able to ask.
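To show what the target of such a translation might look like, the Python sketch below runs one possible SQL rendering of the example question against a small in-memory SQLite database. The table names, column names, category labels and sample rows are all hypothetical; a real translator would have to discover the equivalent structure from the database definition it is given.

```python
import sqlite3

# Hypothetical schema: who the customers are, what the products are, who bought what.
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE products  (id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE purchases (customer_id INTEGER, product_id INTEGER);
"""

# "What percentage of American customers bought an electrical product?"
# The count is the US customers with at least one electrical purchase;
# the total is all US customers.
QUERY = """
SELECT 100.0 * COUNT(DISTINCT CASE WHEN p.category = 'electrical' THEN c.id END)
             / COUNT(DISTINCT c.id)
FROM customers AS c
LEFT JOIN purchases AS b ON b.customer_id = c.id
LEFT JOIN products  AS p ON p.id = b.product_id
WHERE c.country = 'USA';
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database populated with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [(1, "USA"), (2, "USA"), (3, "USA"), (4, "USA"), (5, "UK")])
    conn.executemany("INSERT INTO products VALUES (?, ?)",
                     [(1, "electrical"), (2, "clothing")])
    conn.executemany("INSERT INTO purchases VALUES (?, ?)",
                     [(1, 1), (1, 2), (2, 2), (3, 1), (5, 1)])
    return conn


def percentage_us_electrical(conn: sqlite3.Connection) -> float:
    """Run the translated query and return the single percentage value."""
    return conn.execute(QUERY).fetchone()[0]


if __name__ == "__main__":
    print(percentage_us_electrical(build_demo_db()))
```

Note how much of the question's meaning has been settled by hand here: that "American" selects on a country column, that "bought" is recorded in a purchases table, and that a percentage needs both a filtered count and a total.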
The weakness of such technology for building a conversational machine is the fact that the system only attempts to understand questions about a given database. The system searches for an interpretation of the question under the assumption that it is about the information in a database. Although it is able to present its translation as a formal expression in SQL, it does not of course have an understanding of any of the symbols it uses. It may know that a particular column expresses the colour of a product, but it has no knowledge of what a colour is. This means it is unable to reflect on the meaning of any of the attributes in the database, as a question such as “which product versions have a reddish colour?” would require. Another problem is that such a system does not scale to larger applications. As the number of tables, field names and types increases, the potential for ambiguity also increases.
Baley had sat down during the course of his last speech and now he tried to rise again, but a combination of weariness and the depth of the chair defeated him. He held out his hand petulantly. ‘Give me a hand, will you, Daneel?’
Daneel stared at his own hand. ‘I beg your pardon, Partner Elijah?’
Baley silently swore at the other’s literal mind and said, ‘Help me out of the chair.’
Isaac Asimov, The Naked Sun, 1957
Conversational machines in SF are surprisingly good at understanding speech spoken by human beings. By this I don’t just mean good at recognising the words that were spoken, nor good at working out the meaning of utterances, though both are true, but also good at determining what the speaker wants from the communication.
However, in this example Robot Daneel Olivaw, a humanoid robot indistinguishable in appearance from a human being, seems to be having a problem understanding what Elijah Baley wants. The problem the robot has is not in recognising the words, nor in calculating the meaning of the words, but in working out why Baley would have said those words. It is just a characteristic of human communication that we don't always encode information or requests for information in the simplest way. Baley could have said "Robot, help me stand up", which would have been clearly a request for help directed at the robot. Such direct requests between humans are considered impolite, so someone might say instead "Could you help me stand up?" which is, taken literally, not a request for help at all but a question about the availability of help. The phrase "Give me a hand, will you?" is one more stage removed, where even the point of the request is missing.
This additional processing of utterances beyond their literal meaning is called pragmatics. Listeners ask themselves not just what is the meaning of an utterance, but why would the speaker have spoken such an utterance? The listener needs to consider both its meaning and its implications. The problem, however, is how a robot, with very different concerns about the world from those of a human being, can appreciate the circumstances under which a human might have chosen to speak a particular utterance. In general a single utterance can have many implicatures (potential implications), and choosing between them is not easy. A man saying “this shirt is frayed” may be wanting someone to bring him a new shirt, or suggesting that part of the household budget should be directed to the purchase of new shirts, or indicating that he feels that he is not being cared for. What is probably not being communicated is the simple fact that his shirt is frayed! A robot servant that replied "OK, fact stored" would not be doing its job. But to really understand the implications of the statement requires that the robot knows about things like: frayed shirts are unsightly, frayed shirts give an impression of poverty, frayed shirts may indicate that the wearer is a bachelor, this man is concerned that not enough money is being spent on his clothes, this man is unhappy that no-one cares about him enough to be concerned for his appearance. And these implicatures themselves have other implications: why is he worried about his appearance? Why hasn't he done anything about it himself? Our robot seems to need to know a lot about the world and human motivations and desires just to understand a single simple utterance.
While significant progress has been made in speech recognition (witness the availability of dictation systems) and some progress has been made in the linguistic analysis of sentences (used in natural-language interfaces to computer systems), much less progress has been made in a computational theory of pragmatics. It is easy to see why. The working out of the implications of an utterance needs knowledge about the world beyond that represented in the sentence. This is not knowledge that is available by an algorithm applied to a series of sentences, but it is knowledge that must either be learned or programmed separately. Our robot needs to know enough about the context, about the real facts of the situation within which the utterance was made. It is an issue about world knowledge, not an issue about language. Such knowledge just doesn't exist in a form that can be programmed into a computer, nor do our computers yet have the ability to learn such facts from their own experience.
Attempts have been made to build systems for interacting with people in a conversational manner. Chatterbots are computer programs that engage in simulated conversation with a user through typed interactions. The aim of a chatterbot is to fool users into thinking that they are interacting with another human being, following the idea of Turing’s “imitation game”. The development of chatterbots (or chatbots) stems from the growth in Internet chat rooms, where users, often complete strangers, type messages to each other. Chat rooms were seen as an opportunity to build programs that could take part in the conversation. Each year there is a competition to choose the best chatterbot, called the Loebner prize. For the prize, human judges attempt conversations with the programs, and rate them on a scale of human-ness.
The general structure of chatterbots seems very simple, and on the whole has not changed since the earliest conversational programs like Eliza. Conversations are treated as stimulus-response pairs, where a remark by the human triggers a potentially relevant output from the computer. Many tricks are used to fool users into thinking that the program understands what they are saying. For example, if the user asks ‘Who is the president of Brazil?’ the system may reply ‘I don’t know, do you?’ By trying out the program with many users, the system designers can see what the common forms of interaction are and build in pattern-matching machinery to convert a question into an answer. The largest chatterbots have tens of thousands of pattern-matching rules. More intelligence can be simulated by letting the chatterbot learn from the interactions, and by letting users refer back to previous turns. These functions require a modest amount of linguistic processing to extract and store information and to know how to deal with anaphora (using a word like ‘it’ to refer back to something discussed earlier).
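The stimulus-response idea is easy to sketch. The few rules below are invented for illustration and are not taken from Eliza, Alice or any other actual chatterbot; `RULES`, `FALLBACK` and `reply` are hypothetical names. Each rule pairs a pattern with a response template, and the first pattern that matches the user's remark produces the reply.

```python
import re

# Hypothetical rule table of (pattern, response template) pairs.
# A real chatterbot would have thousands of these.
RULES = [
    (re.compile(r"^who is (.+?)\??$", re.I), "I don't know, do you?"),
    (re.compile(r"^i feel (.+?)\.?$", re.I), "Why do you feel {0}?"),
    (re.compile(r"^my name is (\w+)", re.I), "Nice to meet you, {0}."),
]
FALLBACK = "Tell me more."


def reply(remark: str) -> str:
    """Return the response of the first rule whose pattern matches."""
    text = remark.strip()
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Echo fragments of the user's own words back in the reply.
            return template.format(*match.groups())
    return FALLBACK  # a stock phrase when nothing matches


print(reply("Who is the president of Brazil?"))
print(reply("I feel tired."))
print(reply("The weather is nice"))
```

Nothing in the program represents who a president is or what tiredness feels like; the appearance of understanding comes entirely from echoing the user's words inside fixed phrases.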
Surprisingly, it seems that many users are prepared to ‘suspend their disbelief’ and treat these programs as actual conversational partners, even confiding personal information to the programs. However any extended conversation on a topic rapidly shows that the program has no real knowledge of any subject and no understanding of what a conversation is about. Conversations rapidly deteriorate into trivial, contentless exchanges of fixed phrases.
Chatterbots are designed to hold conversations but in fact only simulate the use of words in conversations. There is no exchange of information, no co-operative problem solving, persuasion, argument, agreement, consolation, mutual regard, or any other non-trivial aspect of human-human conversation.
Dr. Calvin said softly, "How are you, Brain?"
The Brain's voice was high-pitched and enthusiastic, "Swell, Miss Susan. You're going to ask me something I can tell. You always have a book in your hand when you're going to ask me something."
Dr. Calvin smiled mildly, "Well, you're right, but not just yet. This is going to be a question. It will be so complicated we're going to give it to you in writing. But not just yet. I think I'll talk to you first."
"All right. I don't mind talking."
Isaac Asimov, Escape!, Astounding Science Fiction, 1945.
So far we have considered four aspects of language technology: speech input, speech output, linguistic analysis and pragmatic analysis. But how do these fit together to create a conversational machine? A fifth aspect must be how the system demonstrates its understanding through behaviours triggered by the act of communication. A machine that sat and hummed while you spoke to it would not seem to be understanding. Of course one way in which understanding can be shown is for the machine to take the appropriate action: to obey a command to take some physical action, to speak a message that is a coherent response to a request for information, to explicitly acknowledge that a request is understood. But there are other properties of human-human dialogue that a machine might do well to imitate.
Firstly, human participants in a dialogue use various mechanisms to demonstrate, while the other person is speaking, that they are following the argument, that they are interested, that they want to hear more. This kind of feedback can involve things like eye contact, head movement, body posture, or vocalisations like "Mm", "I see", "Uh-huh". These activities help control the communication process; they are a way of communicating meta-level information about the success of the dialogue.
A defining characteristic of dialogue is turn taking; but even here it is not always obvious when one person should stop speaking, or when the other should start. Normal dialogue is not like half-duplex radio communication where the participants say "Over" when they want to surrender control to the other party. Surprisingly it is not even necessary for one person to pause before the other person speaks – dialogue often has many instances of simultaneous overlapping speech. The cues that a speaker gives to say they are willing to give the other person a chance to speak are not just when an utterance reaches its logical conclusion. Sometimes turns can be suggested by phonetic changes in pitch or speaking rate.
Part of the problem of analysing turn taking is that it is built on a relatively ill-researched area of co-operative problem-solving. A dialogue is an interactive language game that has its own conventions and rules. The rules aid the participants in making sense of the utterances, just as knowledge of the world helps a listener understand the implications of a sentence. There have been suggestions for conventions like: quality, quantity, and relevance. The listener should assume that speakers produce utterances based on what they believe to be true, that they are providing only the significant information, that what they say is relevant to the argument. Such rules can be broken to achieve certain effects, for example in “damning with faint praise” a true but understated fact has wider implications, as in “That student was always very punctual” – implying that he had few other worthy attributes.
There has been little progress on the computational modelling of these subtle aspects of dialogue. Most existing man-machine dialogue systems rarely rise above a question-answer format. Take speech enquiry systems as an example. Speech enquiry systems comprise a speech recognition system to convert speech signals into word strings, a speech synthesis system to convert text to speech, a database system to hold information required by the user, and a component called a ‘dialogue manager’ that mediates between the components to create a system that can answer queries posed by voice. Typical applications would be for flight enquiries, account enquiries or cinema ticket booking. Most of these systems have been designed to answer queries posed by the general public over the telephone and this makes severe demands on the technology. For example the speech recognition component must deal with a significantly poorer quality input signal from previously unknown speakers. There is no opportunity to enrol the speaker or to learn from mis-recognitions. One way to compensate is to restrict the system to a very narrow domain of discourse. Thus any given enquiry system can only deal with spoken language about a particular task. These restrictions take effect throughout the system: firstly the recognition system can be limited to certain words and a simple grammar; secondly the dialogue manager may have a fixed sequence of interactions; thirdly the synthesis system may only need a finite vocabulary of phrases. While users stay within the restrictions and expectations of the system, performance is adequate for simple tasks like flight times or ticket reservation.
The systems are however rather fragile, and when users stray outside the capabilities of the system (by using words outside the vocabulary, by being disfluent or unclear, by speaking over background noise, by speaking over the system’s prompts, by failing to answer questions with just the information expected, etc) communication can rapidly break down. Such systems work best when the computer takes charge of the dialogue, asks simple questions and gets direct answers. This requires that users themselves have to be aware of the limitations of the system.
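A system-led dialogue of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not the design of any deployed enquiry system: the flight-enquiry slots, the tiny vocabularies and the names `SLOTS` and `run_dialogue` are all hypothetical, and the speech recogniser is replaced by a list of already-recognised word strings.

```python
# Hypothetical flight-enquiry task: each slot has a fixed prompt and a
# closed vocabulary of acceptable answers.
CITIES = {"london", "paris", "madrid"}
SLOTS = [
    ("origin", "Where are you flying from?", CITIES),
    ("destination", "Where are you flying to?", CITIES),
    ("day", "On which day?", {"monday", "tuesday", "wednesday"}),
]


def run_dialogue(user_turns, max_retries=2):
    """Drive a fixed sequence of questions over recognised user utterances.

    Returns (filled_slots, transcript); filled_slots is None if the
    dialogue breaks down because a slot could not be filled.
    """
    turns = iter(user_turns)
    filled, transcript = {}, []
    for name, prompt, vocabulary in SLOTS:
        for _attempt in range(max_retries + 1):
            transcript.append(prompt)
            answer = next(turns, "").strip().lower()
            if answer in vocabulary:
                filled[name] = answer
                break
            transcript.append("Sorry, I did not understand.")
        else:
            # Every attempt failed: communication has broken down.
            return None, transcript
    return filled, transcript


print(run_dialogue(["London", "Paris", "Monday"])[0])
print(run_dialogue(["I'd like to go to Paris from London", "erm", "what?"])[0])
```

The co-operative caller who answers each prompt with a single expected word is served; the caller who volunteers a perfectly natural sentence is not, even though it contains all the information the system asked for.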
Speech enquiry systems are perhaps the closest complete system to a conversational machine, but the goal-directed nature of the dialogue, an inability to obey conventions in turn taking, and frequent mis-recognitions requiring multiple confirmations all lead to an impoverished experience. Manufacturers usually allow users to operate such systems with tones from the phone keypad instead of using speech if they wish.
[Robot AI Sven wants a better battery …]
"Sounds good to me. We'll get one."
"I have already ordered it in your name and it was delivered this morning."
that a little high-handed?"
"Dictionary definition of high-handed, an adjective meaning overbearing or arbitrary. That is not an arbitrary decision but a logical one that you have agreed with. Overbearing is defined as a domineering action or behaviour. I did not attempt to dominate, therefore do not understand the application of this word. Could you explain …"
"No! I take it back. A mistake, right? We need the battery, I would have ordered it in any case, you merely helped me out. Thanks a lot."
Brian regretted the last, but hoped that Sven's phonetic discriminatory abilities weren't that finely tuned yet to enable it to determine the presence of sarcasm by the inflection of words. But he was sure learning fast.
Harry Harrison & Marvin Minsky, The Turing Option, Warner Books, 1992
In SF it is rare to see a child robot learning to become an adult robot. Since robots are manufactured rather than grown, they are born with the physical abilities of an adult, and authors seem to imply that this also means they are born with the mental abilities of an adult. Most SF computers and robots that use language also seem to have their language skills 'built-in'; they do not seem to need to learn how to use language the way humans do.
There are perhaps three ways a robot could get a language faculty: (i) it could learn language itself, individually, through experience; (ii) it could be given a copy of the artificial brain of a different robot that had already learned language; or (iii) it could be programmed to use language by its human designer. None of these seems ideal. The first alternative means that the robot manufacturer would have to set up robot schools to teach robots basic skills; the second means that all robots would have the same language performance (and memories); the third means that we will have to understand language much better than we do now. If we have to build a watertight computational model of language before we build a robot, then we may be waiting a long time.
Actually the trend in language technology has been away from explicit programming of language skills towards machine learning algorithms that look for and exploit regularities in language that have been uncovered by statistical analysis of a large corpus of language examples. There is already a shift, then, towards providing our computer systems with the ability to learn how to use language, rather than with the facility directly. However, progress here is limited by the type of language data available. The data tend to be large collections of text taken from books and newspapers. They are not collections of linguistic interactions in which the meaning and implicatures of each utterance have been exhaustively established in some machine-compatible form. Thus what these systems learn concerns the surface form of language; they do not learn anything about what it refers to.
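One of the simplest regularities that can be extracted from a corpus is which word tends to follow which. The sketch below counts word pairs (bigrams) in a three-sentence toy corpus invented for illustration; real systems use far larger corpora and more elaborate models, and the function names `train_bigrams` and `most_likely_next` are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigrams(corpus):
    """Count, for every word, how often each other word follows it."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for previous, following in zip(words, words[1:]):
            counts[previous][following] += 1
    return counts


def most_likely_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Toy corpus, invented for illustration only.
corpus = [
    "the robot opened the door",
    "the robot closed the door",
    "the robot spoke",
]
counts = train_bigrams(corpus)
print(most_likely_next(counts, "the"))
print(most_likely_next(counts, "zebra"))
```

The model has learned that "robot" is the likeliest word after "the" in this corpus, which is a fact about the surface form of the text. It has learned nothing about robots or doors.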
You get a sense of what is lacking in such superficial analysis of language when you use an information retrieval system to find documents based on an enquiry, for example when using the Google search engine to find pages on the World Wide Web. The documents that are calculated to "match" the phrase you type are only related in the sense that they contain the words you used. This literal-minded search leads to a number of weaknesses: (i) words have multiple senses, so that some documents containing the words are about something completely different, for example a search for "speech" (meaning spoken language) also finds instances of politicians' speeches; (ii) documents are treated as bags of words, so that there is no necessary match of meaning, for example a search for "flights to Barcelona" will find documents about flights from Barcelona; and (iii) the same sense can be described by multiple words, so that some documents that are extremely relevant are not returned because they don't happen to use the words you used, for example a search for "oil-paintings" may not match documents describing "old-masters".
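The three weaknesses can be reproduced with a few lines of code. This is a deliberately crude sketch of word-overlap matching and makes no claim about how any actual search engine ranks pages; the function names `bag_of_words` and `matches` and the example documents are hypothetical.

```python
import re


def bag_of_words(text):
    """Reduce a text to the set of words it contains, ignoring order."""
    return set(re.findall(r"[a-z]+", text.lower()))


def matches(query, document):
    """A document 'matches' if it contains every word of the query."""
    return bag_of_words(query) <= bag_of_words(document)


# (i) Multiple senses: the word matches, whatever sense was intended.
print(matches("speech", "The minister gave a speech on taxation"))

# (ii) Bags of words: word order, and hence direction of travel, is lost.
print(matches("flights to Barcelona", "Cheap flights from Barcelona to London"))

# (iii) Different words, same sense: a relevant document is missed.
print(matches("oil-paintings", "A gallery of old masters"))
```

The first two queries succeed for the wrong reasons and the third fails for the wrong reason, because at no point is the meaning of either the query or the document consulted.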
Of course a superficial statistical analysis of language has advantages: it is very fast, it has a low computational overhead. And importantly, since there is not yet available a deeper analysis of good quality, it is the best method we currently have for some tasks.
Scene 83: PLANET SURFACE
Jason looks up to the rock ledge. The unseen DEMONS continue to
chant "GORIGNAK... GORIGNAK..."
JASON: Wait, the pig lizard is gone. Why are they still chanting for the
GWEN: Turn on the translation circuit.
Teb flicks a switch and we hear the Demons in English.
DEMONS: ROCK... ROCK... ROCK...
David Howard, Galaxy Quest, 1999
Computer systems in SF are particularly useful for translating alien languages. How they do this in the absence of a corpus of analysed and labelled samples is never made clear. Not only does an alien have unknown communication hardware, unknown phonetic coding, unknown lexicon and unknown syntax, it also has an unknown culture within which the elements of language are grounded. A few minutes of listening to the sounds that they make will not get you very far in terms of understanding.
But even in translation between human languages, where there are many resources such as bilingual dictionaries, grammar studies and even parallel corpora, automatic translation systems still leave much to be desired. Some problems arise from the ambiguity of language, where one utterance can have many literal meanings. But other problems are related to the fact that a good translation is not just a representation in another language of the literal meaning of a sentence, but a representation that carries the same implications. While it may be the case that some literal translations have the same implications, there are many instances where a literal translation leads to startling results. This is a rich vein of humour, as in the translations into English of notices in foreign hotels. One Finnish hotel's instructions in case of fire read: "If you are unable to leave your room, expose yourself in the window".
This survey has shown that there remains an enormous gap between the capabilities of science fact and the conversational machines of science fiction. In Speech Recognition there are still problems in recognising speech in everyday situations, where an unknown speaker is speaking spontaneously and in a noisy environment. In Speech Synthesis there are problems in generating a prosody that matches the information structure and communication goals for an utterance. There is also the problem of how to give a machine a vocal tract of its own, rather than to re-use recorded speech. In Grammar and Meaning there are problems in dealing with the ambiguity of language and the difficulty of finding the most appropriate meaning for an utterance without knowledge of the context within which it was uttered. In Pragmatics there are problems in determining the implications of utterances beyond their literal meaning, involving the machine being able to understand why someone would have said such a thing. In Conversation there are problems in following human conventions in the structure and execution of dialogue. In Language Acquisition there are problems in grounding language in real experience. And lastly, in Translation there is the problem that to translate is to know all of these things in two language cultures.
In the absence of actual conversational machines, science fiction writers have modelled the linguistic abilities of their creations on the linguistic abilities of humans. However this generates unrealistic expectations about the future of language technology. Since machines will not be human, and since human language is so fundamental to what it means to be human, it is unreasonable to expect a machine’s use of human language to be human-like. As I have indicated, some aspects of a machine's use of human language in SF are just silly. There is no necessity for a machine to have a human vocal tract or human emotions. But the gap between imagination and reality does expose the significant problem in getting artificial systems to understand human language in anything like the way that humans do. If we were to sum up the weakness of current technology in one phrase, then it would be that our systems process only the surface form of language without understanding what it means.
What would constitute understanding by a machine? In this essay I have implicitly assumed that understanding can be broken down into four stages: a recognition stage where sounds are decoded into word strings, a syntactic-semantic stage in which a number of possible literal meanings are established, a pragmatic stage in which the intended meaning and its implications are resolved, and finally a communication stage which demonstrates that the utterance has been processed and understood. This is the conventional way of thinking about language processing, and it is predicated on the assumption that understanding is fundamentally a linguistic problem. On this basis, a conversational machine would emerge once we had programmed a computer to achieve human performance on all of these stages. If a computer were able to process an utterance through these stages as a human would, we would admit that the computer had understood the utterance.
I think the actual challenge we have is different. In some senses our machines can already process utterances through these four stages of understanding. It is just that they are very limited in the type of utterance they can process, and their overall accuracy is rather poor. Our current systems seem to make elementary mistakes of word recognition, meaning and inference. They have human-like adult voices but sound as if they have no understanding of what they are saying. The problem I think is that we expect machines to have a human level of performance. We program into them the complexity we recognise in adult human language without concern for how this prevents understanding being achieved. Thus we ask our dictation systems to recognise text they couldn’t possibly understand, or our text-to-speech systems to read out text that a human has written. The idea, of course, is to create a language technology that is useful to us. But in requiring adult human competence we hit up against the significant problem that human language is alien to a machine.
In this essay I have tried to show that human language is perhaps more peculiarly human than most technologists and SF writers have accepted. The quality and structure of speech, the encoding of meaning, the decoding of implicatures are essentially founded in the physical and mental limitations of humans and their specific concerns on this planet. Machine intelligences, when they arise, will not have such concerns, and their view of the world will not be the same as ours. As a consequence the way in which they use language will be alien to us, just as our use of language will be alien to them. The best we can hope for is that we can learn to adapt to their way of thinking and vice versa. This will not allow us to communicate in our natural language, but perhaps communicate enough to get some useful jobs done.
Current technological development to program machines with human language, then, will simply lead to superficial approximations and simulations of language. We will be trying to build an alien-human translation system when the two cultures do not share the same world model. This is not to say that translation of sufficient quality could not be achieved to be useful for some tasks, perhaps in limited domains. But to create a machine with understanding of both breadth and depth requires that the machines develop their own translation systems. Through experience with the world and interactions with humans they would develop a mapping that was consistent with their own experience, even if that was alien to our experience.
We need to let computers learn themselves how best to use human language to achieve their objectives.
If we give them a need to communicate, a desire to communicate, then this will force them to develop a version of human language of their own, a voice of their own. Of course it will be a modified form of human language, but with its own peculiarities of lexicon, syntax, semantics and pragmatics. When we hear it we’ll recognise that we are talking to a machine. When the computer speaks it will be in utterances that it really understands, because they will be utterances it will have made up itself to meet its own demands for communication. When the computer listens, it will only accept utterances from us that it understands; utterances that meet its preconceptions about the topic of conversation. When we use human language that it can't understand we won’t consider that to be a weakness of the computer – it will be a normal error of translation between cultures.
How can we arrange for a computer to learn a version of human language that meets its own communication requirements? Only by letting the computer learn language the same way that human children learn language, through experience. We should be programming computers to learn how to communicate, not attempting to build into them adult human language.
Science fiction writers got it wrong when they assumed that future computers and robots will use human language the way that humans do. But they got it right when they also gave those computers and robots the ability and desire to communicate. The “personality” of these machines expressed in their use of language will not come from limitations of how their language sub-system was programmed, but from how they have learned to use language in pursuit of their own goals.
Dictation programs are available from a number of vendors. The program ViaVoice by IBM is described at: http://www.software.ibm.com/speech/. The program NaturallySpeaking by Dragon Systems is described at: http://www.scansoft.com/naturallyspeaking/.
The earliest chatterbot program was called Eliza, which took the role of a psychotherapist, see: J.Weizenbaum, Eliza – a computer program for the study of natural language communication between man and machine, Communications of the ACM, 9 (1966), 36-45. The Loebner prize competition for the best chatterbot each year is described at the web site: http://www.loebner.net/Prizef/loebner-prize.html. The winner in 2000 and 2001 was a chatterbot called Alice, which is described at http://www.alicebot.org/.
Paul Grice’s conversational maxims are described in: H.P.Grice, Logic and conversation, in Speech Acts ed P.Cole and J.Morgan, Academic Press 1975.
Microsoft’s database management software SQL Server 2000 has the facility for posing database queries in English. This is described at: http://www.microsoft.com/sql/evaluation/features/english.asp.
A number of text-to-speech systems can be tried out over the web. Some can be found through the meta-site www.speechandhearing.net.
Isaac Asimov's collection of robot short stories is published as "The Complete Robot", Voyager Books, 1995. Isaac Asimov died in 1992.