Hierarchical Prediction

Mark Huckvale - University College London

This article summarises a seminar given on 27th November 2002 at University College London called "Hierarchical Prediction - bottom-up analysis of perceptual data".


This talk will present a contemporary version of a hypothesis first put forward by Craik in 1943: that the task of our perceptual system is to facilitate prediction of future events. I will try to show that this view of perception fits well with what we already know about how the perceptual system operates, and is also compatible with current engineering models. Significantly I believe that it provides a solution to the "problem of the robot brain": how any newborn organism can survive with no inbuilt knowledge of the world outside its body. I will try to predict where the idea will take us with regard to artificial perceptual systems and the processing of spoken and written language.


"Over the past couple of years I have been thinking about how we might apply the lessons we have learned in building automatic speech recognition systems to the functioning of the human perceptual system more widely. In other words to think more seriously about the engineering problems the perceptual system has to solve. In this talk I would like to introduce a new version of an old idea which seems to address some of the weaknesses of the current model. My motivation is to get general feedback from you to see whether these ideas have promise and to get some specific suggestions for where the approach is lacking in force or practicality."

A thought experiment

"I'm going to start with a thought experiment that I hope will seem relevant by the end of the talk! In this thought experiment, you awake after a night of "uneasy dreams" (as Kafka wrote) to find yourself locked inside a room that also seems to be the control centre of a fiendish machine. Everywhere you look are lights and buttons. It is somehow made clear to you that the only way you will survive here is to operate the machine. You have to observe the lights and press the buttons and only if you do that the right way will you get air and food and warmth. If you don't operate the machine correctly you will freeze, starve or suffocate. There are a few obstacles too, firstly there is no instruction manual; secondly there seem to be around a million lights and over 100 thousand buttons; thirdly, all the lights and buttons are unlabelled! So far, as I'm sure you'll agree, your cause is hopeless. There are simply too many possible actions and combinations of observations and actions. There is however, one additional piece of information you need to know, and surprisingly this will transform your probability of survival. In fact you are locked inside the brain of a robot. The lights are connected to the sensors of the robot, and so reflect what the robot sees, hears, touches, tastes and smells. The buttons are connected to the robots motors and solenoids, and so pressing them causes the robot to move around in its environment, to pick up things, to make noises with its vocal apparatus and so on. Now why does this new information change the situation? Well simply because this new problem must be soluble, because it is just the problem that any newborn infant faces. The new born mind comes with no instruction manual on how to operate its body, has to work a fiendishly complex machine and has no prior knowledge of what all the afferent and efferent nerves actually do. (I may be overstating the case slightly, see the FAQ). 
I see this thought experiment as telling us that an infant needs a powerful learning system which can sort out the complexities of its existence automatically, from sensory data alone. I will return to this thought experiment at the end of the talk."

Objects in vision

"What is the perceptual system for? In the next few slides I will try and show you one of the main mysteries of perception. What you see here are not chocolate fudge brownies and vanilla ice cream witch chocolate sauce (I'm sorry to whet your appetite at this time of day!). No, what you see is just a pattern of light and dark on the projector screen; the chocolate fudge brownies are in your mind. One of the amazing things about our visual system is that it takes in patterns of light and dark and delivers mental representations of the real three dimensional objects that gave rise to that patterning. Our retina collects a two-dimensional pattern of light, but our mind sees real individual objects. I'll just call these "mental" objects. Most of the time the mental objects are pretty good models of the real objects, though of course sometimes we are fooled."

Objects in sound

"It is not just the visual system that delivers mental objects from sense data. Listen to this piece of music [plays "In the mood" by Glenn Miller and his orchestra]. In these pressure changes in the air you can hear musical instruments: probably trumpets, bass, drums. But look at the waveform of the pressure changes - the very movements of your ear drums - here you will not find any kind of instrument, just a one dimensional signal that is constantly changing with time. The instruments, the notes they play and the tune itself are only in your mind."

Visual grouping

"Of course what I have said so far has been known for centuries. The challenge we have in cognitive science is to provide an information processing account of how sense data is converted to these mental objects. This problem is much less well understood. To see some of the complexities, look at this pattern of dots. It is difficult to perceive any mental object that might have given rise to these sense data. But look now [the image is actually the first frame in a video, when the video plays you see that the dots are attached to the head, body, arms and legs of a person walking]. When the dots move you see that the paths they take are co-ordinated and fit the mental image of a person walking. How does the perceptual system do this? The standard account is called "visual grouping". It is supposed that the perceptual system groups together parts of the sense data which arise collectively from the same real-world object, so that they can be recognised and represented as a mental version of that object. The challenge for information processing accounts of visual grouping is finding how such complex mental objects arise from such little, ambiguous data."

Audio grouping

"Perceptual grouping operates in the other senses too. This is an example of auditory grouping. You will be played sounds from two sound sources, at times you can hear two independent sounds (i.e. two mental objects) and other times you hear only one. [Plays auditory streaming example from IPO Auditory Demonstrations CD]. You should hear that when the tones are close in frequency, our perceptual system groups them into a single sound source, whereas when they are well-spaced in frequency our perceptual system simply presents them as two sound sources. Although the phenomenon of auditory grouping has been well studied, I believe we do not yet have a coherent information processing account of how it works with real complex stimuli."

"Mental" objects

"At this point I do not want to appear as saying anything new or radical. In vision, one famous name is David Marr, who certainly understood this problem very well. Marr said: 'The true heart of visual perception is the inference from the structure of an image about the structure of the real world outside'. And similar ideas have been presented about audition by Albert Bregman: 'The act of hearing may be likened to the work of a cartographer constantly drafting maps of the auditory scene.' And both of these pay credit to the work of the Gestalt psychologists of the 1920's, who have been quoted as saying: 'Characteristics of stimuli cause us to structure or interpret a visual field or problem in a certain way.' The Gestalt approach to perception is that the perceptual system follows rules which allow it to perform grouping. These rules reflect certain patterning in the sense data arising from real world objects; they go by the names of proximity, similarity, continuation, simplicity, symmetry and closure. I want to argue next that these rules are just observed regularities of our perception, not the means by which grouping is performed. In particular I believe that it would be impossible to engineer an artificial perceptual system that worked as well as our perceptual system simply by implementing the gestalt rules for grouping."

Standard model of perception

"This diagram is meant to represent in boxes and arrows form the 'standard' model of perception according to current accounts (in for example the computational auditory scene analysis literature). At the top left you'll see represented in a box everything about the mind that is not the perceptual system! Since this is not the focus of my talk I won't say much about the conceptual system of the mind or how it recognises and manipulates mental objects; I am just concerned with how mental objects themselves may be derived from sense data. In this diagram the little blocks are the mental objects derived by rules from sense data at the bottom of screen. Notice the process seems to be purely bottom-up and automatic - there is little opportunity for the conceptual system to influence how it works. The flow of information downwards seems to be rather constrained in this model, limited to providing certain likelihoods and preconceptions about the world. I think this model, as I have presented it, has some severe problems. Firstly the rules seem to operate by magic: they take in sense data and deliver objects automatically and infallibly. In this account there are no processes of adaptation, goal-directed search or optimisation that would be needed in real life to account for variable, incomplete and noisy data. Of course these processes may be inside the box labelled rules, but if so how does that box know what the best interpretation of the sense data is without knowing the consequences of its choice? Secondly how are these grouping rules acquired? The only mechanism in this model is the feedback from higher levels, which could reward or punish the perceptual system if the mental representation it delivered turned out to be inadequate. This is called 'reinforcement learning', if the organism makes some particularly good or particular bad response to some perceptual interpretation, then the perceptual system itself can be informed. 
The question is whether such reinforcement learning, which only delivers 'good' or 'bad', is adequate for learning how to do something as complex as grouping. To answer that 'the rules are innate' seems also insufficient, since although knowledge of one's own body may be innate, knowledge of the structure of the world is not. Any system that relied on evolution to understand real world objects would surely be unable to cope with change."

Standard model of perception

"To try and make my criticisms of the standard model more clear, I suggest that this account has not learned anything from the advances that have been made in recent years in using pattern recognition systems for processing real world data. For example, in building speech recognition systems, engineers have found that success relies on not making early decisions about the nature of the data. It is much better to keep one's options open, maintaining a range of possible hypotheses and only make decisions once all factors have been considered. For example it is bad practice to label a speech sound as /p/ independent of what else is happening in the sentence. It may be that the rest of the word is "anana"! Part of this process of delaying decisions is to give time for higher level context to play its part. Early incorrect decisions are extremely hard to undo. How can one weigh up two interpretations if the lowest level of information has been lost? Speech recognition systems exploit a lot of statistical information about the relative likelihood of words, word pronunciations and word sequences; these are required so that the system can find the best interpretation of the signal, which in turn leads to the greatest probability of finding the correct answer. Another lesson from pattern recognition is that reinforcement learning is very slow. The problem is that the feedback about the quality of the processing is limited to 1 bit of information delivered some time after a large chunk of processing has been performed. Such feedback does not say where the processing was in error, and the learning system has to allocate blame itself to its individual processing elements. In fact this weakness of reinforcement learning is one of the arguments that has been used to support the idea that language processing is innate in humans. 
The argument goes that if children only get feedback like 'good' and 'bad' after they have taken part in a whole complex linguistic exchange with an adult, they will never get enough feedback for them to learn the structure of language itself within their lifetime. Finally, our engineering experience suggests that systems simply based on rules devised by 'experts' are fragile and often fail when exposed to novel, variable or errorful data. The most robust systems are those that can adapt automatically to real world data directly."
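The point about delaying decisions can be illustrated with a very small sketch. It assumes a made-up pair of acoustic likelihoods for the first segment and a made-up two-entry lexicon; none of these numbers come from a real recogniser, and the names `ACOUSTIC` and `LEXICON` are hypothetical.

```python
import math

# Hypothetical acoustic likelihoods P(sound | phoneme) for the first segment.
# The numbers are invented for illustration: the sound is slightly more /p/-like.
ACOUSTIC = {"p": 0.6, "b": 0.4}

# Hypothetical lexical probabilities: "panana" is not a word, so it is very unlikely.
LEXICON = {"banana": 0.999, "panana": 0.001}


def early_decision():
    """Label the first segment on acoustic evidence alone."""
    return max(ACOUSTIC, key=ACOUSTIC.get)


def delayed_decision():
    """Keep both hypotheses alive and score whole words in the log domain."""
    def score(word):
        return math.log(ACOUSTIC[word[0]]) + math.log(LEXICON[word])
    return max(LEXICON, key=score)


print(early_decision())    # p
print(delayed_decision())  # banana
```

The early decision commits to /p/ and cannot recover; the delayed decision lets the lexical evidence outweigh a small acoustic preference.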

Craik, 1943

"So where do we go from here? I have criticised the standard model of perception, suggesting that as proposed it is poorly engineered. What we need is a new idea, so instead I am going to present an old one. Kenneth Craik was a philosopher who worked at Cambridge, England during the second world war, and who died tragically in 1945. His one major published work is the book "On the nature of explanation" which was published in 1943. In this book Craik addresses the issue of what it means to "have an explanation" of something. He presents a scientific solution to this problem by suggesting that explanations come from confirmations of the results of predictions from theory. This in turn leads to his idea that we carry around in our heads a mental model of the real world on which to perform experiments. He says: 'If the organism carries a 'small-scale model' of external reality and of its own possible actions within its head, it is able to try out various alternatives, conclude which is the best of them, react to future situations before they arise, [and] utilise the knowledge of past events in dealing with the present and future '. Thus as we walk around the world we continuously make predictions (using our internal model) of what we expect we will see/hear/feel in a few moments. And our model is continually refined by the feedback between prediction and reality."

What is perception for?

"We are now at the heart of this talk. What if we put together our observations about the function of the perceptual system with Craik's idea of mental models? I think we get a new insight into how the perceptual system might process information. The argument goes like this:

  1. The perceptual system delivers mental representations of real-world objects. That is, the mental objects it delivers stand for real objects in the real world.
  2. Mental objects help us predict future sensation. Craik gives us the idea that the utility of mental objects is that they allow us to predict future events - expectations of how sense data will change given that such objects exist in reality.
  3. The perceptual system delivers just those mental objects which are useful to predict future sensation. Assuming that the perceptual system is well engineered and does the best job it can for the organism (not unreasonable given evolution), then the best kind of mental objects it can deliver are precisely those that will benefit the future prediction of sense data.
The combination of these two ideas leads to a radical conclusion: that the perceptual system has a real job to do, one where we can even state criteria for success. The perceptual system does its job properly if it maximises the ability of the organism to predict the future."

Example: visual perception

"To try and make this idea more concrete, consider this object. [rotating image of a computer-generated three-dimensional shamrock]. You've probably never seen such a thing before, but when you look at it you get some sense of its three dimensional shape. That is your perceptual system has delivered a mental object with 3D characteristics. The perception as prediction hypothesis suggests that this particular mental object is nothing more than the best way of explaining how the image in front of you is changing with time. In actual fact, this particular object does not exist outside the computer, and you've only seen a number of 2D images of it. Your 3D mental representation is just an efficient way to encode the data and how it changes with time."

Perception as prediction

"The blocks and arrows model of perception as prediction looks significantly different to the standard model. Similarities are that we still have sense data at the bottom, mental objects associated with the outcome of perceptual processing and a world model that performs reasoning on those objects. However instead of a rule-based system trained by reinforcement learning, we have a predictor which takes past sense data and predicts current sense data - and this can be trained by feedback of the error of prediction. There are two important ideas embedded in this picture: firstly that the perceptual system learns how to operate by continual feedback from real data through supervised training; secondly that the mental objects are themselves nothing more than the hypotheses found by the predictor to be most useful in predicting future sense data. I have drawn the mental objects as being part of the predictor rather than just the output of the predictor because the activity of these objects is also the current state of the predictor - they provide the context within which the next time frame of data is interpreted."

Perception as prediction

"I'll now try and give some practical advantages for this view of how the perceptual system operates. The first is that for the first time we have a clear idea of the job of the perceptual system. Its job is to make a prediction about what is going to happen next. The significance of such a straightforward task is that we can apply engineering principles to optimise its operation: we can define an objective function for success and apply the mathematics of function optimisation. Next we see that the mental objects need not be inherent in the system - so the problem of implementing grouping rules is finessed (the objects emerge from data, we don't need rules to find them). The objects at the output of perception are just the hypotheses found most useful in predicting the future. If they have real-world relevance it is because sense data is itself moulded by such objects in reality. Next we see that recognition of objects is the process of the conceptual system giving names to the mental objects discovered by the perceptual system. Finally we see that the perceptual system adapts to real data through a powerful learning paradigm - supervised training, and not through reinforcement learning. Supervised training provides rapid, diagnostic feedback moment by moment, so that perception can rapidly accommodate new experiences."

Prediction in practice

"Prediction is not a new idea of course, and it has a long history in scientific thought. Some of the earliest applications of prediction were for the prediction of tides (Lord Kelvin's tide predictor of 1876 can be seen in the Science Museum). More recent applications of the idea are seen in weather prediction or in the forecasting of stock prices. If you have a digital mobile phone, then the speech coding system (called GSM) performs data rate reduction by predicting the waveform shape from one moment to the next and so needs only to communicate the prediction error. The underlying algorithm is called LPC: Linear Predictive Coding. If you have a DVD player, then the video coding standard, called MPEG2, uses prediction to reduce the amount of information needed to store a video frame - in effect it uses one frame to predict the essential properties of the next. In all of these applications of prediction what you will see is that the predictor has an implicit model of the phenomena it is predicting: the tides predictor has something within in that 'stands for' the moon; the weather predictor has something that stands for heat energy from the sun; the mobile phone has a model that stands for a speech production system. Inherent in the idea of using prediction successfully is that the predictor makes assumptions about the problem through the use of internal states to represent external objects."

Real world prediction

"Well that has been quite hard going, so for some light relief let's look at a video clip! While you are looking at this, consider how the predictions made by your perceptual system help you interpret what you see and hear. [plays clip from the 'Importance of being Ernest']."

Perception operates at a number of levels

"Visually we saw all kinds of perceptual grouping going on in that clip. We were able to group regions of colour and pattern into individual objects in the scene: chairs, people, clothes, limbs. When these objects moved in the scene they continued to form a single mental object even if they changed position or shape. When objects moved, we were able to track the movements of their component parts, so that we didn't expect noses to move independently of faces, or hands independently of arms. When the camera shifted from person to person we identified the change, and we tracked and associated the two voices with the appropriate person. But of course one of the most interesting ways in which we find we can use our powers of prediction in this clip is in our ability to predict the spoken language."

Hierarchical prediction

"If I had stopped the video right in the middle of a dialogue turn, we would have been able to make all kinds of predictions about what was going to happen next: we could have predicted roughly: the duration of the next glottal cycle, the spectral shape of the next sound element, the next phonetic segment, the next word, elements of the rest of the syntactic phrase, some idea of the next dialogue act or of the next topic of conversation. In this famous exchange, I am sure many of you know the exact words! But that is not a necessary part of the argument. The point I want to make here is that prediction operates at many levels and at different timescales. When I have been saying that the perceptual system predicts the future, what I need to add is that it predicts the future of the sense data and the future of the mental representations derived from that sense data. Prediction is hierarchical, and at each layer we predict the future of some input data through the generation of mental objects which are themselves just input data to another level of prediction. A diagram might make this clear."

Hierarchical prediction machine

"This is part of a prediction system that operates at multiple levels - I call it a hierarchical prediction machine (HPM). Each level contains a prediction system that learns to predict the next time step of some input data vector. In making that prediction it generates hypotheses stored in the form of an internal state vector which are passed to the level above as that level's input vector. However to make its prediction it can make use of three elements: the past input vector, the state vector derived from past input, and the prediction of the next state vector coming down from the next highest level. These three pieces of information are combined to predict the next value of the input vector. It is conceivable that each predictor component operates in fundamentally the same way, but that the size of the input vector, state vector and the amount of smoothing in time and vector dimensionality may vary from level to level. For example, it may be a good idea in vision for higher levels to describe larger objects on bigger timescales."

What an HPM could do

"Just to make the idea of a hierarchical prediction machine more concrete, here are a couple of examples of what you might use one for. You might wire an HPM up to the output of a video camera and walk around the room with it, so that the HPM gets a view of every object in the room from lots of angles. What you'd hope is that the HPM would develop internal states representing, for example, walls, furniture, lights, and the relative sizes and positions of objects. You might be able to request of it, what would I see if I put the camera here in the room and looked in this direction? Or here is an audio example: we connect an HPM to a talk radio station and let it process a 24-hour/7-day signal of people talking. We would hope that the HPM would develop states corresponding to syllable components, to common words and phrases, to particular broadcasters, to particular styles of programmes. We would be able to ask what is the next most likely word, or person, or programme. I hope you can see why I am so excited about the idea of an HPM, and why it matters whether it would be possible to build one!"

Could we build an HPM?

"I am not sure that a system yet exists that performs hierarchical prediction as I have described it. However, it is evident that there are a range of technologies available in pattern recognition which could be called upon to try to build one. For example we have a range of powerful learning algorithms for sequences, such as recurrent neural networks, stochastic context-free grammars, and even a new idea called hierarchical hidden Markov models. Such algorithms are able to learn the statistics of sequences and can be trained by supervised training methods; although I have yet to see these systems being used to create a multi-level predictor. To build a learning system also requires an 'objective function', an algorithmic statement of the performance of the learning system. This seems pretty straightforward for prediction, just try to minimise the error of prediction. But there are interesting alternatives related to probability, in particular trying to minimise the number of bits used to encode a sequence of sense data, which is equivalent to finding a minimum entropy description of the data. Finally, there are some interesting issues to do with the constraints within which the learning system has to operate. The memory and processing requirements of a predictor that worked with the same fidelity sense data as a human being would be enormous; but there is good reason to think that large volumes of high quality data are beneficial to prediction. It is the opposite to the classification problem, where it is important to discard all but the most discriminatory information. In prediction, information which is highly correlated is potentially useful in finding out how grouping whould be done. We are fortunate to live at a time when available computational power in the form of digital computers is increasing exponentially. One estimate is that within ten years digital computers will be able to process and store as much information as a human brain. 
Speed is another interesting issue, because it would make little sense to build an HPM which worked more slowly than data was available - we would just have to store and delay the data. It would make more sense to build a system that worked in real time, so that we could just connect reality up to it!"
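The two kinds of objective function mentioned above can be written down directly. The sketch below assumes a mean squared error for the first, and for the second the code length in bits under a simple adaptive first-order symbol predictor with add-one smoothing; that particular predictor is an assumption chosen only to make the bit count computable.

```python
import math
from collections import defaultdict


def mean_squared_error(predicted, actual):
    """Objective 1: average squared difference between prediction and outcome."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def code_length_bits(sequence, alphabet):
    """Objective 2: bits needed to encode `sequence` when each symbol costs
    -log2 of the probability the predictor assigned to it beforehand.

    The predictor here counts which symbol followed which (add-one smoothed).
    """
    counts = defaultdict(lambda: defaultdict(int))
    bits = 0.0
    previous = None
    for symbol in sequence:
        context = counts[previous]
        total = sum(context.values())
        probability = (context[symbol] + 1) / (total + len(alphabet))
        bits += -math.log2(probability)
        context[symbol] += 1            # learn from what actually happened
        previous = symbol
    return bits


regular = "ab" * 20
uniform_cost = len(regular) * math.log2(4)   # no prediction, 4-symbol alphabet
print(f"without prediction: {uniform_cost:.1f} bits")
print(f"with prediction:    {code_length_bits(regular, 'abcd'):.1f} bits")
```

The better the predictor captures the regularities in the data, the fewer bits are needed, so minimising code length and improving prediction pull in the same direction.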


"I have said already that I have been thinking about these ideas for a couple of years, so it should not be a surprise to find out that much of my thinking has been about how I might actually build an HPM. I certainly don't claim that the demonstration system I am going to show you is a practical implementation. You should treat it more as an early prototype which highlights some of the possibilities and also some of the design problems. In this demonstration [see first picture below] the input to the predictor is a single vertical column of pixels taken from an image of a character sequence. You can see the input to the system as the identified narrow column of pixels about one third of the way across the top row. The pixels to the left of that column are the past inputs to the HPM, while the pixels to the right are the future inputs. The second row in the picture displays the predictions of the system. The identified column one third of the way across is the prediction for the next column of pixels that the system thinks will follow the input column. Pixels to the left of this output column on the second row are past predictions. Ideally the predictions should match the inputs exactly, but of course be one column to the left of the inputs.

In this first picture you can see that the predictions are not working well [points to the past predictions at the left of the bottom row]. In fact the system has only been trained for a very short time on the input data. Training involves presenting the pixel data in sequence many times to the HPM and letting it adapt its internal operation according to its prediction error. I will describe the actual mechanism in a moment, but this is just supervised training, where the input is the current column and the right answer is the actual next column.
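The training data for this kind of supervised learning can be sketched as follows, assuming the image is held as a list of pixel rows. The helper names `columns` and `training_pairs` are hypothetical, and the tiny image is made up for illustration.

```python
def columns(image):
    """Turn a list of pixel rows into a left-to-right list of pixel columns."""
    return [list(column) for column in zip(*image)]


def training_pairs(image):
    """Pair each column (the input) with the column that follows it (the target)."""
    cols = columns(image)
    return list(zip(cols[:-1], cols[1:]))


# A made-up 2-row, 3-column image.
image = [[1, 2, 3],
         [4, 5, 6]]
for current, following in training_pairs(image):
    print(current, "->", following)
```

Every column except the last supplies one training example, so the data labels itself and no hand annotation is needed.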

However this picture [above] shows what happens when training is continued over 15,000 training cycles. You can see immediately that the quality of the predictions is very high, and that the individual characters are readily identifiable. This predictor has learned the digit sequence almost pixel perfect in that it knows at any point in the input sequence what column of pixels comes next. I should say at this point that the memory of the system is very limited, barely more than one time step, so it is not just memorising the pixels. To solve this problem it must generate internal states which identify where the input is within the sequence. I have not yet spoken of the area of the display to the right of the prediction output. This is the 'fantasy' area. In this area, the HPM is being asked: 'OK so you've made a prediction, if this was true, what do you think would happen next?' Thus the area at the bottom right of the display is the HPM's prediction for 2,3,4,... timesteps ahead. From the fantasy area you can see that the quality of its one step prediction is so high that it is able to predict the entire sequence of digits.
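The 'fantasy' procedure amounts to feeding each prediction back in as if it were the next input. The sketch below assumes a predictor exposed as a function from one column to the next; the stand-in predictor simply rotates the column and is not a trained model. A predictor with internal state could be passed in as a bound method instead.

```python
def fantasy(predict_next, first_input, steps):
    """Roll the predictor forward on its own output for `steps` time steps."""
    outputs = []
    column = first_input
    for _ in range(steps):
        column = predict_next(column)   # treat the prediction as if it were true
        outputs.append(column)
    return outputs


def rotate(column):
    """Stand-in predictor for the demo: shift the pixels up by one."""
    return column[1:] + column[:1]


print(fantasy(rotate, [1, 0, 0], 3))  # [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

Because errors are fed back in as inputs, they compound with each step, so a clean fantasy sequence is a demanding check of one-step prediction quality.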

This third picture [above] shows an output after a small amount of training on a short character sequence. What is interesting here is the type of mistake being made within the fantasy area. You can see that the HPM is predicting characters, but they happen to be the wrong characters. This is exactly the kind of mistake we would expect for a hierarchical prediction system. We would hope that the predictor would develop an understanding of character shapes independently of its knowledge of character sequences. That way it might get the character sequence wrong, but the shapes of the incorrect characters perfect!

I shall briefly describe the internal working of the demonstration system, but if you would like more technical detail, please come and talk to me afterwards. Essentially the demonstration has three prediction layers, and each layer is a functionally identical predictor. Each predictor takes in an input frame and updates its memory of current and past inputs with this new data. The memory in fact is just an exponentially decaying image of the data, with a very high degree of damping. The combination of current input and memory forms a vector of increased dimensionality, and so this is input to a 'learning vector quantiser' to reduce the dimensionality. This procedure effectively clusters the input data, such that the distances between the input data and the cluster centres become the internal state of the predictor. This state vector (an array of distances to cluster centres) is then output to the next layer, where it becomes the input data requiring prediction for that layer. Once that higher layer has made its prediction it passes that down so that our layer has available to it the three pieces of information from which it can make its prediction. These three pieces of data are combined using a very straightforward multi-layer perceptron trained by back propagation. The output of the neural network is the prediction of the input to that layer which is then passed to the layer below. The diagram doesn't show the complexities of training and feedback, but essentially in training the system knows what the inputs are at the current and next time step, so that it can adjust its internal parameters when the prediction error is high. Both the neural network and the vector quantiser are affected by this training."
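The first two stages of one layer, the decaying memory and the distance-to-centres state vector, can be sketched as below. This is an approximation under stated assumptions: the centres are adapted with a plain move-the-winner rule rather than the learning vector quantiser of the demonstration, the perceptron stage is omitted, and the class name, `decay` and `learning_rate` values are hypothetical.

```python
import math
import random


class LayerFrontEnd:
    """Decaying memory plus cluster-distance state for one prediction layer."""

    def __init__(self, input_size, num_centres, decay=0.3,
                 learning_rate=0.1, seed=0):
        rng = random.Random(seed)
        self.decay = decay              # small value = heavily damped memory
        self.learning_rate = learning_rate
        self.memory = [0.0] * input_size
        # Centres live in the combined (input + memory) space.
        self.centres = [[rng.random() for _ in range(2 * input_size)]
                        for _ in range(num_centres)]

    def step(self, x, learn=True):
        """Take one input frame and return the state vector of distances."""
        # Exponentially decaying image of current and past inputs.
        self.memory = [self.decay * m + (1 - self.decay) * v
                       for m, v in zip(self.memory, x)]
        combined = list(x) + self.memory
        distances = [math.sqrt(sum((a - b) ** 2 for a, b in zip(combined, c)))
                     for c in self.centres]
        if learn:
            # Assumed update rule: pull the nearest centre towards the data.
            winner = distances.index(min(distances))
            self.centres[winner] = [
                c + self.learning_rate * (v - c)
                for c, v in zip(self.centres[winner], combined)]
        return distances


layer = LayerFrontEnd(input_size=3, num_centres=4)
first = layer.step([1.0, 0.0, 1.0])
for _ in range(50):
    latest = layer.step([1.0, 0.0, 1.0])
print(f"nearest-centre distance: first {min(first):.3f}, later {min(latest):.3f}")
```

The returned list has one entry per cluster centre, so its length, not the pixel count, sets the size of the input seen by the layer above.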

Artificial perceptual system

"In this talk I have introduced the idea of perception as a process of hierarchical prediction and I hope that you now have some idea that this could have some practical implementation. In my last slides I will briefly discuss some of the implications of the idea, firstly with respect to artificial perceptual systems, and secondly with respect to the processing of language. If it were possible to build a perceptual system along the lines of a hierarchical predictor, it would be interesting to link the states of the predictor to the process of deciding the actions of some artificial organism. For example one might associate the activity of certain mental states with 'pleasure' and then let the organism determine which combination of actions maximises the activity of that state! This, however, is just reinforcement learning, and I am not sure whether it is powerful enough to achieve any interesting behaviours. There is, however, an interesting observation to be made about such an artificial organism: as you can see in this diagram, the mental states are both the outcome of perception and the driving force for action. The system learns to react to these states that are triggered by external sense data. But importantly, these states are opaque to us - it may be impossible for us to determine what the activity of these mental objects actually represents. This is because the states mean something to the organism but not to us. This is the opposite of the normal philosophical conundrum in artificial intelligence known as the symbol grounding problem. If we choose to implement an artificial organism by programming the internal states (to represent 'a table', 'red', 'the word /stop/') and by training the relationship between these and external data, then the states may have meaning to us, but they have no meaning to the organism. The organism doesn't 'know' what the activity of these states represents in the real world.
In the artificial perceptual system I have described, these states arise from interactions with the sense data, and thus represent the system's interpretation of the world. In a real sense these states have meaning because they both represent and constitute the world to the organism."

Language processing

"If we consider that language decoding (perhaps up to word level or the level of literal meaning) is part of the perceptual system, we can frame the problem of sentence decoding in terms of finding the internal states that would have best predicted that sentence occurring. In other words, decoding is a kind of search process by which the best interpretation of a sentence is just the underlying representation that would have most likely given rise to the observed surface form. It is not surprising that such an idea appeals to me because this is just the formulation of sentence decoding that is used in large vocabulary continuous speech recognition systems. And part of these systems is a statistical language model that is used to rate the quality of word predictions! Another area in language where perception as prediction might have influence is in language acquisition. As I have already mentioned, part of the argument for the innateness of language is that reinforcement learning is inadequate to discover language structure. But since this new approach does not rely on reinforcement learning, the problem of language acquisition and innateness is worth reviewing. One particularly interesting aspect for me would be the acquisition of phonological categories. Of course, as mentioned in the last slide, if an artificial system did discover a useful set of states to predict language shape that would not mean that it would necessarily use classical linguistic components like noun-phrases or clauses or embedding. Worse, we would not really know what units it was using!"
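The view of decoding as a search can be written compactly. As a sketch in conventional speech recognition notation (the symbols are introduced here for illustration and do not appear in the talk), let $O$ be the observed surface form, $S$ a candidate underlying representation, $A$ the acoustic evidence and $W$ a candidate word sequence:

```latex
% Decoding as search: choose the underlying representation most likely
% to have given rise to the observed surface form.
\hat{S} = \arg\max_{S} P(O \mid S)

% The usual formulation in large vocabulary continuous speech recognition:
% an acoustic model P(A | W) weighted by a statistical language model P(W).
\hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\, P(W)
```

The second line follows from Bayes' rule, since $P(A)$ does not depend on $W$. The factor $P(W)$ is the statistical language model that rates how predictable each word sequence is.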

Return to robot brain

"To finish I would like to return to the thought experiment I posed at the beginning. Here you are trapped inside the robot brain with thousands of unlabelled lights and buttons. I hope you can see now the importance of having a learning system that can operate from no preconceptions about the world. A hierarchical prediction system attached to the lights would automatically discover the objects in the real world that gave rise to the raw sense data. The perceptual system would find a low-dimensionality explanation for the multitude of input data. I have not described how such an explanation may be of use in deciding on action, and we may need hard-wired states representing pleasure and pain to find appropriate actions. All that I might suggest is that we would need a learning mechanism to group individual muscle actions into larger gestures based on a prediction of their effect on our perceived model of the world."


"In this talk I have tried to show that perception can be framed as a machine learning problem where mental objects are simply the learned states of a predictor. I hope you see that this hypothesis explains why perception gives rise to mental objects and how it works. Furthermore I have suggested that prediction operates simultaneously at a number of levels, and that hierarchical prediction technology is coming closer to implementation. Finally, hierarchical prediction challenges conventional views of speech decoding and language acquisition.

Thank you."


Couldn't perceptual processing be innate?

In my discussions I have deliberately oversimplified the issue of innateness because I am not sure that innateness is any kind of solution to the problems that I want to solve. I am willing to accept that some of the lights in the robot brain actually relate to the activity of groups of other lights, and that some buttons are really just combinations of other buttons. I would be quite happy to see some of the prediction mechanism hard-wired, in just the same way as the visual cortex seems to be hard-wired to the retina. So the problem for me is not innateness but the engineering of the solution to the problem of perception. I am happy for something to be innate providing it still works by information processing. The choice is not tabula rasa vs. built-in, but information processing vs. magic.

What is the relationship between the perceptual and the conceptual system?

I have been deliberately vague about how the mental objects delivered by the perceptual system interface to the conceptual system. This is because I haven't even started to think this through. Presumably the conceptual system is able to label the objects and access information through those labels. Perhaps we can draw parallels between the two systems, and ultimately frame the conceptual system also within a predictive framework. The most significant issue for me at the moment is the extent to which reinforcement learning from the conceptual system can influence the development of the objects detected by the perceptual system. For example, we know that babies lose the ability to discriminate speech sounds that don't happen to be used to differentiate lexical items in their language. What we don't know is whether this change is motivated by the conceptual system or is just a statistical regularity of the speech data observed by the child.


Below are a few articles that have been the greatest influence on the ideas presented here.

  • Brooks, R., "Intelligence Without Representation", Artificial Intelligence Journal (47), 1991, pp. 139-159.
  • Craik, K., "The Nature of Explanation", Cambridge University Press, 1943. [out of print, but here is the important Chapter 5].
  • Dennett, D., "Brainstorms: philosophical essays on mind and psychology", Penguin Books, 1997, ISBN: 0140258000.
  • Dorffner, G., "Radical Connectionism - A Neural Bottom-Up Approach to AI", in Dorffner, G. (ed.), Neural Networks and a New AI, International Thomson Computer Press, London, 1997, ISBN: 1850321728.


Please send constructive criticism and offers of help to


All intellectual property on this page belongs to Mark Huckvale (© 2002 Mark Huckvale University College London).

© 2002 Mark Huckvale University College London December 2002