From: Computational Linguistics for Speech and Handwriting Recognition, AISB Workshop, Leeds University, April 1994.
Since speaking and writing are both forms of linguistic communication, and since both share the idiosyncrasies of human production, fertile analogies may be drawn between the two in the fields of generation, perception and the encoding of messages. Both forms of communication exhibit enormous variability, contextual effects, and speaker/writer dependence. Thus it is not surprising that methods for automatic speech recognition and automatic writing recognition have many common aspects, both in terms of their architectures and their performance. This paper discusses the similarities of contemporary speech and writing recognition systems in terms of their organisation and weaknesses.
The paper first argues for a much deeper analogy between speaking and writing than is normally considered - specifically an analogy between common aspects of phonological realisation - and this discussion leads naturally to ways in which speech recognition might benefit from writing recognition ideas and vice versa.
The paper then discusses three areas of recognition: (i) the use of delayed decision making in utterance recognition, (ii) the use of linear segment-in-context models for analysing variability, and (iii) the domain of contextual influences.
The paper concludes by tying these themes together in a plea to
study how the speech and writing material we attempt to recognise
actually functions as communication.
2. Points of contact
We note some points of contact between speaking and writing:
On-line and off-line recognition: Both speech recognisers
and writing recognisers operate on the basis of an acquired signal:
for speaking we cannot normally acquire the articulation and
so must rely on the sound pressure signal; while for writing we
have the choice of recording the pen-tip movement (on-line) or
the visual consequences (off-line). Interestingly, the use of
the pen-tip signal rather than the image makes an enormous difference
to recognition accuracy in the writer-dependent (cf. speaker-dependent) case (Nouboud & Plamondon 1990; Taxt & Olafsdottir 1990). If the analogy holds, speech recognition from articulatograph
1990). If the analogy holds, speech recognition from articulatograph
input should give good speaker-dependent recognition since speakers
might be expected to use a single system of articulation for phonological
events. But, just as in handwriting, on-line recognition would
probably not perform so well in the speaker-independent case.
Normalisation of Signal: For further processing, the sound signal, the pen-tip signal or the visual image needs to be normalised: for speech this is done by considering the short-time amplitude spectrum instead of the waveform; in writing, by normalising for size, rotation and line width. The choice of normalised representations
can be associated with the processes of self-monitoring in the
production. The short-time amplitude spectrum is more useful than
the waveform, because the ear is insensitive to waveform shape,
and thus the production goal is to get the spectral content correct
rather than the waveform. Similarly, writing production aims to
get the shape correct rather than size or rotation or line-width.
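These two normalisations can be sketched in outline (an illustrative Python fragment, not from the original paper; the frame length, hop and the particular trace normalisation are arbitrary choices):

```python
import numpy as np

def short_time_spectrum(signal, frame_len=256, hop=128):
    """Short-time amplitude spectrum: discard the waveform phase,
    to which the ear is insensitive, and keep the spectral magnitudes."""
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append(np.abs(np.fft.rfft(signal[start:start + frame_len] * window)))
    return np.array(frames)

def normalise_trace(xy):
    """Normalise an on-line pen-tip trace for position and size,
    leaving only shape."""
    xy = np.asarray(xy, dtype=float)
    xy = xy - xy.mean(axis=0)               # remove position
    scale = np.abs(xy).max()
    return xy / scale if scale > 0 else xy  # remove size
```

In both cases the representation discards exactly those properties that production does not aim to control.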
Realisation of segmented underlying form: In both speech
and writing, co-ordinated and exquisitely-timed muscular control
is effected to produce an articulation. A common 'phonological'
view is that speech and writing are the realisations of a linear
segmented underlying representation of the message: phonemes
or letters. The realisation of such abstract units by articulation
alters them from having discrete features, values and positions
into a continuous physical signal. The result is that information
about the underlying form is smeared through the signal, so that
a single measurement of the signal has been influenced by a number
of underlying units. Conversely information about the identity
of a given underlying unit is obtained from a range of times.
Recognition of underlying units: The physical processes
of realisation in both speech and writing have consequences for
the generated form of the abstract unit according to context,
according to speaker/writer, according to different environments
and on different occasions. A recogniser must model the nature
of the variability of these realisations to be able to determine
the likelihood that a given signal could have a given underlying
transcription. The use of phone-in-context and letter-in-context
models for recognition shows how current systems make a direct
link between physical shape and a very limited phonological context.
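A direct way to see what such models capture is to expand a transcription into unit-in-context triples, one model per (left neighbour, unit, right neighbour). This is a sketch only, with '#' as a hypothetical boundary marker:

```python
from collections import defaultdict

def units_in_context(transcription, pad="#"):
    """Expand a linear transcription into (left, unit, right) triples,
    the limited context used by phone-in-context and
    letter-in-context models."""
    seq = [pad] + list(transcription) + [pad]
    return [(seq[i - 1], seq[i], seq[i + 1]) for i in range(1, len(seq) - 1)]

def context_inventory(corpus):
    """Count how often each unit occurs in each immediate context,
    i.e. how many context-dependent models the data can support."""
    counts = defaultdict(int)
    for word in corpus:
        for triple in units_in_context(word):
            counts[triple] += 1
    return counts
```

For example, units_in_context("bed") yields the three triples ('#','b','e'), ('b','e','d') and ('e','d','#'); each would be modelled independently, however much of the variation they share.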
Contextual variants: As well as associating letters with phonemes - abstract entities used to differentiate lexical items -
we might also associate contextual variants as allographs/allophones,
or realisations as graphs/phones. We might push this one step
further by considering the pattern elements in both a letter and
a spectrographic representation of a phone. If speech perception
results hold analogously, the perception of a letter comes from
a combination of perceptions of these pattern elements, and trading relations exist by which the influence of one cue is balanced against that of another.
Theories of perception: The equivalent to the motor theory
of speech perception - which maintains that we overcome the inherent
variability in the sound signal by a mental reference to the means
of production - would be a motor theory of writing perception,
in which readers inferred an underlying pen manipulation which
was more invariant than the visual form. Perhaps in this form
the motor theory is seriously undermined.
Supra-segmental similarities: There are also interesting
analogies above the level of the segment: volume with size, syllables
with letter groups, tone groups with lines, rhythm with spacing.
Note, however, that for both domains these are just the properties
of the signal which are normalised out from the input at an early
stage in recognition. The normal justification for this is that
supra-segmental properties do not predict segmental shape; however, see section 5.
Modelling techniques: Contemporary recognition systems
in both areas are using neural-net and HMM models of linear phonological
units in context trained on large quantities of reference material.
Both systems use language models to provide sequential constraints
in recognition, although writing systems tend to use letter-sequence
models, while speech systems use only word-sequence models. The justification for this difference is that writing recognition must cope with an unlimited vocabulary.
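A letter-sequence model of the kind used in writing systems can be sketched as a smoothed letter bigram; unlike a word-sequence model, it assigns a probability to any letter string, hence the open vocabulary (illustrative Python; the class name and add-one smoothing are my own choices):

```python
import math
from collections import defaultdict

class LetterBigram:
    """Letter-sequence language model: sequential constraints that,
    unlike a word-sequence model, need no fixed vocabulary."""
    def __init__(self, words, alpha=1.0):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.alpha = alpha                 # add-alpha smoothing constant
        self.alphabet = set("#")
        for w in words:
            seq = "#" + w + "#"            # '#' marks word boundaries
            self.alphabet.update(seq)
            for a, b in zip(seq, seq[1:]):
                self.counts[a][b] += 1

    def logprob(self, word):
        """Log probability of a letter string, defined even for
        letter sequences never seen in training."""
        seq = "#" + word + "#"
        v = len(self.alphabet)
        lp = 0.0
        for a, b in zip(seq, seq[1:]):
            row = self.counts[a]
            total = sum(row.values())
            lp += math.log((row[b] + self.alpha) / (total + self.alpha * v))
        return lp
```

Trained on English text, such a model prefers plausible letter sequences over implausible ones without ever consulting a lexicon.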
From these points of contact we see that it is quite justifiable
to relate the nature of the recognition problem and the architecture
for recognition across speech and writing.
3. Delayed decision making
The single most important advance in automatic speech recognition is often said to be the introduction of dynamic programming (DP) - a particularly efficient method for solving some tricky graph-search problems. The introduction of DP not only allowed non-linear methods of time-alignment of signals and fast methods for HMM recognition, but heralded a conceptual change in recognition architecture from bottom-up to top-down. No longer were speech recognition systems constructed as a series of transformational tasks in which sounds were converted to phones, phones grouped into words, and words into sentences. Instead, syntactic, lexical and phonological information came to be 'compiled' into a complete production model for every allowable sentence. Recognition then proceeded by finding the single best input to this model that matched an unknown signal; and this was only feasible with DP.
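The time-alignment use of DP can be illustrated with a minimal dynamic time warping sketch (Python, with one-dimensional features for brevity; real systems align frame vectors of spectral parameters):

```python
import numpy as np

def dtw(a, b):
    """Dynamic-programming time alignment: minimum total distance
    between two sequences under a non-linear warp of the time axis."""
    n, m = len(a), len(b)
    # D[i, j] is the best cost of aligning a[:i] with b[:j]
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

The same recursion, run over a network compiled from all allowable sentences, is what makes the top-down search feasible.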
Given this approach in speech, it is therefore quite surprising
that writing recognition systems still put so much weight on the
bottom-up recognition of letters. Take the example in Figure 1; here the word 'course' cannot possibly be recognised on a letter-by-letter
basis. It may be recognised only when placed in a much larger
context of about 2 words on either side. While it is clear
how this particular realisation of the word 'course' arose, how
the 'ou' letters have been formed, how the 'rs' letters have been
coalesced, this is post-hoc rationalisation once the word
identity has been established. Similarly in continuous speech recognition systems: the recognised phone sequence is established after the best-fitting sentence is found - not as a preparatory step.
Figure 1. Difficult example for linear segment model
This concept of 'delayed decision making' has affected our views of the speech recognition problem profoundly. Early decision making has been shown to have serious consequences for recognition performance, since given signal events always have more than one interpretation. To pass up alternatives for resolution at a higher level invites a problem of combinatorial explosion: too many competing hypotheses without information adequate for disambiguation. The delaying of decisions, combined with a statistical framework which allows probabilities arising from observations to be combined with probabilities arising from sequential constraints, produces a much more effective result - but at the expense of losing the exact association between data and interpretation. A writing recognition system built on these principles could not indicate which part of the writing is which recognised letter. This may or may not be appropriate for writing recognition applications.
4. Use of segments to model physical realisation
In both speaking and writing we are concerned with linguistic communication - an encoding of a message using the richness of human language. An important element of that communication is the separation of the meaning of the message from the patterns used to encode it. Neither phonemes nor letters have meaning of themselves, and hence neither do the sounds or graphical shapes.
The phonological regularities apparent in both pronunciation and spelling demonstrate that one aspect of the arbitrary mapping between meaning and production is between lexical entries and phonemes; that the mapping arises in the mental lexicon. That this seems obvious demonstrates the pervasiveness of the linear segmented view of pronunciation and spelling. For it needn't be so; the arbitrary mapping could be directly from meaning to sound or shape. The word 'bed' may map to the sound "bed" and only by phoneticians to the transcription /bed/. The phonological regularities observed in production could arise from some limited perceptual capability rather than from the structure of the lexicon (maybe we can only see letters and hear phonemes). However the view that production is controlled by linear segments leads to our current recognition systems that are predicated on phone/letter models influenced by phone/letter context.
That such a model is inadequate is widely known, although very few efforts have been made to extend the recognition architecture to richer models of the meaning-to-sound mapping. There are two aspects to the improvements required: firstly to recognise the influence of the production system on the nature of variability; secondly to recognise the importance of supra-segmental information.
The influence of the production system is that the sound/writing stream is not a sequence of discrete events localised in time/space. In phonetics we are happy to talk about events such as 'bilabial closure' or 'lateral release' as if they can be isolated in the signal. In writing we talk of graphs being made up from a number of 'strokes' - pen-tip gestures. In neither case is there good evidence that the signal can be decomposed into such events. In the data that we collect, these elemental constructs - if they ever existed at all - have merged into a stream that is continuous in time and value. What we appear to be doing is interpreting the signal with respect to a model of production which imposes a discrete view: 'if this is the data, then it must have come from this individual event'. While this is a perfectly reasonable approach to form a descriptive model of the data, it shouldn't be pushed into service as a model for the production mechanism itself without further evidence. Thus a phonemic analysis of pronunciation doesn't imply that phonemes are used in the control of the articulators. Similarly, a letter-by-letter analysis of handwriting may not be a good model of manipulator movement (cf. Figure 1).
The deficiencies of a linear view in recognition are that we miss regularities in variability that are shared by a number of phones/letters, while at the same time limiting contexts narrowly to immediate neighbours. Huckvale (1993a, 1993b) describes how a tiered phonological representation goes some way to address these issues.
The second aspect: the importance of supra-segmental information,
is that the speaker/writer uses prosody to direct the listener/reader
how to set about decoding the utterance: how to break the input
up into manageable units, how to separate old information from
new, how (probably) to disambiguate between different syllabifications
and word boundary possibilities. What is important for contemporary
recognition systems is how the supra-segmental aspects affect
the segmental realisation.
5. Information loading
The relationship between 'communicative load' and quality of production is well known. Lieberman (1963) showed that words are given longer and more intelligible pronunciations when they occur in contexts which do not predict them ('the word which you are about to hear is nine').
Thus one reason we have difficulty identifying phones/letters is that the production mechanism appears to have an ambivalent attitude towards presenting clear realisations: letters/phones are clear where they need to be according to the decoding scheme assumed of the listener by the speaker. Otherwise the production is merely sufficient to do the communicative job reasonably reliably. The consequences are that the transcription model of the signal fails to bridge the gap between signal and lexicon. Sufficient information for discrimination of the lexical entries is present in the signal given the context, but a phone sequence mis-represents its phonetic content. By forcing the input to be a sequence of phones/letters we force the representation to be in error; the consequences of error are then spread throughout the recognition system at all levels: more word candidates, more word classes, more syntactic constituents, more interpretations.
The word 'course' can be identified in Figure 1 because we don't
first process signals into transcription and then perform lexical
access. Instead lexical possibilities constrain the information
that needs to be extracted. Writing shows us that we might consider
a recognition system for speech in which phonology is used to
describe the organisation of the lexicon, and hence how choices
between words at different junctures in a sentence are the actual
arbiters of phonetic measurement (Huckvale, 1990).
6. Conclusions
In this paper I have tried firstly to show that it is legitimate to discuss together the problems of speech recognition and writing recognition; in section 2, I have outlined some of the many points of contact between the production and the recognition of spoken and written utterances.
Secondly I have introduced three observations common to speech recognition and writing recognition: (i) that decisions must not be made too early, (ii) that linear segmentation is a rather crude implementation of phonology for recognition, and (iii) that the realisation of segments depends on their communicative load not just on neighbouring segments.
I believe that there is an important single lesson that can be drawn from these observations. The delaying of early decisions is useful because it does not make explicit a low-level feature/segment representation - whatever scheme is chosen (HMMs of phones, for example), the modelling is weak enough to allow the influence of higher-level knowledge. The signal is not recognised as a segment sequence; rather, knowledge of production variation is modelled with segments. What is recognised is always the whole utterance. But if segments are being used simply as models of pronunciation variation, then they are doing quite a bad job. The variation of a phone/letter is dependent on a wider context than simply its neighbours, and on the characteristics of the production system. If we seek a model of variability then we need to consider prosody and production. But prosody is not some arbitrary interference on the segment string - it is a demonstration of the underlying organisation of the message; it provides essential cues to the recovery of meaning.
The key to all this is to observe that utterances have been produced for a reason. The difficulty we have in recognising segment sequences arises because we ignore this. Somehow the conception has arisen that variation in realisation due to the production system or to prosody is some kind of 'noise' that hides the true segmental string - whereas these aspects are just the opposite: vital clues as to how the utterance should be interpreted. Our modelling of productions should take into account the communicative function at that stage in the interaction with the machine, not just how a segment depends on its neighbours. To model communicative function means that we need to study why as well as how utterances were produced. This knowledge can be incorporated into recognition just as word syntax is used currently.
The missing link in existing speech and writing recognition systems
is just the concept that the speaker/writer wants to help the
listener/reader. Producers of utterances know they must obey pragmatic
rules of quality and relevance for communication to be possible.
These rules, far from perverting the pure segmental model of production,
enhance an inherently variable system with a larger scale structure
directly linked to the utterance intent.
Acknowledgement
The author is grateful to Wendy Holmes for constructive criticisms of an earlier draft of this paper.
References
M. A. Huckvale (1990), The exploitation of speech knowledge in neural nets for recognition, Speech Communication, p1.
M. A. Huckvale (1992), Illustrating Speech: Analogies between speaking and writing, in Speech, Hearing and Language - Work in Progress 6, Phonetics and Linguistics, University College London.
M. A. Huckvale (1993a), Tiered segmentation of speech, in Speech, Hearing and Language - Work in Progress 7, Phonetics and Linguistics, University College London.
M. A. Huckvale (1993b), The benefits of tiered segmentation of speech for the recognition of phonetic properties, Eurospeech-93.
P. Lieberman (1963), Some effects of semantic and grammatical context on the production and perception of speech, Language and Speech 6, pp172-175.
F. Nouboud, R. Plamondon (1990), On-line recognition of handprinted characters: survey and beta tests, Pattern Recognition.
T. Taxt, J. B. Olafsdottir (1990), Recognition of handwritten symbols, Pattern Recognition, 23, pp1155-1166.