Why are machines less proficient than humans at recognising words?

Mark Huckvale
University College London
May 2003

In this short essay I would like to discuss some ideas about why the ability of a machine to recognise speech at the phonetic level is so inferior to the ability of a human. I'll try to catalogue some possible causes, then give my own opinion about which are most important.

1. The task

In this essay I will concentrate on the recognition of spoken words in isolation. The task is this: a person is given a printed word to read, and the audio of their production is captured and presented to a machine. The machine has to identify the printed word. Certain simplifying assumptions are usually made: the word is relatively common (say, within the top 65,000 words of the language), and the audio signal captures sufficient information for a human listener to recognise the word with high accuracy. Performance is assessed by measuring the percentage of words correct over many words, perhaps spoken by many different speakers. We compare the accuracy of a machine to that of a human being performing the same task from the same audio signal.
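
As a minimal sketch of the scoring just described, the following Python fragment computes the percentage of words correct for a machine and for a human listener against the prompted words; the word lists here are invented purely for illustration.

    def percent_correct(reference, recognised):
        """Percentage of words recognised correctly against the prompted words."""
        assert len(reference) == len(recognised)
        hits = sum(1 for ref, rec in zip(reference, recognised) if ref == rec)
        return 100.0 * hits / len(reference)

    prompts  = ["ship", "sheep", "chip", "cheap"]   # words given to the speaker
    machine  = ["ship", "ship",  "chip", "cheap"]   # machine output
    listener = ["ship", "sheep", "chip", "cheap"]   # human listener's judgement

    machine_error = 100.0 - percent_correct(prompts, machine)
    human_error   = 100.0 - percent_correct(prompts, listener)
    print(machine_error, human_error)               # error rates to be compared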

The use of isolated words allows me to discuss the acoustic modelling of speech without being involved in issues about grammar, meaning or dialogue. This is not to say that such issues do not impact the problem of building speech understanding systems, but for now I would rather believe that the conventional disassociation of language model and acoustic model found in speech recognition systems is actually close to the truth. In any case we can do isolated word recognition experiments with humans and machines, where no sentence context is available, and we still find that humans are significantly better than machines. If we address this discrepancy, it is likely that there will be beneficial consequences for more general tasks.

2. The scale of the problem

There are not that many studies that make fair comparisons between humans and machines. For example, humans can recognise perhaps several hundred thousand word types, whereas most machines are explicitly designed with more limited vocabularies. If machines are tested only on words from their limited vocabulary then their performance will be inflated. Thus the performance of a machine on, say, digit recognition doesn't really address any significant general issues about the differences between human and machine recognition, since human recognisers do not have the same limitations.

Human listeners can also have access to other information about the task that is unavailable to the machine. For example, the listener may know that all the words in the test come from the same person; or that all words are spoken in the same accent; or that all words were recorded in the same acoustic environment. It may be that the machine has not been allowed to use those constraints and can only assume that each word might have come from a different speaker, or be produced in a different accent, or recorded in a different environment.

Also, machines tend to be tested against a transcription of what the speaker was asked to say, rather than what a human being might recognise. This is a particular problem when there is no single consensus across listeners as to the right answer. Given a production /pO:/, is the right answer "pore", "paw", "poor" or "pour"? This problem is endemic to phonetic segment recognition, where inter-annotator discrepancies can contribute significantly to the error rate. One might propose that the machine is right if it matches any of the human listener judgements, rather than the label of the word given to the speaker.

Conversely, machine performance can be inflated by too close a match between the data used for training the recogniser and the data used for testing. A large degree of overlap means that generalisation performance is not being properly evaluated. Ideally test data should be of words, speakers and environments not present in the training data.

With all these caveats in mind, then, what is the scale of the problem? Lippmann (1997) gives us some comparisons between the error rates of machines and humans on a range of tasks. On clean digits, machines (0.72% error) were about 7 times worse than humans (0.1% error). On alphabetic letters, machines (5% error) were about 3 times worse than humans (1.6% error). On a 1000-word vocabulary task, machines (17% error) were about 8 times worse than humans (2% error). On a 5000-word task in quiet, machines (7.2% error) were about 8 times worse than humans (0.9% error), but at a signal-to-noise ratio of 10 dB, machines (12.8% error) were about 11 times worse than humans (1.1% error). On phrases extracted from spontaneous speech, machines (43% error) were 11 times worse than humans (4% error).

Roughly, then, machine error rates are about an order of magnitude worse than human error rates, even when the machines are trained and tested on similar materials. I'll next try to catalogue possible causes of this discrepancy in performance.

3. General classes of possible causes

3.1 Signal capture. For clean speech, a monaural audio signal would seem to be perfectly adequate for both humans and machines. It is likely that digital signal capture and analysis are significantly superior to the ear, particularly with regard to frequency response, signal-to-noise ratio and linearity. But for speech spoken in a noisy place, humans can rely on two other sources of information denied to machines. Firstly, humans have two ears and a system for swivelling the head; this allows them to use information about direction to improve the signal-to-noise ratio. Secondly, humans can see the person speaking, and the visual impression of the face also provides information useful for word recognition.

3.2 Acoustic feature vectors. Most speech recognition systems operate solely on spectral envelope features, commonly cepstral coefficients calculated from frequency-warped spectra. Such features have been found by trial and error to be most effective for recognition. However all kinds of questions remain: what role is played by other aspects of the signal, particularly its harmonic and fine temporal structure? Fine detail in the signal is being treated as uninformative noise, when with the right parameterisation it might make significant contributions to segmental intelligibility. How best should we model the dynamics of the signal? We treat frames of acoustic features largely as piecewise stationary observations, whereas we know the ear is particularly sensitive to transients in time and frequency. How does choice of signal representation interact with the training of statistical models of the realisation of phonological elements? Our acoustic feature vectors are very small and designed to contain statistically independent values so that our spectral distance metric can assume a diagonal covariance; but this by itself does not guarantee adequate discrimination between phonetic categories. To what extent should the acoustic feature vectors be optimised for a particular language? We know that human listeners adapt their perception to improve the discriminability of phonological classes. How should the system deal with noisy signals? Should environmental noise be taken as just random variability in the same way as contextual variation, or should we try to recognise the characteristics of the noise as well as the speech? Should the acoustic feature representation be different for different speakers? Is adaptation to vocal tract length or voice quality a property of the low-level signal representation, and if not how can these systematic changes be modelled elsewhere?
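
To make the standard front end concrete, here is a minimal Python sketch of cepstral coefficients computed from a frequency-warped (mel) spectrum of a single frame. The frame length, filterbank size and number of coefficients are arbitrary illustrative choices rather than the settings of any particular recogniser, and the input frame is just random noise standing in for windowed speech.

    import numpy as np

    def mel(f):
        # Hz -> mel frequency warping
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_inv(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_filterbank(n_filters, n_fft, sample_rate):
        # Triangular filters spaced evenly on the mel scale
        edges = mel_inv(np.linspace(mel(0.0), mel(sample_rate / 2.0), n_filters + 2))
        bins = np.floor((n_fft + 1) * edges / sample_rate).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(n_filters):
            left, centre, right = bins[i], bins[i + 1], bins[i + 2]
            for b in range(left, centre):
                fbank[i, b] = (b - left) / max(centre - left, 1)
            for b in range(centre, right):
                fbank[i, b] = (right - b) / max(right - centre, 1)
        return fbank

    def cepstra(frame, sample_rate=16000, n_fft=512, n_filters=26, n_ceps=12):
        # Power spectrum of one windowed frame
        spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
        # Log energy in each frequency-warped band
        fbank = mel_filterbank(n_filters, n_fft, sample_rate)
        log_energies = np.log(fbank @ spectrum + 1e-10)
        # Cosine transform gives (near-)decorrelated cepstral coefficients
        n = np.arange(n_filters)
        basis = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
        return basis @ log_energies

    frame = np.random.randn(400)   # stand-in for 25 ms of speech at 16 kHz
    print(cepstra(frame))          # 12 cepstral coefficients for this frame

Everything outside the spectral envelope captured here (harmonic structure, fine temporal detail) is simply discarded, which is the point at issue above.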

3.3 Acoustic-phonetic modelling. Most speech recognition systems use a syntactic pattern recognition component to map between acoustic feature vectors and phonetic labels. The two main technologies are Hidden Markov Models (HMMs) and Recurrent Neural Networks (RNNs). It is possible to unify these two approaches by viewing phonetic labelling as a two-stage process. In the first stage, the probability that each input frame could have been generated by each possible phonetic unit is calculated. This is the forward probability table. In HMMs the table is calculated for each state within each phonetic segment model, while for RNNs the table is calculated for each phonetic segment. Experience seems to show that it is very useful for such models to deliver probabilities rather than abstract scores: this allows the bottom-up evidence collected from the signal to be combined with top-down probabilistic constraints provided by the utterance context. In the second stage, a sequential decoder finds the best path through the forward probability table such that each input frame occurs once and the product of the probabilities along the path is maximised. The path is also constrained by state-to-state transitions which may themselves have probabilities associated with them. The phonetic transcription can then be read off the best path, a phonetic lattice of good paths can be generated, or lexical search can be combined with the calculation of the best path (using the Viterbi algorithm) by adjusting the phonetic segment transition probabilities according to whether the previous segments on a path constitute part of a known word.
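
A minimal Python sketch of the second, decoding stage just described: given a forward probability table of per-frame, per-unit probabilities and a matrix of unit-to-unit transition probabilities (all numbers invented for illustration), the Viterbi recursion finds the single best path in the log domain.

    import numpy as np

    def viterbi(frame_probs, trans_probs, initial_probs):
        """Best unit sequence through a table of per-frame unit probabilities.

        frame_probs[t, j] : probability that frame t was generated by unit j
        trans_probs[i, j] : probability of moving from unit i to unit j
        initial_probs[j]  : probability of starting in unit j
        """
        n_frames, n_units = frame_probs.shape
        log_delta = np.log(initial_probs) + np.log(frame_probs[0])
        backptr = np.zeros((n_frames, n_units), dtype=int)
        for t in range(1, n_frames):
            scores = log_delta[:, None] + np.log(trans_probs)   # all i -> j moves
            backptr[t] = scores.argmax(axis=0)                  # best predecessor of j
            log_delta = scores.max(axis=0) + np.log(frame_probs[t])
        # trace the best path back from the final frame
        path = [int(log_delta.argmax())]
        for t in range(n_frames - 1, 0, -1):
            path.append(int(backptr[t, path[-1]]))
        return list(reversed(path)), float(log_delta.max())

    # Toy example: 5 frames, 3 phonetic units
    frame_probs = np.array([[0.7, 0.2, 0.1],
                            [0.6, 0.3, 0.1],
                            [0.1, 0.8, 0.1],
                            [0.1, 0.7, 0.2],
                            [0.1, 0.2, 0.7]])
    trans_probs = np.array([[0.6, 0.3, 0.1],
                            [0.1, 0.6, 0.3],
                            [0.1, 0.1, 0.8]])
    print(viterbi(frame_probs, trans_probs, np.array([0.8, 0.1, 0.1])))

In a full recogniser the transition probabilities would additionally be adjusted during the search according to whether a path's segment history forms part of a known word, as noted above.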

There are many possible weaknesses in acoustic-phonetic modelling. How, for example, should one choose the set of phonetic units? Too many units may mean poor estimates of the probability that a frame would be generated by a model; too few units may mean that information useful for discriminating words has been lost. A common approach is to generate a large number of possible phonetic units (often 'triphones' – models of a phonological segment in the context of a left and right phonological segment) and then reduce their number by pooling data across units, state-tying and clustering. Also, how is the probability of a segment related to the probability of its constituent frames? The assumption that frames are independent observations must be incorrect, since the rate of spectral change is limited by the dynamics of the articulators, so adjacent frames are in fact highly correlated. The phonetic units themselves may not lead to simple acoustic feature distributions; it is likely that the context in which a phonological unit is realised will significantly affect its realisation, such that acoustic feature distributions will be neither context independent nor normally distributed. Contextual dependency could be considered advantageous to recognition, in that information about the identity of a segment is distributed around its temporal centre, overlapping with information identifying adjacent segments. Context dependency should also provide constraints on possible combinations of segments, since only certain acoustic realisations occur in certain environments. Machine phonetic models tend to be trained with speech from multiple speakers, so these models are unable to exploit correlations across segments caused by common speaker characteristics: for example vocal tract length or spectral tilt. Speaker adaptation can shift the average realisation of a segment towards the speaker's mean, but no knowledge is available to shift realisations of segments that have not been observed.

3.4 Phonological-phonetic modelling. The sequence of recognised phonetic units is assumed to be a realisation of the underlying phonological structure of each word. This mapping between phonological units and phonetic units is typically very simple in machine recognition. Often phonetic units are nothing more than "phone-in-context" models, meaning that each phonetic unit is chosen only on the basis of the adjacent phonological segments. For example, to model the vowel in /pIt/, a phonetic unit [I#p_t] is chosen, that is, the type of [I] that occurs in the context /p_t/. A different phonetic unit might be chosen for the vowel in /bIt/, namely [I#b_t]. This type of analysis can certainly lead to confusion, for example where the phonetic unit [I#p_t] is also used to model the vowel in /spIt/, where a phonetician might choose [I#b_t] since the plosive is typically unaspirated in that context. In general the phonological-phonetic modelling in machine recognition is often just a table of substitutions with no alternatives or substitution probabilities. Typically there are no mechanisms to account for the simplifications that occur when words are put into whole utterances, for example assimilation, elision, epenthesis, vowel reduction or lenition. Also, the whole approach is very segmental: phonetic variation is related to the segmental phonological context and not to the syllabic structure, prosodic environment or lexical identity.
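
To illustrate how shallow this mapping typically is, here is a minimal Python sketch of a table-driven phone-in-context substitution over a hypothetical two-word lexicon (SAMPA notation), with no alternatives and no probabilities. Note that it assigns the same unit [I#p_t] to the vowels of "pit" and "spit", exactly the confusion described above.

    # Hypothetical lexicon: each word as a sequence of phonological segments (SAMPA)
    lexicon = {
        "pit":  ["p", "I", "t"],
        "spit": ["s", "p", "I", "t"],
    }

    def phones_in_context(segments):
        """Replace each phonological segment by a 'phone-in-context' unit
        chosen only on the basis of the adjacent segments."""
        padded = ["#"] + segments + ["#"]          # '#' marks a word boundary
        units = []
        for left, seg, right in zip(padded, padded[1:], padded[2:]):
            units.append(f"{seg}#{left}_{right}")  # e.g. the [I] that occurs in /p_t/
        return units

    for word, segs in lexicon.items():
        print(word, phones_in_context(segs))
    # "pit"  -> ['p##_I', 'I#p_t', 't#I_#']
    # "spit" -> ['s##_p', 'p#s_I', 'I#p_t', 't#I_#']
    # Both words receive the unit 'I#p_t', even though in "spit" the plosive is
    # unaspirated and a phonetician might prefer something closer to 'I#b_t'.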

3.5 Phonological structure of words. It is very important to acknowledge that words have some internal structure, because recognition would be far worse if words were treated as independent, unanalysable noises. However, the usual procedure by which an appropriate phonological structure is chosen may not deliver the best machine recognition. The phonological structure of words used in most machine recognition is based on a kind of segmental phonemic analysis, which in turn is based on the idea of contrast. The word "bit" can be shown to have three logical contrastive units because of the existence of words such as "pit", "bat" and "bid". Clearly, if it were possible to first segment an incoming word into these three independent segments, then the best way to recognise the word would be to build a discriminative classifier to tell apart [b] from [p], [I] from [{] and [t] from [d]. However, initial segmentation is not possible, and one must combine segmentation and classification using something like an HMM. But HMMs do not act discriminatively. Instead HMMs build a probability distribution of the realisation of each phonetic unit independently. Contrast between [b] and [p] only arises if the respective distributions are different and deliver different probabilities; but the acoustic realisations of [b] and [p] have much in common, which means that a lot of the trained probabilities in the two models do not contribute to the contrast. Thus we come back to the original criticism: phonological units are selected to contrast words, whereas the actual acoustic models are models of variability. Perhaps we need a phonological analysis which explicitly sets out to describe the variability of words rather than to represent the necessary contrasts in a parsimonious manner. A short example: to model [tr] in words like "tree" it might be better to build a separate model for the affricated combination [tr] than to try to build it from models of [t] and [r].
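
To illustrate the non-discriminative character of this kind of modelling, here is a minimal Python sketch in which [b] and [p] are each fitted with their own Gaussian over a single invented acoustic measure (standing in, say, for voice onset time). The contrast between the two units arises only from the difference of their log likelihoods, so regions where the two trained distributions agree contribute little to telling the words apart.

    import numpy as np

    # Invented 1-D "acoustic" observations for realisations of [b] and [p]
    rng = np.random.default_rng(0)
    b_tokens = rng.normal(loc=10.0, scale=8.0, size=200)   # e.g. short voice onset times
    p_tokens = rng.normal(loc=50.0, scale=15.0, size=200)  # e.g. long voice onset times

    def fit_gaussian(x):
        # Each unit is modelled independently: just its own mean and variance
        return x.mean(), x.var()

    def log_likelihood(x, mean, var):
        return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

    b_model = fit_gaussian(b_tokens)
    p_model = fit_gaussian(p_tokens)

    # Classification uses only the *difference* of the two log likelihoods; where
    # the distributions overlap, the trained parameters add nothing to the contrast.
    for x in (5.0, 30.0, 60.0):
        contrast = log_likelihood(x, *b_model) - log_likelihood(x, *p_model)
        print(x, "->", "[b]" if contrast > 0 else "[p]")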

3.6 Pronunciation variability. The acoustic realisation of word types is highly variable across speaker, context, environment and repetition. The challenge for machines is to model the variability while maintaining adequate discrimination. The more the material to be recognised varies, the more machine recognition performance suffers. Recognition of multiple unknown speakers is worse than recognition of a single known speaker, and recognition of words from multiple accent groups is worse than recognition from a single accent group. Variability in utterance context, speaking style and speaking rate also causes problems, as do changes in recording environment and channel. The inability of systems to deal with such variation, even when the system can be given information about the nature of the variation, is a current weakness. In general, many sources of variability are lumped together rather than modelled separately. For example, speech from multiple speakers ends up in a single distribution, and speech recorded in one style and environment is used to train a recogniser that will then be used to recognise speech in a different style or environment. Although systems can often be given information about the speaker, the accent, the background noise or the utterance context, they seem poor at exploiting that information to change the way in which acoustic-phonetic recognition is performed. Too much variation is treated as noise rather than as useful, systematic, predictable variety.

3.7 Use of the lexicon. Machine recognition operates on sequential digital computers and adopts a search strategy designed to be efficient for single-processor calculation. We should not forget that the brain has a parallel architecture and may be able to use the entire knowledge it has of all words in the lexicon in parallel during decoding. Thus the phonological analysis we are forced to interpose between sounds and words in our recognisers may limit how well we can model acoustic variability. If, instead, phonological knowledge simply identified which words had common logical structure, then the words themselves could be used to recognise the signal. The activation of word hypotheses could operate directly from the signal, with no intermediate phonetic representation, but with higher-level links identifying which words share structure (see Huckvale 1990 for more details). The cross-word links would ensure the sharing of the knowledge of the variability of phonological units implicitly stored in the individual word models. The advantage of whole-lexicon decoding is that it allows acoustic variability to follow lexical neighbourhoods. That is, we would expect words with few lexical neighbours to be more variable than words with many neighbours (or, equivalently, more predictable words to be less precisely articulated than less predictable words). An interposed phonetic level of variability modelling, however, precludes these kinds of influences from having an effect on recognition.

4. Areas most in need of improvement

I have briefly outlined some ideas for the causes of the relatively poor performance of machine recognition of words. In this section I would like to give my own particular opinions about which of these are most significant and most in need of improvement.

4.1 Modelling of variability with segmental phonological units. HMMs are models of the distribution of the acoustic feature vectors over some time interval defined for a finite inventory of phonetic units. It is not clear that a set of phonetic units based on phonemes is ideal, nor that the segmental phonological pronunciation of a word found in a dictionary is a good way to model its pronunciation variability. I would prefer to see a data-driven approach that established: (i) what bottom-up acoustic components are relevant to describing speech, (ii) how those units vary across repetitions of individual words to establish a suitable phonetic framework for a word, (iii) how to predict the phonetic framework for a word given knowledge of the phonological structure of the word, the speaker, the accent, the style, the context, etc. I would suggest the bottom-up analysis should be driven by highly redundant acoustic descriptions, where the existence of correlations can be taken to be evidence for constraints in the underlying generative process.

4.2 Calculating the probability of a word from the probabilities of its phonetic elements. As part of a speech understanding system, the acoustic model needs to deliver good estimates of word probability given the acoustic evidence. It is not clear that the current approach, which calculates word probabilities by multiplying phone probabilities, provides good probability estimates. It seems likely to me that the phonetic variability seen within a word depends on the identity of the word. Such variability might be a consequence of general prosodic structure or of general lexical predictability. An example of a prosodic influence would be the observation that "library" can be safely pronounced /laIbrI/ with the elision of a whole syllable /r@/, while "remark" cannot be pronounced /mA:k/. An example of a predictability influence would be the significant phonetic reduction seen in function words. Thus elements of variability are not just properties of phonological segment sequences, and the probability of a word cannot simply be calculated from the probabilities of its segments.
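
A minimal Python sketch of the calculation being criticised, with invented segment probabilities: the word score is just a product of segment probabilities (a sum in the log domain), so nothing in it can express the fact that a particular reduction is acceptable in "library" but not in "remark".

    import math

    # Invented per-segment probabilities from an acoustic model (SAMPA segments)
    segment_log_probs = {
        "l": math.log(0.8), "aI": math.log(0.7), "b": math.log(0.6),
        "r": math.log(0.5), "I": math.log(0.7),  "@": math.log(0.4),
    }

    def word_log_prob(segments):
        # The standard assumption: a word's probability is the product of the
        # probabilities of its segments, independent of which word they belong to
        return sum(segment_log_probs[s] for s in segments)

    full    = ["l", "aI", "b", "r", "@", "r", "I"]   # /laIbr@rI/ "library"
    reduced = ["l", "aI", "b", "r", "I"]             # /laIbrI/ with /r@/ elided
    print(word_log_prob(full), word_log_prob(reduced))
    # Nothing in this calculation knows that the elision is acceptable for
    # "library" while the comparable reduction of "remark" to /mA:k/ is not.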

4.3 Modelling extraneous variability. The acoustic form of a word varies according to speaker, context, environment and occasion. We can choose to combine all these sources of variability and treat them as "unknown", but much could be gained by trying to model them. The implication is that word recognition should also include sub-systems for speaker-type recognition, speaking-style recognition, environment recognition, accent recognition, and so on. Recognition of a word then includes simultaneous recognition of its provenance. A human can say: I have heard the word "X" spoken by an adult male speaker with an RP accent in a read speaking style over a telephone. A machine that also knew or estimated these controlling factors would have the opportunity of determining how they affect the acoustic realisation of a word. In recognition, the chosen word would be the one that maximised the joint probability of the word, speaker-type, accent and environment categories.
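
A minimal Python sketch of the kind of combined decision being suggested, with invented scores and a deliberately naive independence assumption: the recogniser chooses the word jointly with the speaker-type, accent and channel hypotheses that together maximise the overall score.

    import itertools

    # Invented log scores for how well the observation matches each hypothesis;
    # in a real system these would come from acoustic models conditioned on the
    # provenance categories, combined with prior probabilities for each category.
    word_scores    = {"poor": -4.0, "pour": -3.5}
    speaker_scores = {"adult male": -1.0, "adult female": -2.5}
    accent_scores  = {"RP": -0.5, "Scottish": -2.0}
    channel_scores = {"quiet": -3.0, "telephone": -0.8}

    def joint_score(word, speaker, accent, channel):
        # One possible (naive) combination: assume the factors contribute
        # independently and add their log scores
        return (word_scores[word] + speaker_scores[speaker]
                + accent_scores[accent] + channel_scores[channel])

    best = max(itertools.product(word_scores, speaker_scores,
                                 accent_scores, channel_scores),
               key=lambda h: joint_score(*h))
    print(best)   # the word is recognised together with its provenance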

Bibliography

  • M. A. Huckvale, "Exploiting Speech Knowledge in Neural Nets for Recognition", Speech Communication 9 (1990), 1-14.

  • R. P. Lippmann, "Speech recognition by machines and humans", Speech Communication 22 (1997), 1-15.