Mark Huckvale - Accents Research


A speaker's accent marks him or her as a member of a group. These groups have been defined by geographical areas, by socio-economic class, by ethnicity, or for second language speakers, by the identity of the speaker's first language. As listeners sensitive to accents we can tell whether a speaker belongs to our group, and as talkers we adapt our speech when we want to belong (or appear to belong) to a different group. For this to be possible it must be the case that accents are stylised patterns of speaking that recur across members of the group. We assume these patterns affect word frequency, the phonological coding used in the pronunciation lexicon, the phonetic realisation of phonological units and the prosody of utterances.

Click to hear an example sentence spoken in 14 different accents of the British Isles. (Samples from the ABI Corpus).

Speech technology has yet to deal adequately with pronunciation variation across accent groups. In speech recognition, a mismatch in accent between the speakers used in testing and training can lead to a 30% increase in word error rate. In speech synthesis, synthetic voices are fixed in one accent due to the increasing use of corpus-based synthesis methods operating from the speech of a single speaker.

There are a number of reasons that make accent variation difficult for contemporary speech modelling techniques:

  • Accent variation is not just a shift in phonetic realisation: accents differ in their inventory of phonological segments and their distribution in the lexicon. This means that keeping one dictionary and adapting the mean spectral realisations of phone models is insufficient.
  • Phonetic variation can involve large spectro-temporal changes in realisation: for example, monophthongs can become diphthongs, plosives can become fricatives, and segments can be inserted and deleted. Phone models which are good models of spectro-temporal variation of a phonological unit in one accent may be poor models in another. This means that adapting the dictionary but keeping one set of phone models is also not sufficient.
  • Databases of speech used for training recognisers are not well controlled for accent: it is likely that any given phone model is trained with speech from a number of accent groups. Such impure models confuse attempts at dealing with accents by phonetic and phonological adaptation. A model of /A:/ containing both [{] and [A:] may be useful for modelling "bath" but not "palm".
  • Sociolinguists define accent groups according to convenient cultural indicators rather than on the basis of similarity: it is unlikely that all the known groups are necessary or sufficient. In addition, because the groups are not defined by objective similarity, it is hard to find a representative sample of speakers of an accent.
  • Accent variation is only one component of variability of a speaker: speakers also differ according to their age, size, sex, voice quality, speaking style or emotion, and recordings are affected by environment, background noise and the communication channel. But since accent is a characteristic of a group of speakers, it is hard to control these other influences.

Thus speech technology could benefit from modelling techniques which are sensitive to the particular character of accent variation. Better modelling of accents would allow recognition systems to accommodate speakers from a wide range of accents, including second language speakers. A better understanding of the acoustic-phonetic structure of accents might lead to means for morphing voices across accents which could allow concatenative synthesis systems to speak in multiple accents. Finally, better definitions of accent groups could lead to new sociolinguistic insights into how groups form and change.

Approaches to Modelling an Accent Group

There have been three basic approaches to modelling pronunciation variation due to accent with the aim of accent characterisation and recognition:

Global acoustic distribution

The simplest way to characterise an accent group is to make a model of the probability distribution of the acoustic vectors recorded from a set of speakers from one group. For example, Huang et al. modelled four regional accents of Mandarin using a Gaussian mixture model with 32 components to model the pdf of spectral envelope features from 1440 speakers. Accent recognition can then be performed without using a known text or requiring phonetic labelling: Huang et al. achieved an accent recognition rate of 85% using gender-dependent models. However, such a global model seems a crude way to model differences in phonetics and phonology, particularly when the models also capture other sources of speaker variability.
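The global-distribution approach can be sketched as follows. This is only an illustration, not the system described by Huang et al.: it uses scikit-learn, synthetic random vectors stand in for real spectral envelope features, and the mixtures have 4 components rather than 32 for speed. One GMM is fitted per accent group, and an unknown speaker's frames are assigned to the accent whose model gives them the highest likelihood.

```python
# Sketch of the global-acoustic-distribution approach: one Gaussian mixture
# per accent group over spectral feature vectors, classification by
# total log-likelihood. All data here are synthetic stand-ins.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical training data: one (n_frames, n_features) array per accent.
train = {
    "accent_A": rng.normal(0.0, 1.0, size=(500, 12)),
    "accent_B": rng.normal(1.5, 1.0, size=(500, 12)),
}

# Fit one GMM per accent group.
models = {
    accent: GaussianMixture(n_components=4, covariance_type="diag",
                            random_state=0).fit(frames)
    for accent, frames in train.items()
}

def recognise(frames):
    """Return the accent whose GMM assigns the highest mean log-likelihood."""
    return max(models, key=lambda a: models[a].score(frames))

test_frames = rng.normal(1.5, 1.0, size=(200, 12))
print(recognise(test_frames))  # frames drawn near accent_B's distribution
```

Note that no text or phonetic labelling is needed: the model sees only a bag of feature vectors, which is both the method's convenience and, as noted above, its crudeness.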

Accent-specific phone models

Having known text read by speakers of known accent groups allows the building of a set of phone models for each accent. The models can be used in accent recognition simply by finding which phone set gives the highest probability to any unknown test utterance. For example, Teixeira et al. obtained about 65% accent recognition rate for five foreign accented English speaker groups. The weakness of this approach is that phonological variation is not exploited, since the recognisers do not necessarily use the same best phone transcription for the utterance. When the text and a phonological transcription are known, the accent can be found using the same phone sequence for all sets and performance is much higher. For example, Arslan & Hansen obtained a 93% accent recognition rate for four foreign accented English speaker groups. However, such an approach requires that sufficient data be available to build phone models, and that this data come from a range of speakers so as to accommodate speaker variability. Thus it assumes that accent groups are known and that training speakers can be assigned to groups.
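The accent-specific phone model approach reduces to a likelihood comparison, which the toy sketch below illustrates. It is deliberately simplified: single one-dimensional Gaussians stand in for HMM phone models, and the phone inventory, means and frames are all invented. The key point it demonstrates is scoring the same known phone sequence under every accent's model set and taking the argmax.

```python
# Toy accent recognition with accent-specific phone models: the phone
# sequence is known, and each accent supplies its own model per phone
# (1-D Gaussians here instead of HMMs). All values are invented.
import math

# accent -> phone -> (mean, stddev) of a scalar "acoustic" feature
phone_models = {
    "accent_A": {"a": (1.0, 0.5), "t": (4.0, 0.5)},
    "accent_B": {"a": (2.0, 0.5), "t": (4.0, 0.5)},
}

def log_gauss(x, mean, sd):
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def utterance_loglik(frames, phones, models):
    """Sum log-likelihoods of aligned (frame, phone) pairs under one accent's models."""
    return sum(log_gauss(x, *models[p]) for x, p in zip(frames, phones))

def recognise_accent(frames, phones):
    """Score the SAME phone sequence under every accent's model set."""
    return max(phone_models,
               key=lambda a: utterance_loglik(frames, phones, phone_models[a]))

# Frames whose /a/ realisations sit closer to accent_B's model.
print(recognise_accent([2.1, 1.9, 4.0], ["a", "a", "t"]))
```

Using one fixed phone sequence for all accents corresponds to the higher-performing known-transcription condition described above; letting each recogniser pick its own best transcription corresponds to the weaker condition.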

Analysis of pronunciation system

While accent recognition based on accent-specific phone models works well for a small number of varieties of foreign-accented English, it is not clear that the technique would scale well to the problem of dealing with a larger number of more similar regional accents of a language. We believe a more sensitive technique could come from a study of a speaker's pronunciation system rather than their acoustic quality. Barry et al. developed a regional accent recognition technique based on acoustic comparisons made within one known sentence. Formant frequency differences between vowels in known words were used to assign the speaker to one of four English regional accents with an accuracy of 74%.

Barry's idea to look at the relationship between the realisations of known segments rather than their absolute spectral quality was recently advanced further by the work of Nobuaki Minematsu. His idea was to perform cluster analysis on a set of phone models for a single speaker, then study the resulting phonetic tree to establish the pronunciation habits of the speaker. By this, Minematsu hoped to identify where the speaker's pronunciation differed from some norm. However in our research we take Minematsu's idea a step further, and apply it to the problem of accent characterisation and recognition. We use the similarities between segments to characterise the pronunciation system for a speaker, then compare their pronunciation system with average pronunciation systems for known accent groups to recognise their accent.

Accent Recognition Experiments

The accent recognition experiments aim to use the structure of an individual's pronunciation system to help both characterise and identify a speaker's accent. Results so far have been very promising, with accent recognition accuracy approaching 90% for a task involving 14 regional accent groups of British English (a task considerably harder than had been attempted previously).

First results of these experiments were presented at the International Conference on Spoken Language Processing, in Korea, October 2004, in a paper called "ACCDIST: a metric for comparing speakers' accents". The abstract of this is below:

ACCDIST: a metric for comparing speakers' accents

This paper introduces a new metric for the quantitative assessment of the similarity of speakers' accents. The ACCDIST metric is based on the correlation of inter-segment distance tables across speakers or groups. Basing the metric on segment similarity within a speaker ensures that it is sensitive to the speaker's pronunciation system rather than to his or her voice characteristics. The metric is shown to have an error rate of only 11% on the accent classification of speakers into 14 English regional accents of the British Isles, half the error rate of a metric based on spectral information directly. The metric may also be useful for cluster analysis of accent groups.

Results of clustering accents using the ACCDIST metric. Interestingly, the cluster analysis separates the Northern British Isles from the South, and separates both of these from the accents of Scotland, Ulster and Newcastle.

Some early results of the ACCDIST metric were presented at the British Association of Academic Phoneticians' conference in Cambridge. You can view the slides used in that talk.

IdioDictionary Research

Another direction of accents research concerns the automatic construction of an IdioDictionary, a pronunciation lexicon containing a set of expected phonetic transcriptions for a single speaker. The challenge is to create such a dictionary after hearing just a small amount of speech from the person. The work is being done by Michael Tjalve, a PhD student in the department. The current approach marks up a standard dictionary with accent features: binary tags that identify a pronunciation as having a specific accent property ("rhotic" for example). We then make decisions about whether the speaker demonstrates that property in his speech from a small sample of known material. The presence or absence of features in the speech allows us to create a dictionary specific to the speaker which covers all words in the language. Speech recognition performance improvements with an IdioDictionary are currently modest but show some potential.
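The dictionary-construction step described above can be sketched as follows. The words, transcriptions and feature names are invented examples (not the project's actual lexicon or feature set), and the detection of features from the speech sample is assumed to have happened elsewhere; the sketch only shows how binary accent features select one variant per word from a marked-up base lexicon.

```python
# Sketch of building a speaker-specific dictionary from a base lexicon
# whose pronunciation variants are tagged with binary accent features.
# Lexicon contents and feature names are invented for illustration.

# word -> list of (transcription, {feature: required_value})
base_lexicon = {
    "car":  [("k a: r", {"rhotic": True}), ("k a:", {"rhotic": False})],
    "bath": [("b a: th", {"bath_broadening": True}), ("b ae th", {"bath_broadening": False})],
}

def build_idiodictionary(lexicon, speaker_features):
    """Select, for each word, the variant consistent with the speaker's features."""
    idio = {}
    for word, variants in lexicon.items():
        for trans, required in variants:
            if all(speaker_features.get(f) == v for f, v in required.items()):
                idio[word] = trans
                break
    return idio

# Features decided (elsewhere) from a small sample of known material:
speaker = {"rhotic": True, "bath_broadening": False}
print(build_idiodictionary(base_lexicon, speaker))
```

Because the features are properties of the pronunciation system rather than of individual words, a handful of decisions made on a small sample generalises to every word in the lexicon.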

Related Bibliography

  1. N. Malayath, H. Hermansky and A. Kain, "Towards decomposing the sources of variability in speech," in Proc. Eurospeech-97, vol. 1, pp. 497-500, Sept. 1997. PDF.
  2. Z. H. Hu, "Understanding and adapting to speaker variability using correlation-based principal component analysis", Ph.D. dissertation, OGI, Oct. 1999.
  3. J. J. Humphries and P.C. Woodland, "The Use of Accent-Specific Pronunciation Dictionaries in Acoustic Model Training," in Proc. ICASSP-98, vol.1, pp. 317-320, Seattle, USA, 1998.
  4. C. Teixeira, I. Trancoso and A. Serralheiro, "Accent Identification," in Proc. ICSLP-96, vol.3, pp. 1784-1787, 1996.
  5. J.H.L. Hansen and L.M. Arslan, "Foreign Accent Classification Using Source Generator Based Prosodic Features," in Proc. ICASSP-95, vol.1, pp. 836-839, 1995.
  6. P. Fung and W.K. Liu, "Fast Accent Identification and Accented Speech Recognition," in Proc. ICASSP-99, vol.1, pp. 221-224, 1999.
  7. K. Berkling, M. Zissman, J. Vonwiller and C. Cleirigh, "Improving Accent Identification Through Knowledge of English Syllable Structure," in Proc. ICSLP-98, vol.2, pp. 89-92, 1998.
  8. C. Huang, E. Chang, J.L. Zhou and K.F. Lee, "Accent Modeling Based On Pronunciation Dictionary Adaptation For Large Vocabulary Mandarin Speech Recognition," in Proc. ICSLP-2000, vol.3, pp. 818-821, Beijing, Oct. 2000.
  9. Giles, H. & Powesland, P.F., Speech Style and Social Evaluation, Academic Press, London, 1975.
  10. Wells, J.C., Accents of English, Cambridge University Press, 1982.
  11. Huang, C., Chang, E. & Chen, T., "Accent Issues in Large Vocabulary Continuous Speech Recognition", Microsoft Research China Technical Report, MSR-TR-2001-69, 2001.
  12. Taylor, P.A., and Black, A.W., "Concept-to-speech synthesis by phonological structure matching", Proc. EuroSpeech-99, 623-626, 1999.
  13. Ho, C-H., Vaseghi, S. & Chen, A., "Voice conversion between UK and US accented English", Proc. EuroSpeech-99, 2079-2082, 1999.
  14. Arslan, L.M., & Hansen, J.H.L., "Language accent classification in American English", Speech Communication, 18, 353-367, 1996.
  15. Barry, W.J., Hoequist, C.E. & Nolan, F.J., "An approach to the problem of regional accent in automatic speech recognition", Computer Speech and Language, 3, 355-366, 1989.
  16. Minematsu, N. & Nakagawa, S., "Visualization of Pronunciation Habits Based upon Abstract Representation of Acoustic Observations", Proc. Integration of Speech Technology into Learning 2000, pp.130-137, 2000.
  17. Accents of the British Isles Corpus
  18. Hidden Markov Modelling Toolkit
  19. Speech Filing System Tools


© 2004 Mark Huckvale University College London