Overview

A speaker's accent marks him or her as a member of a group. These groups may be defined by geographical area, by socio-economic class, by ethnicity, or, for second-language speakers, by the identity of the speaker's first language. As listeners sensitive to accents we can tell whether a speaker belongs to our group, and as talkers we adapt our speech when we want to belong (or appear to belong) to a different group. For this to be possible, accents must be stylised patterns of speaking that recur across members of the group. We assume these patterns affect word frequency, the phonological coding used in the pronunciation lexicon, the phonetic realisation of phonological units and the prosody of utterances.
[Audio] Click to hear an example sentence spoken in 14 different accents of the British Isles. (Samples from the ABI Corpus.)
Speech technology has yet to deal adequately with pronunciation variation across accent groups. In speech recognition, a mismatch in accent between the speakers used in training and testing can lead to a 30% increase in word error rate. In speech synthesis, synthetic voices are fixed in one accent, a consequence of the increasing use of corpus-based synthesis methods that operate from the speech of a single speaker. A number of properties of accent variation make it difficult for contemporary speech modelling techniques.
Thus speech technology could benefit from modelling techniques that are sensitive to the particular character of accent variation. Better modelling of accents would allow recognition systems to accommodate speakers from a wide range of accents, including second-language speakers. A better understanding of the acoustic-phonetic structure of accents might lead to methods for morphing voices across accents, which would allow concatenative synthesis systems to speak in multiple accents. Finally, better definitions of accent groups could lead to new sociolinguistic insights into how groups form and change.
Approaches to Modelling an Accent Group

There have been three basic approaches to modelling pronunciation variation due to accent with the aim of accent characterisation and recognition:

Global acoustic distribution

The simplest way to characterise an accent group is to model the probability distribution of the acoustic vectors recorded from a set of speakers from one group. For example, Huang et al. modelled four regional accents of Mandarin using a Gaussian mixture model with 32 components to model the pdf of spectral-envelope features from 1440 speakers. Accent recognition can then be performed without a known text and without phonetic labelling: Huang et al. achieved an accent recognition rate of 85% using gender-dependent models. However, such a global model is a crude way to capture differences in phonetics and phonology, particularly when the models also absorb other kinds of speaker variability.

Accent-specific phone models

Having known text read by speakers of known accent groups allows a set of phone models to be built for each accent. The models can be used in accent recognition simply by finding which phone set gives the highest probability to an unknown test utterance. For example, Teixeira et al. obtained about a 65% accent recognition rate for five foreign-accented English speaker groups. The weakness of this approach is that phonological variation is not exploited, since the recognisers do not necessarily use the same best phone transcription for the utterance. When the text and a phonological transcription are known, the accent can be found using the same phone sequence for all sets, and performance is much higher: Arslan & Hansen obtained a 93% accent recognition rate for four foreign-accented English speaker groups. However, such an approach requires that sufficient data be available to build phone models, and that this data come from a range of speakers so as to accommodate speaker variability.
It also assumes that accent groups are known in advance and that training speakers can be assigned to groups.

Analysis of pronunciation system

While accent recognition based on accent-specific phone models works well for a small number of varieties of foreign-accented English, it is not clear that the technique would scale well to a larger number of more similar regional accents of a language. We believe a more sensitive technique could come from a study of a speaker's pronunciation system rather than his acoustic quality. Barry et al. developed a regional accent recognition technique based on acoustic comparisons made within one known sentence. Formant-frequency differences between vowels in known words were used to assign the speaker to one of four English regional accents with an accuracy of 74%. Barry's idea of looking at the relationship between the realisations of known segments, rather than their absolute spectral quality, was recently advanced further by the work of Nobuaki Minematsu. His idea was to perform cluster analysis on a set of phone models for a single speaker, then study the resulting phonetic tree to establish the pronunciation habits of the speaker. By this, Minematsu hoped to identify where the speaker's pronunciation differed from some norm. In our research we take Minematsu's idea a step further and apply it to the problem of accent characterisation and recognition. We use the similarities between segments to characterise the pronunciation system of a speaker, then compare his pronunciation system with average pronunciation systems for known accent groups to recognise his accent.
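The global-acoustic-distribution approach described above can be sketched as follows. This is a minimal illustration only: it substitutes a single full-covariance Gaussian per accent for the 32-component mixture used by Huang et al., and the feature dimensions, accent names and data are invented.

```python
import numpy as np

def fit_gaussian(X):
    """Fit mean and covariance of acoustic vectors for one accent group."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
    return mu, cov

def avg_log_likelihood(X, mu, cov):
    """Average log-density of vectors X under N(mu, cov)."""
    d = X.shape[1]
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    diff = X - mu
    quad = np.einsum('ij,jk,ik->i', diff, inv, diff)  # Mahalanobis terms
    return np.mean(-0.5 * (quad + logdet + d * np.log(2 * np.pi)))

def recognise_accent(X, models):
    """Assign an utterance (rows = acoustic vectors) to the best-scoring accent."""
    return max(models, key=lambda a: avg_log_likelihood(X, *models[a]))

rng = np.random.default_rng(0)
# Invented 3-D "spectral envelope" features for two accent groups.
train = {
    'north': rng.normal([0, 0, 0], 1.0, size=(500, 3)),
    'south': rng.normal([2, 2, 2], 1.0, size=(500, 3)),
}
models = {a: fit_gaussian(X) for a, X in train.items()}

test = rng.normal([2, 2, 2], 1.0, size=(50, 3))
print(recognise_accent(test, models))  # -> south
```

Note that, as the text observes, the per-accent density absorbs speaker and channel variability along with any genuinely accent-related structure, which is why this approach is a crude baseline.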
Accent Recognition Experiments

The accent recognition experiments aim to use the structure of an individual's pronunciation system to help both characterise and identify a speaker's accent. Results so far have been very promising, with accent recognition accuracy approaching 90% on a task involving 14 regional accent groups of British English (a considerably harder task than had been attempted previously). First results of these experiments were presented at the International Conference on Spoken Language Processing in Korea, October 2004, in a paper entitled "ACCDIST: a metric for comparing speakers' accents".
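The pronunciation-system comparison described above can be sketched as follows. This is a hedged illustration of the general idea rather than the published ACCDIST procedure: each speaker is represented by a table of distances between his own average segment realisations, and a speaker is assigned to the accent group whose average table correlates best with his. The vowel labels and feature values are invented.

```python
import numpy as np

def distance_table(segments):
    """Build a speaker's inter-segment distance table: Euclidean distances
    between his own average realisations of each segment, flattened to the
    upper triangle. Absolute spectral quality cancels out; only the
    relationships between segments remain."""
    labels = sorted(segments)
    vecs = np.array([segments[l] for l in labels], dtype=float)
    diff = vecs[:, None, :] - vecs[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=2))
    iu = np.triu_indices(len(labels), k=1)
    return d[iu]

def accent_of(speaker, group_tables):
    """Assign the speaker to the accent whose average table correlates best."""
    t = distance_table(speaker)
    return max(group_tables,
               key=lambda a: np.corrcoef(t, group_tables[a])[0, 1])

# Invented 2-D vowel qualities (F1/F2-like values) for two accent norms:
# the two norms differ only in the realisation of the vowel 'e'.
north = {'a': [1.0, 0.0], 'e': [0.0, 1.0], 'o': [3.0, 3.0]}
south = {'a': [1.0, 0.0], 'e': [2.0, 2.0], 'o': [3.0, 3.0]}
groups = {'north': distance_table(north), 'south': distance_table(south)}

# A test speaker whose vowel system resembles the southern norm.
speaker = {'a': [1.1, 0.1], 'e': [2.1, 1.9], 'o': [2.9, 3.1]}
print(accent_of(speaker, groups))  # -> south
```

Because the comparison is made within a speaker before comparing across speakers, it is insensitive to overall differences in vocal-tract size or recording conditions, which is the advantage over the global-distribution and phone-model approaches.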
[Figure] Results of clustering accents using the ACCDIST metric. Interestingly, the cluster analysis separates the Northern British Isles from the South, and separates both of these from the accents of Scotland, Ulster and Newcastle.
Some early results of the ACCDIST metric were presented at the British Association of Academic Phoneticians' conference in Cambridge. You can view the slides used in that talk.
IdioDictionary Research

Another direction of accents research concerns the automatic construction of an IdioDictionary: a pronunciation lexicon containing a set of expected phonetic transcriptions for a single speaker. The challenge is to create such a dictionary after hearing just a small amount of speech from the person. The work is being done by Michael Tjalve, a PhD student in the department. The current approach marks up a standard dictionary with accent features: binary tags that identify a pronunciation as having a specific accent property ("rhotic", for example). We then decide, from a small sample of known material, whether the speaker demonstrates each property in his speech. The presence or absence of features in the speech allows us to create a dictionary specific to the speaker which covers all words in the language. Speech recognition performance improvements with an IdioDictionary are currently modest but show some potential.
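The feature-tagged lexicon idea can be sketched as follows. The words, transcriptions and feature names here are invented examples; the point is that a handful of binary decisions about the speaker selects one pronunciation per word across the whole vocabulary.

```python
# A minimal sketch of a feature-tagged lexicon: each entry lists variant
# pronunciations together with the binary accent features they require.
# All words, transcriptions and feature names are hypothetical.
LEXICON = {
    'car':  [({'rhotic'}, 'k aa r'), (set(), 'k aa')],
    'bath': [({'trap-bath-split'}, 'b aa th'), (set(), 'b a th')],
    'cat':  [(set(), 'k a t')],
}

def build_idiodictionary(lexicon, speaker_features):
    """Select, for each word, the variant whose feature requirements are
    all satisfied by the speaker's detected feature set, preferring the
    most feature-specific variant."""
    dictionary = {}
    for word, variants in lexicon.items():
        usable = [(feats, pron) for feats, pron in variants
                  if feats <= speaker_features]
        feats, pron = max(usable, key=lambda v: len(v[0]))
        dictionary[word] = pron
    return dictionary

# A speaker judged (from a small sample of known material) to be rhotic
# but without the trap-bath split:
print(build_idiodictionary(LEXICON, {'rhotic'}))
# -> {'car': 'k aa r', 'bath': 'b a th', 'cat': 'k a t'}
```

The attraction of this design is coverage: the feature decisions are made once from a small speech sample, but they determine speaker-specific pronunciations for every word in the base dictionary.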
Mark Huckvale Home Page | © 2004 Mark Huckvale University College London |