Mark Huckvale - Synthesis Research


Overview

In my research I take the view that speech technology is a tool to help us better understand how humans use spoken language to communicate, rather than an end in itself. My research work on speech synthesis has focussed on issues such as how to exploit recent phonological models of English in synthesis, how to use synthetic speech to test the relative intelligibility of accents, and how to teach an articulatory synthesizer to imitate speech. What is important is what we can learn about how humans plan, perceive and acquire speech through the manipulation and generation of speech signals using a computer.

You can read more about my analysis of the current state of speech synthesis research in Huckvale (2002).

On this page, I summarise some of my previous and ongoing research work in speech synthesis. You will find information about projects, papers and software. Elsewhere on the web site you can read about other research work in speech synthesis in the department.

Current Projects

Teaching an articulatory synthesizer to imitate speech

This work is being undertaken with Ian Howard, of the Sobell Department of the Institute of Neurology at UCL.

The idea behind the work is that the control of an articulatory synthesizer is too difficult to be programmed using knowledge gained from the study of human speech production. In other words, we should not expect to be able to write a computer program that converts phonological representations into articulatory control parameters using rules obtained from studying how humans do it. If you think about it, such a task is even more difficult than the one facing an infant learner. The fact that infants can imitate speech without knowing any articulatory rules tells us that the task should be soluble using only (i) an articulatory synthesizer, (ii) auditory analysis, and (iii) general learning principles.

So far in this work we have been investigating neural network models for the mapping between auditory representations and articulatory representations. These models are not trained using privileged information about articulator use but by exploring the space of output possibilities available to the synthesizer through babbling. You can read about some of the early work in the paper by Howard & Huckvale (2004) and you can hear some babbling and some imitated phrases on Ian's web site.
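To give a flavour of the babbling approach (this is only a rough sketch, not the model described in Howard & Huckvale (2004)), the Python fragment below pairs random articulatory settings with the auditory features they produce and then fits a network to the inverse mapping; the forward_model function is an invented stand-in for the articulatory synthesizer plus the auditory front end.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
N_ARTIC, N_AUDITORY = 8, 16                     # made-up parameter/feature counts
W = rng.normal(size=(N_ARTIC, N_AUDITORY))      # fixed random forward mapping

def forward_model(artic):
    # Invented stand-in for the articulatory synthesizer plus auditory analysis:
    # maps articulatory parameters to auditory features.
    return np.tanh(artic @ W)

# 1. Babble: explore the synthesizer's output space with random articulations.
babble_artic = rng.uniform(-1.0, 1.0, size=(5000, N_ARTIC))
babble_audit = forward_model(babble_artic)

# 2. Learn the inverse (auditory -> articulatory) mapping from the babbled pairs.
inverse_model = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=0)
inverse_model.fit(babble_audit, babble_artic)

# 3. Imitate: analyse a heard target, recover an articulation, re-synthesise it.
target_audit = forward_model(rng.uniform(-1.0, 1.0, size=(1, N_ARTIC)))
imitated_audit = forward_model(inverse_model.predict(target_audit))
print("mean auditory error:", np.abs(imitated_audit - target_audit).mean())

The point is that the learner never sees articulatory rules; it only ever sees the auditory consequences of its own babbled articulations.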


Wideband spectrograms of the sung input utterance "I'm half crazy, all for the love of you" (from the song Daisy) for a male speaker (A), and of the re-synthesised outputs generated by the direct (B) and distal retrained (C) imitation system.

More information coming soon ...

Previous Projects

ProSynth

This work was done in collaboration with Jill House and others.

ProSynth was a joint project with the University of Cambridge and the University of York funded by the EPSRC. Its focus was on the exploitation of non-linear hierarchical prosodic phonological structures in speech synthesis. The UCL part of the project concerned the intonation module and the underlying computational infrastructure. See Hawkins et al. (2000) for an overview.

In our intonation work we were concerned with how to represent the pitch accents used in reading a text within a hierarchical phonological representation. This involved studying prosodic phrasing (breaking text into intonational phrases) and the categorisation and assignment of pitch accents within the phrase. Our work on the phonetic interpretation of these structures involved modelling fundamental frequency contours and predicting the durations of syllabic constituents as a function of segmental content and phrase context.
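As a toy illustration of one part of the phonetic interpretation step (not the ProSynth model itself), the sketch below places invented F0 targets at accented syllables within a phrase and interpolates between them to produce a frame-by-frame fundamental frequency contour:

import numpy as np

# Hypothetical pitch-accent targets: (time in seconds, F0 in Hz) anchored to
# accented syllables and the phrase boundary; the values are invented.
targets = [(0.00, 110.0), (0.25, 180.0), (0.60, 150.0), (1.00, 90.0)]

frame_times = np.linspace(0.0, 1.0, 101)            # one value per 10 ms frame
t_pts, f0_pts = zip(*targets)
f0_contour = np.interp(frame_times, t_pts, f0_pts)  # piecewise-linear contour
print(f0_contour[:5], "...", f0_contour[-1])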


The ProSynth Windows application runs the all-prosodic synthesis system on any PC.

In our computational infrastructure work we made extensive use of XML to mark up linguistic representations, particularly hierarchical phonological representations and their phonetic interpretation. The ProSynth tools converted text to a hierarchical phonological form expressed in XML, and scripts then interpreted this form by fleshing out durations, fundamental frequency and segmental quality in context. The scripting language ProXML was designed to make it easy to state such knowledge declaratively. See Huckvale (1999).
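By way of illustration (the element names, attributes and rule below are invented and are not the actual ProSynth mark-up or ProXML syntax), the following Python fragment shows the general idea: a hierarchical phonological form held in XML is walked top-down and fleshed out with durations that depend on context:

from xml.etree import ElementTree as ET

# Invented element and attribute names: a phrase dominates syllables,
# which dominate segments.
doc = ET.fromstring("""\
<phrase accent="H*">
  <syllable strong="yes">
    <segment name="k"/><segment name="ae"/><segment name="t"/>
  </syllable>
</phrase>""")

# A declarative-style pass: interpret the structure by fleshing out durations
# in context (here, a toy rule lengthening segments in strong syllables).
for syll in doc.iter("syllable"):
    factor = 1.2 if syll.get("strong") == "yes" else 1.0
    for seg in syll.iter("segment"):
        seg.set("dur_ms", str(int(80 * factor)))

print(ET.tostring(doc, encoding="unicode"))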

The use of XML for mark-up within ProSynth predated much of the later work on the design of speech synthesis mark-up languages for text. A commentary on these languages is available in Huckvale (2001).

Quality evaluation

This work was done in collaboration with Yolanda Vazquez-Alvarez.

An evaluation of the reliability of the ITU-T P.85 recommended standard for the assessment of voice output systems was conducted using six English TTS systems. The P.85 standard is based on mean-opinion-score judgements made by a listening panel on a number of rating scales. The study looked at how the ranking of the six systems on these scales varied across four different text genres and across two listening sessions. Rankings were also compared with those from a much simpler pair-comparison test across the same genres and sessions. For the ITU test, a large degree of correlation was found across scales, implying that they were not really testing different aspects of the systems. Results were surprisingly similar across sessions, implying that listeners were indeed making real judgements. In comparison, the pair-comparison test gave (almost) identical rankings of the systems with far less variability, making statistically significant comparisons between systems possible, even across genres.
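The shape of the analysis can be sketched as follows, using invented ratings rather than our experimental data: compute a mean opinion score per system on each scale, rank the systems, and measure how strongly the rankings from different scales agree. A pair-comparison test, by contrast, asks listeners which of two samples they prefer and ranks systems by their number of wins.

import numpy as np
from scipy.stats import spearmanr

# Invented ratings for illustration only: rows are listeners, columns are the
# six TTS systems, scores on a 1-5 opinion scale for two of the rating scales.
rng = np.random.default_rng(1)
overall_quality = rng.integers(1, 6, size=(20, 6))
listening_effort = rng.integers(1, 6, size=(20, 6))

mos_a = overall_quality.mean(axis=0)       # mean opinion score per system, scale A
mos_b = listening_effort.mean(axis=0)      # mean opinion score per system, scale B
rho, _ = spearmanr(mos_a, mos_b)           # agreement between the two scale rankings
print("ranking on scale A:", np.argsort(-mos_a))
print("ranking on scale B:", np.argsort(-mos_b))
print("rank correlation between scales:", round(rho, 2))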



Soundjudge is a program to run the ITU P.85 assessment method for speech evaluation

Korean timing model

This work was done in collaboration with Hyunsong Chung.

The work studied the phonetic and phonological factors affecting the rhythm and timing of spoken Korean. Stepwise construction of a CART model was used to uncover the contribution and relative importance of phrasal, syllabic, and segmental contexts. The model was trained on a corpus of 671 read sentences, yielding 42,000 segments, each annotated with 69 linguistic features. On reserved test data, the best model showed a correlation coefficient of 0.73 with an RMS prediction error of 26 ms. Analysis of the tree during and after construction showed that phrasal structure had the greatest influence on segmental duration, with strong lengthening effects for the first and last syllable in the accentual phrase. Syllable structure and the manner features of surrounding segments had smaller effects on segmental duration. See Chung & Huckvale (2001) for details.
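For readers unfamiliar with this kind of modelling, the sketch below shows the general recipe on synthetic data (not the Korean corpus): fit a CART-style regression tree to coded linguistic features and report the correlation and RMS error of its duration predictions on held-out segments.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: each row is a segment described by ten coded
# linguistic features; the target is segment duration in milliseconds.
rng = np.random.default_rng(2)
X = rng.integers(0, 4, size=(5000, 10)).astype(float)
y = 60 + 15 * X[:, 0] + 10 * X[:, 1] + rng.normal(0, 20, size=5000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree = DecisionTreeRegressor(min_samples_leaf=50).fit(X_train, y_train)
pred = tree.predict(X_test)

corr = np.corrcoef(pred, y_test)[0, 1]
rmse = np.sqrt(np.mean((pred - y_test) ** 2))
print(f"held-out correlation {corr:.2f}, RMS error {rmse:.1f} ms")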

Software

ProSynth Deliverables
From the ProSynth deliverables web page you can download various speech synthesis data sets, and also the ProSynth Windows application.
VTDemo
VTDemo is an interactive Windows PC program for demonstrating how the quality of different speech sounds can be explained by changes in the shape of the vocal tract. With VTDemo you can move the articulators in a 2D simulation of the vocal tract cavity and hear in real time the consequences for the sound produced.
PhonWeb: Spoken Phonetic Transcription
This is a web-based system for replaying SAMPA-coded English phonemic transcription using a diphone synthesis method.
Speech Filing System (SFS)
SFS is a set of tools for speech research which also incorporates many elements relating to speech synthesis. These include a diphone synthesis-by-rule program, a formant synthesis-by-rule program, and a software formant synthesizer.

Recent Articles

Hawkins, S., House, J., Huckvale, M., Local, J., Ogden, R., "ProSynth: An integrated prosodic approach to device-independent, natural-sounding speech synthesis", Proc. ICSLP, Sydney, 1998. Download PDF.

Huckvale, M., "Representation and processing of linguistic structures for an all-prosodic synthesis system using XML", Proc. EuroSpeech 99, Budapest, Hungary, 1999. Download PDF.

Ogden, R., Hawkins, S., House, J., Huckvale, M., Local, J., Carter, P., Dankovicova, J., Heid, S., "ProSynth: an Integrated Prosodic Approach to Device-Independent Natural-Sounding Speech Synthesis", Computer Speech and Language, 14 (2000), 177-210. Read at Idea Library.

Hawkins, S., Heid, S., House, J., Huckvale, M., "Assessment of Naturalness in the ProSynth Speech Synthesis Project", IEE Workshop on Speech Synthesis, London, May 2000. Download PDF.

Huckvale, M., "The Use and Potential of Extensible Mark-Up (XML) in Speech Generation", in Keller et al., Improvements in Speech Synthesis, Wiley, 2001. [ISBN: 0471499854] Available at Amazon.com.

Chung, H., Huckvale, M., "Linguistic factors affecting timing in Korean with application to speech synthesis", Proc. EuroSpeech 2001, Aalborg, Denmark, Vol. 2, pp. 815-818. Download PDF.

Huckvale, M., "Speech Synthesis, Speech Simulation and Speech Science", Proc. International Conference on Spoken Language Processing, Denver, 2002, pp. 1261-1264. Download PDF.

Vazquez-Alvarez, Y., Huckvale, M., "The Reliability of the ITU-T P.85 Standard for the Evaluation of Text-to-Speech Systems", Proc. International Conference on Spoken Language Processing, Denver, 2002, pp. 329-332. Download PDF.

Howard, I., Huckvale, M., "Learning to control an articulatory synthesizer through imitation of natural speech", Summer School on Cognitive and Physical Models of Speech Production, Perception and Perception-Production Interaction, Lubmin, Germany, September 2004. Web site.



© 2005 Mark Huckvale, University College London