Voice in an existing language

The most common case is when someone wants to make their own voice into a synthesizer. Note that the issues in voice modeling of a particular speaker are still open research problems. Much of the quality of a particular voice comes mostly from the waveform generation method, but other aspects of a speaker such as intonation and duration, and pronunciation are all part of what makes that person's voice sound like them. All of the general-purpose voices we have heard in Festival sound like the speaker they were record from (at least as far as we know all the speakers), but they also don't have all the qualities of that person's voice, though they can be quite convincing for limited-domain synthesizers.

As a practical recommendation to make a new speaker in an existing supported language, you will need to consider

the Chapter called US/UK English Diphone Synthesizer deals with specifically building a new US or UK English voice. This is a relatively easy place to start, though of course we encourage reading this entire document.

Another possible solution to getting a new or particular voice is to do voice conversion, as is done at the Oregon Graduate Institute (OGI) [kain98] and elsewhere. OGI have already released new voices based on this conversion and may release the conversion code itself, though the license terms are not the same as those of Festival or this document.

Another aspect of a new voice in an existing language is a voice in a new dialect. The requirements are similar to those of creating a voice in a new language. The lexicon and intonation probably need to change as well as the waveform generation method (a new diphone database). Although much of the text analysis came probably be borrowed, be aware that simple things like number pronunciation can often change between dialects (cf. US and UK English).

We also do work on limited domain synthesis in the same framework. For limited domain synthesis, a reasonably small corpus is collected, and used to synthesize a much larger range of utterances in the same basic style. We give an example of recording a talking clock, which, although built from only 24 recordings, generates over a thousand unique utterances; these capture a lot of the latent speaker characteristics from the data.