US/UK English Diphone Synthesizer

When building a new diphone-based voice for a supported language, such as English, the upper parts of the system can mostly be taken from existing voices, thus making the building task simpler. Of course, things can still go wrong, and it's worth checking everything at each stage. This section gives the basic walkthrough for building a new US English voice. Support for building UK English (southern, RP dialect) voices is also provided this way. For building non-US/UK synthesizers see the Chapter called A Japanese Diphone Voice for a similar walkthrough that also covers the full text, lexicon and prosody issues which we can subsume in this example.

Recording a whole diphone set usually takes a number of hours, if everything goes to plan. Construction of the voice after recording may take another couple of hours, though much of this is CPU bound. Then hand-correction may take at least another few hours (depending on the quality). Thus, if all goes well, it is possible to construct a new voice in a day's work, though usually something goes wrong and it takes longer. The more time you spend making sure the data is correctly aligned and labeled, the better the results will be. While something can be made quickly, it can take much longer to do it very well.

For those of you who have ignored the rest of this document and are just hoping to get by with reading this section, good luck. It may be possible to do that, but considering the time you'll need to invest to build a voice, being familiar with the comments, at least in the rest of this chapter, may be well worth the time invested.

The tasks you will need to do are described in order below.

As with all parts of festvox, you must set the following environment variables to where you have installed versions of the Edinburgh Speech Tools and the festvox distribution

export ESTDIR=/home/awb/projects/speech_tools
export FESTVOXDIR=/home/awb/projects/festvox
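
Since every later script relies on these two variables, it may save time to confirm that they point at real directories before going further. The following is only a sketch; the function name check_festvox_env is hypothetical and is not part of festvox.

```shell
# Hypothetical helper (not part of festvox): report whether ESTDIR and
# FESTVOXDIR are set and point at existing directories.
check_festvox_env() {
    status=0
    for name in ESTDIR FESTVOXDIR; do
        # Read the value of the variable whose name is held in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set" >&2
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name=$value is not a directory" >&2
            status=1
        fi
    done
    return $status
}
```

Calling check_festvox_env after the two export lines prints nothing and returns success when both directories exist.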

The next stage is to select a directory in which to build the voice. You will need in the order of 500M of disk space to do this. It could be done in less, but it's better to have enough to start with. Make a new directory and cd into it

mkdir ~/data/cmu_us_awb_diphone
cd ~/data/cmu_us_awb_diphone
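
To check the disk space figure before recording anything, something like the following could be used. It is a sketch that relies only on the standard df and awk tools; the function name has_free_space and its default of 500 megabytes are our own choices for illustration.

```shell
# Hypothetical helper: succeed if the filesystem holding directory $1
# (default: current directory) has at least $2 megabytes free (default 500).
has_free_space() {
    dir=${1:-.}
    need_mb=${2:-500}
    # df -P keeps each filesystem on one line; -k reports 1K blocks.
    # The fourth column of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((need_mb * 1024)) ]
}
```

For example, has_free_space . 500 || echo "less than 500M free here" can be run from inside the new directory.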

By convention, the directory is named for the institution, the language (here, us English) and the speaker (awb, who actually speaks with a Scottish accent). Although it can be fixed later, the directory name is used when festival searches for available voices, so it is good to follow this convention.

Build the basic directory structure

$FESTVOXDIR/src/diphones/setup_diphone cmu us awb

The arguments to setup_diphone are the institution building the voice, the language, and the name of the speaker. If you don't have an institution we recommend you use net. There is an ISO standard for language names, though unfortunately it doesn't allow distinction between US and UK English, so in general we recommend you use the two letter form, though for US English use us and UK English use uk. The speaker name may or may not be their actual name.
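
The three arguments determine the names that appear throughout the rest of this section. As an illustration of the convention (the helper voice_names is hypothetical, and only reproduces the naming pattern visible in this walkthrough):

```shell
# Hypothetical helper: derive the conventional names from the three
# setup_diphone arguments (institution, language, speaker).
voice_names() {
    inst=$1
    lang=$2
    speaker=$3
    base="${inst}_${lang}_${speaker}_diphone"
    echo "directory:      $base"
    echo "voice file:     festvox/$base.scm"
    echo "voice function: (voice_$base)"
}

voice_names cmu us awb
```

Substituting your own institution, language and speaker shows what to expect in place of cmu_us_awb_diphone in the commands that follow.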

The setup script builds the basic directory structure and copies in various skeleton files. For languages us and uk it copies in files with many of the details filled in for those languages; for other languages the skeleton files are much more skeletal.

For constructing a us voice you must have the following installed in your version of festival

festvox_kallpc16k
festlex_POSLEX
festlex_CMU

And for a UK voice you need

festvox_rablpc16k
festlex_POSLEX
festlex_OALD

At run-time the two appropriate festlex packages (POSLEX + dialect specific lexicon) will be required but not the existing kal/rab voices.

To generate the nonsense word list

festival -b festvox/diphlist.scm festvox/us_schema.scm \
     '(diphone-gen-schema "us" "etc/usdiph.list")'

We use a synthesized voice to build waveforms of the prompts, both for actual prompting and for alignment. If you want to change the prompt voice (e.g. to a female voice) edit festvox/us_schema.scm. Near the end of the file is the function Diphone_Prompt_Setup. By default (for US English) the voice (voice_kal_diphone) is called. Change that, and the F0 value in the following line, if appropriate, to the voice you wish to follow.

Then to synthesize the prompts

festival -b festvox/diphlist.scm festvox/us_schema.scm \
      '(diphone-gen-waves "prompt-wav" "prompt-lab" "etc/usdiph.list")'

Now record the prompts. Care should be taken to set up the recording environment as well as possible. Note all power levels so that if more than one session is required you can continue and still get the same recording quality. Given the length of the US English list, it's unlikely a person can say all of these in one sitting without at least taking breaks, so ensuring the environment can be duplicated is important, even if it's only after a good stretch and a drink of water.

bin/prompt_them etc/usdiph.list

Note that a further argument can be given to state which nonsense word to begin prompting from. Thus if you have already recorded the first 100 you can continue with

bin/prompt_them etc/usdiph.list 101
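
If you lose track of where a session stopped, the restart number can be estimated by counting the recordings made so far. This sketch assumes that each recorded prompt is stored as one .wav file in the wav directory and that prompts were recorded in list order without gaps; the helper name next_prompt is hypothetical.

```shell
# Hypothetical helper: print the number of the next prompt to record,
# assuming one .wav file per recorded prompt in directory $1 (default wav).
next_prompt() {
    dir=${1:-wav}
    count=0
    for f in "$dir"/*.wav; do
        # If nothing matches, the pattern is left unexpanded; skip it.
        if [ -e "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo $((count + 1))
}
```

The result can then be passed on directly, as in bin/prompt_them etc/usdiph.list "$(next_prompt wav)". It is worth listening to the last recording first, in case it was cut short.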

See the Section called US phoneset in the Chapter called English phone lists for notes on pronunciation (or the Section called UK phoneset in the Chapter called English phone lists for the UK version).

The recorded prompts can then be labeled by

bin/make_labs prompt-wav/*.wav

It is always worthwhile correcting the autolabeling. Use

emulabel etc/emu_lab

and select FILE OPEN from the top menu bar, then place the other dialog box, click inside it and hit return. A list of all label files will be given. Double-click on each of these to see the labels, spectrogram and waveform. (** reference to "How to correct labels" required **).

Once the diphone labels have been corrected, the diphone index may be built by

bin/make_diph_index etc/usdiph.list dic/awbdiph.est

If no EGG signal has been collected you can extract the pitchmarks from the waveforms with the following command (though read the Section called Extracting pitchmarks from waveforms in the Chapter called Basic Requirements to ensure you are getting the best extraction).

bin/make_pm_wave wav/*.wav

If you do have an EGG signal then use the following instead

bin/make_pm lar/*.lar

A program to move the predicted pitchmarks to the nearest peak in the waveform is also provided. This is almost always a good idea, even for EGG extracted pitch marks

bin/make_pm_fix pm/*.pm
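
The choice between the two extraction routes can be summarized as follows. This sketch only prints the commands given above that apply, depending on whether any .lar files are present; the helper name pitchmark_plan is hypothetical.

```shell
# Hypothetical helper: print the pitchmark commands from this section that
# apply, depending on whether any EGG (.lar) files exist in lar/.
pitchmark_plan() {
    if ls lar/*.lar >/dev/null 2>&1; then
        echo 'bin/make_pm lar/*.lar'
    else
        echo 'bin/make_pm_wave wav/*.wav'
    fi
    # Moving pitchmarks to the nearest peak is suggested in both cases.
    echo 'bin/make_pm_fix pm/*.pm'
}
```

Run from the voice directory, pitchmark_plan shows the two commands to use; they can then be typed in by hand.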

Getting good pitchmarks is important to the quality of the synthesis; see the Section called Extracting pitchmarks from waveforms in the Chapter called Basic Requirements for more discussion.

Because there is often a power mismatch across a set of diphones, we provide a simple method for finding what general power differences exist between files. This finds the mean power for each vowel in each file and calculates a factor with respect to the overall mean vowel power. A table of power modifiers for each file can be calculated by

bin/find_powerfactors lab/*.lab

The factors calculated by this are saved in etc/powfacts.
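
To give a feel for the arithmetic, here is a toy calculation. Both the input values and the formula are assumptions made for illustration: the numbers are made up, and we take the factor to be the overall mean divided by the per-file mean, which may differ from what find_powerfactors actually computes.

```shell
# Illustration only: the input values are made up, and the formula
# (factor = overall mean / per-file mean) is an assumption, not taken
# from the find_powerfactors script itself.
power_factors() {
    # Input lines: "<file id> <mean vowel power>"
    awk '{ id[NR] = $1; p[NR] = $2; sum += $2 }
         END {
             if (NR == 0) exit
             mean = sum / NR
             for (i = 1; i <= NR; i++)
                 printf "%s %.2f\n", id[i], mean / p[i]
         }'
}

printf '%s\n' 'us_0001 2000' 'us_0002 1000' 'us_0003 3000' | power_factors
```

Under this assumption a file that is quieter than average receives a factor above 1 and a louder file a factor below 1.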

Then build the pitch-synchronous LPC coefficients, which use the power factors if they've been calculated.

bin/make_lpc wav/*.wav

Now the database is ready for its initial tests.

festival festvox/cmu_us_awb_diphone.scm '(voice_cmu_us_awb_diphone)'

When there has been no hand correction of the labels this stage may fail with diphones not having proper start, mid and end values. This happens when the automatic labeler has positioned two labels at the same point. For each diphone that has a problem, find out which file it comes from (grep for it in dic/awbdiph.est) and use emulabel to correct the labeling. For example, suppose "ah-m" is wrong; you'll find it comes from us_0314. Thus type

emulabel etc/emu_lab us_0314
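
Rather than waiting for each failure in turn, suspect entries can be listed in one pass. This sketch assumes that each entry line in the diphone index has five whitespace-separated fields (diphone name, file id, start, mid and end time) and that header lines have fewer; check this against your own dic/awbdiph.est before relying on it. The helper name bad_diphones is hypothetical.

```shell
# Hypothetical helper: list index entries whose start, mid and end times
# are not strictly increasing. Assumes each entry line has five fields
# (diphone, file id, start, mid, end); other lines are ignored.
bad_diphones() {
    awk 'NF == 5 && ($3 + 0 >= $4 + 0 || $4 + 0 >= $5 + 0) { print $1, $2 }' "$1"
}
```

Running bad_diphones dic/awbdiph.est prints one line per problem diphone with the file it comes from, which can then be opened in emulabel as above.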

After correcting labels you must re-run the make_diph_index command. You should also re-run the find_powerfactors and make_lpc stages, as these too depend on the labels, but they take longer to run and perhaps need only be done when you've corrected many labels.
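
Because these commands get repeated many times during correction, it can be convenient to wrap them. The following sketch uses only the commands already given in this section; the function names and the DRY_RUN variable are our own and are not part of festvox.

```shell
# Hypothetical wrapper around commands given earlier in this section.
# "quick" rebuilds only the diphone index; "full" also rebuilds the power
# factors and LPC coefficients. Set DRY_RUN=1 to print instead of run.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$@"; else "$@"; fi
}

rebuild_after_labels() {
    mode=${1:-quick}
    run bin/make_diph_index etc/usdiph.list dic/awbdiph.est || return 1
    if [ "$mode" = full ]; then
        run bin/find_powerfactors lab/*.lab || return 1
        run bin/make_lpc wav/*.wav || return 1
    fi
}
```

From the voice directory, rebuild_after_labels quick suits a single corrected label, and rebuild_after_labels full is for after a longer correction session.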

Test the voice's basic functionality with

festival> (SayPhones '(pau hh ax l ow pau))

festival> (intro)

As the autolabeling is unlikely to work completely you should listen to a number of examples to find out what diphones have gone wrong.

Finally, once you have corrected the errors (did we mention you need to check and correct the errors?), you can build a final voice suitable for distribution. First you need to create a group file which contains only the subparts of spoken words which contain the diphones.

festival festvox/cmu_us_awb_diphone.scm '(voice_cmu_us_awb_diphone)'
...
festival> (us_make_group_file "group/awblpc.group" nil)
...

The us_ in this function name confusingly stands for UniSyn (the unit concatenation subsystem in Festival) and has nothing to do with US English.

To test this, edit festvox/cmu_us_awb_diphone.scm and change the choice of databases used from separate to grouped. This is done by commenting out the line (around line 81)

(set! cmu_us_awb_db_name (us_diphone_init cmu_us_awb_lpc_sep))

and uncommenting the line (around line 84)

(set! cmu_us_awb_db_name (us_diphone_init cmu_us_awb_lpc_group))

The next stage is to integrate this new voice so that festival can find it automatically. To do this, you should add a symbolic link from the voice directory of Festival's English voices to the directory containing the new voice. First cd to festival's voice directory (this will vary depending on where you installed festival)

cd /home/awb/projects/festival/lib/voices/english/

add a symbolic link back to where your voice was built

ln -s /home/awb/data/cmu_us_awb_diphone
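
A link whose target is mistyped will be created without complaint but will not work, so it is worth checking that it resolves. A minimal sketch, in which the helper name link_voice is hypothetical and the check relies on the festvox subdirectory that the voice directory contains:

```shell
# Hypothetical helper: link voice directory $1 into Festival voice
# directory $2 under its own name, then confirm the link resolves.
link_voice() {
    src=$1
    voices=$2
    name=$(basename "$src")
    ln -s "$src" "$voices/$name" || return 1
    # -d follows the link, so this fails if the target does not exist.
    [ -d "$voices/$name/festvox" ]
}
```

The first argument should be an absolute path, for example link_voice /home/awb/data/cmu_us_awb_diphone /home/awb/projects/festival/lib/voices/english, since a relative link target is resolved relative to the directory holding the link.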

Now this new voice will be available for anyone running that version of festival (started from any directory)

festival
...
festival> (voice_cmu_us_awb_diphone)
...
festival> (intro)
...

The final stage is to generate a distribution file so the voice may be installed on others' festival installations. Before you do this you must add a file COPYING to the directory you built the diphone database in. This should state the terms and conditions under which people may use, distribute and modify the voice.

Generate the distribution tarfile in the directory above the festival installation (the one where the festival/ and speech_tools/ directories are).

cd /home/awb/projects/
tar zcvf festvox_cmu_us_awb_lpc.tar.gz \
  festival/lib/voices/english/cmu_us_awb_diphone/festvox/*.scm \
  festival/lib/voices/english/cmu_us_awb_diphone/COPYING \
  festival/lib/voices/english/cmu_us_awb_diphone/group/awblpc.group
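
Before passing the tarfile on, it may be worth confirming that all three kinds of file made it in. This sketch lists the archive with tar tzf and looks for each one; the helper name check_dist is hypothetical.

```shell
# Hypothetical helper: check that a distribution tarball lists the three
# kinds of file packaged above (voice definition, COPYING, group file).
check_dist() {
    listing=$(tar tzf "$1") || return 1
    for pattern in '/festvox/.*\.scm$' '/COPYING$' '/group/.*\.group$'; do
        if ! printf '%s\n' "$listing" | grep -q "$pattern"; then
            echo "missing from $1: $pattern" >&2
            return 1
        fi
    done
}
```

For example, check_dist festvox_cmu_us_awb_lpc.tar.gz returns success silently when nothing is missing.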

The complete files from building an example US voice based on the KAL recordings are available at http://festvox.org/examples/cmu_us_kal_diphone/.