Basic Requirements

This section identifies the basic requirements for building a voice in a new language, and adding a new voice in a language already supported by Festival.

Hardware/software requirements

Because we are most familiar with a Unix environment the scripts, tools etc. assume such a basic environment. This is not to say you couldn't run these scripts on other platforms as many of these tools are supported on platforms like WIN32, its just that in our normal work environment, Unix is ubiquitous and we like working in it. Festival also runs on Win32 platforms.

Much of the testing was done under Linux; wherever possible, we are using freely available tools. We are happy to say that no non-free tools are required to build voices, and we have included citations and/or links to everything needed in this document.

We assume Festival 2.4 and the Edinburgh Speech Tools 2.4.

Note that we make an extensive use of the Speech Tools programs, and you will need the full distribution of them as well as Festival, rather than the run-time (binary) only versions which are available for some Linux platforms. If you find the task of compiling Festival and the speech tools daunting, you will probably find the rest of the tasks specified in this document more so. However, it is not necessary for you to have any knowledge of C++ to make voices, though familiarity with text processing techniques (e.g. awk, sed, perl) will make understanding the examples given much easier.

We also assume a basic knowledge of Festival, and of speech processing in general. We expect the reader to be familiar with basic terms such as F0, phoneme, and cepstrum, but not in any real detail. References to general texts are given (when we know them to exist). A basic knowledge of programming in Scheme (and/or Lisp) will also make things easier. A basic capability in programming in general will make defining rules, etc., much easier.

If you are going to record your own database, you will need recording equipment: the higher quality, the better. A proper recording studio is ideal, though may not be available for everyone. A cheap microphone stuck on the back of standard PC is not ideal, though we know most of you will end up doing that. A high quality sound board, close-talking, high quality microphone and a nearly soundproof recording environment will often be the compromise between these two extremes.

Many of the techniques described in here require a fair amount of processing time to achieve, though machines are indeed getting faster and this is becoming less of an issue (though we always find ways to use up the more CPU time that is available). If you use the provided aligner for labeling diphones you will need a processor of reasonable speed, likewise for the various training techniques for intonation, duration modeling and letter-to-sound rules. Nothing presented here takes weeks though a number of processes may be over-night jobs, depending on the speed of your machine, and size of your database.

Also we think that you will need a little patience. The process of building a voice is not necessarily going to work first time. It may even fail completely, so if you don't expect anything special, you wont be disappointed.