Building lexicons for new languages

Traditionally, building a new lexicon for a language was a significant piece of work, taking several expert phonologists perhaps several years to reach reasonable coverage. However, we include a method here that can cut this time significantly, using the basic technology provided with this documentation.

The basic idea is to add the most common words to a lexicon, explicitly giving their pronunciations by hand, and then automatically build letter to sound rules from this initial data. Next, take the next most common words, submit them to the system, and check the pronunciations it predicts. If a pronunciation is wrong it is corrected and added to the lexicon; if it is correct it is added to the lexicon as is. After each pass the letter to sound rules are regenerated with the new data, making them more accurate. Over multiple passes both the lexicon and the letter to sound rules will improve.
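The loop described above can be sketched in a few lines of Python. This is only an illustration under strong assumptions, not the training method provided with this documentation: the function names (train_lts, predict, bootstrap) are hypothetical, the human checker is simulated by a small hand-written table, and the letter to sound "rules" are a toy stand-in that maps each letter to the phone most often seen with it, using only words that have exactly one phone per letter.

```python
from collections import Counter, defaultdict


def train_lts(lexicon):
    """Toy stand-in for letter to sound rule training (an assumption for
    illustration): map each letter to the phone most often paired with it.
    Only entries with exactly one phone per letter are used."""
    counts = defaultdict(Counter)
    for word, phones in lexicon.items():
        if len(word) == len(phones):
            for letter, phone in zip(word, phones):
                counts[letter][phone] += 1
    return {letter: c.most_common(1)[0][0] for letter, c in counts.items()}


def predict(rules, word):
    """Predict a pronunciation; letters never seen in training become '?'."""
    return [rules.get(letter, "?") for letter in word]


def bootstrap(words_by_frequency, seed_lexicon, check, batch_size):
    """Grow a lexicon in passes over a frequency-ordered word list.

    check(word, guess) stands in for the human: it returns the correct
    pronunciation, which equals guess when the prediction was right.
    """
    lexicon = dict(seed_lexicon)
    todo = [w for w in words_by_frequency if w not in lexicon]
    corrections_per_pass = []
    while todo:
        rules = train_lts(lexicon)          # regenerate rules each pass
        batch, todo = todo[:batch_size], todo[batch_size:]
        corrected = 0
        for word in batch:
            guess = predict(rules, word)
            truth = check(word, guess)
            if truth != guess:
                corrected += 1              # wrong: hand-corrected entry
            lexicon[word] = truth           # right or wrong, word is added
        corrections_per_pass.append(corrected)
    return lexicon, train_lts(lexicon), corrections_per_pass


# Hypothetical data for a made-up, fully regular language.
gold = {
    "ma": ["m", "a"], "na": ["n", "a"],
    "mi": ["m", "i"], "ni": ["n", "i"],
    "mana": ["m", "a", "n", "a"],
    "mini": ["m", "i", "n", "i"],
    "nima": ["n", "i", "m", "a"],
}
seed = {w: gold[w] for w in ("ma", "na")}   # entries given by hand
lexicon, rules, corrections = bootstrap(
    list(gold), seed, check=lambda word, guess: gold[word], batch_size=2
)
print(corrections)
```

In this made-up example only the first pass needs hand corrections; once the letter "i" has been seen, the regenerated rules predict the remaining words correctly. Real languages are much less regular, and real letter to sound training must also handle letters that map to no phone or to several phones.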

This technique has proved successful for a number of languages, cutting the amount of time and effort to perhaps checking thousands of words rather than tens of thousands of words. It is also a structured method that requires only knowledge of the basic language to carry out. Good lexicons can be generated in as little as a couple of weeks, though getting greater than 95% correctness of words in a language could still take several months' work.

As stated above you can never list all the words in a language, but with greater than 95% coverage and letter to sound rule accuracy greater than 75% you will have a lexicon that is competitive with those that take many years to build. In fact, because you can build a lexicon in a shorter time it is more likely to be consistent and therefore better for synthesis.
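To see roughly what these two figures mean together, the following sketch combines them into a single expected rate of correctly pronounced words. It rests on assumptions that the text does not state: that coverage is the fraction of words in running text found in the lexicon, that every lexicon entry is correct, and that the letter to sound accuracy is the fraction of out-of-lexicon words pronounced entirely correctly. The function name is hypothetical.

```python
def expected_word_accuracy(coverage, lts_accuracy):
    """Expected fraction of words pronounced correctly, assuming lexicon
    entries are always right and all other words go through the letter
    to sound rules."""
    return coverage + (1.0 - coverage) * lts_accuracy


# 95% of words come from the lexicon; 75% of the rest are predicted correctly.
overall = expected_word_accuracy(0.95, 0.75)
print(f"{overall:.4f}")
```

Under these assumptions the two thresholds correspond to about 98.75% of words being pronounced correctly, which suggests why a modest letter to sound accuracy is acceptable once lexicon coverage is high.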