Examples of non-real-time but highly accurate intonation control in formant synthesis include the work done in the late 1970s for the Texas Instruments toy Speak & Spell, and in the early 1980s for Sega arcade machines. Creating proper intonation for these projects was painstaking, and the results have yet to be matched by real-time text-to-speech interfaces.
Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences. Typically, the division into segments is done using a specially modified speech recognizer set to a "forced alignment" mode, with some manual correction afterward using visual representations such as the waveform and spectrogram. An index of the units in the speech database is then created based on the segmentation and on acoustic parameters such as the fundamental frequency (pitch), duration, position in the syllable, and neighboring phones. At run time, the desired target utterance is created by determining the best chain of candidate units from the database (unit selection). This process is typically achieved using a specially weighted decision tree.
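The idea of choosing "the best chain of candidate units" can be illustrated with a small sketch. It does not use the weighted decision tree mentioned above; instead it assumes a commonly described alternative formulation, a Viterbi-style dynamic-programming search that minimizes a weighted sum of target costs (how well a unit matches the requested pitch and duration) and join costs (how audible the seam between two units is likely to be). The names `Unit`, `Target`, `target_cost`, `join_cost`, and `select_units`, as well as the cost formulas and the toy database, are hypothetical and chosen only for illustration.

```python
from collections import namedtuple

# Hypothetical record types: a database unit and a requested target.
Unit = namedtuple("Unit", "phone pitch_hz duration_ms source_id")
Target = namedtuple("Target", "phone pitch_hz duration_ms")


def target_cost(target, unit):
    """Illustrative mismatch between a requested target and a database unit."""
    return (abs(target.pitch_hz - unit.pitch_hz) / 10.0
            + abs(target.duration_ms - unit.duration_ms) / 20.0)


def join_cost(prev, unit):
    """Illustrative seam cost: units from the same recording join for free."""
    if prev.source_id == unit.source_id:
        return 0.0
    return 1.0 + abs(prev.pitch_hz - unit.pitch_hz) / 10.0


def select_units(targets, candidates, w_target=1.0, w_join=1.0):
    """Return (chain, total_cost) for the cheapest chain, one unit per target."""
    # best[i][j] = (cost of the cheapest chain ending in candidates[i][j],
    #               index of the predecessor in candidates[i - 1])
    best = [[(w_target * target_cost(targets[0], u), None)
             for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + w_join * join_cost(prev, u), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + w_target * target_cost(targets[i], u), back))
        best.append(row)

    # Backtrack from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    chain = []
    for i in range(len(targets) - 1, -1, -1):
        chain.append(candidates[i][j])
        j = best[i][j][1]
    chain.reverse()
    return chain, total


# Toy database for the phone sequence k, ae, t (all values are made up).
targets = [Target("k", 120, 60), Target("ae", 120, 100), Target("t", 120, 60)]
candidates = [
    [Unit("k", 120, 60, "rec1"), Unit("k", 130, 60, "rec2")],
    [Unit("ae", 140, 100, "rec1"), Unit("ae", 130, 100, "rec2")],
    [Unit("t", 120, 60, "rec3"), Unit("t", 130, 70, "rec2")],
]
chain, total = select_units(targets, candidates)
```

In this toy database the search picks the three units cut from `rec2`, even though other candidates match the individual targets more closely, because contiguous units avoid join costs. That trade-off between matching the target and avoiding audible seams is the core of unit selection, whatever search method a real system uses.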
Many systems based on formant synthesis technology generate artificial, robotic-sounding speech that would never be mistaken for human speech. However, maximum naturalness is not always the goal of a speech synthesis system, and formant synthesis systems have advantages over concatenative systems. Formant-synthesized speech can be reliably intelligible, even at very high speeds, avoiding the acoustic glitches that commonly plague concatenative systems. High-speed synthesized speech is used by the visually impaired to quickly navigate computers using a screen reader. Formant synthesizers are usually smaller programs than concatenative systems because they do not have a database of speech samples. They can therefore be used in embedded systems, where memory and microprocessor power are especially limited. Because formant-based systems have complete control of all aspects of the output speech, a wide variety of prosodies and intonations can be output, conveying not just questions and statements, but a variety of emotions and tones of voice.
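To make the point about program size concrete, the following is a minimal sketch of a formant synthesizer for a single steady vowel. It assumes a simple source-filter design: an impulse train at the fundamental frequency is passed through a cascade of second-order digital resonators, one per formant. The function names, the formant frequencies and bandwidths for an "ah"-like vowel, and the default parameters are illustrative assumptions, not values taken from any particular system.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Filter a signal through one second-order resonator (one formant)."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(formants, f0_hz=120.0, duration_s=0.3, sample_rate=16000):
    """Return samples in [-1, 1] for a steady vowel with the given formants."""
    n = int(round(duration_s * sample_rate))
    period = int(round(sample_rate / f0_hz))
    # Glottal source approximated by an impulse train (an assumption;
    # practical synthesizers use smoother pulse shapes).
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]


# Assumed (frequency, bandwidth) pairs in Hz for an "ah"-like vowel.
AH_FORMANTS = [(730.0, 90.0), (1090.0, 110.0), (2440.0, 170.0)]
samples = synthesize_vowel(AH_FORMANTS)
```

The whole voice here is described by a handful of numbers rather than by stored recordings. Changing `f0_hz` over time would alter the intonation, and changing the formant table would alter the vowel, which is the kind of direct control over the output that the paragraph above describes.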
Articulatory synthesis refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there. The first articulatory synthesizer regularly used for laboratory experiments was developed at Haskins Laboratories in the mid-1970s by Philip Rubin, Tom Baer, and Paul Mermelstein. This synthesizer, known as ASY, was based on vocal tract models developed at Bell Laboratories in the 1960s and 1970s by Paul Mermelstein, Cecil Coker, and colleagues.
Concatenative synthesis is based on the concatenation (or stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. There are three main sub-types of concatenative synthesis.