Only within the past few years have systematic optimization techniques come into general use for inventory design, unit segmentation, unit selection, and unit combination algorithms. The general approach is to define a perceptually reasonable acoustic distortion metric and use it in a global comparison of alternatives (in allophonic clustering, in segmentation points, in unit selection, and so on). To make this method work effectively, one must usually design the overall system specifically with such a process in view. Psychological tests would be the optimal basis of such an effort, but objective (if psychologically motivated) distortion metrics have the advantage of being quicker and cheaper. Although such objective distortion metrics are far from a perfect image of the human judgments that provide the ultimate evaluation of any synthesis system, they usually provide the only feasible way to perform the massive and systematic comparison of alternatives that is needed.
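
As a minimal sketch of the global-comparison idea, the following Python fragment scores candidate units against a target and picks the one with the lowest distortion. The Euclidean distance over feature vectors is only a stand-in for a perceptually motivated metric; the function names, the feature values, and the unit labels are illustrative assumptions and do not come from any system discussed here.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def unit_distortion(target, candidate):
    """Mean frame-wise distance between two units with equal frame counts."""
    if len(target) != len(candidate):
        raise ValueError("units must have the same number of frames")
    total = sum(frame_distance(t, c) for t, c in zip(target, candidate))
    return total / len(target)

def select_unit(target, candidates):
    """Return (name, distortion) of the candidate closest to the target."""
    scored = [(unit_distortion(target, frames), name)
              for name, frames in candidates.items()]
    best_score, best_name = min(scored)
    return best_name, best_score

# Hypothetical two-frame units with two features per frame.
target = [[1.0, 0.0], [0.5, 0.5]]
candidates = {
    "unit_a": [[1.0, 0.0], [0.5, 0.5]],
    "unit_b": [[0.0, 1.0], [0.5, 0.5]],
}
best_name, best_score = select_unit(target, candidates)
```

The same scoring function could in principle be reused to compare segmentation points or cluster assignments, which is what makes an objective metric attractive for large-scale comparison.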

Considerable success has been achieved by systems that base sound generation on concatenation of natural speech units (Moulines et al., 1990). Sophisticated techniques have been developed to manipulate these units, especially with respect to duration and fundamental frequency. The most important aspects of prosody can be imposed on synthetic speech without considerable loss of quality. The pitch-synchronous overlap-add (PSOLA) methods (Charpentier and Moulines, 1990) are based on concatenation of waveform pieces. The frequency domain approach (FD-PSOLA) is used to modify the spectral characteristics of the signal; the time domain approach (TD-PSOLA) provides efficient solutions for real-time implementation of synthesis systems. Earlier systems such as SOLA (Roucos and Wilgus, 1985) and systems for divers' speech restoration (Liljencrants, 1974) also processed the waveform directly.
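
The overlap-add principle behind the time domain approach can be sketched as follows. This is a simplified illustration, not the published TD-PSOLA algorithm: it assumes a perfectly periodic signal whose pitch marks are already known and evenly spaced, and the function names and parameters are hypothetical. Two-period segments centered on each pitch mark are Hann-windowed and then re-added at a new spacing, which changes the pitch period.

```python
import math

def hann(n):
    """Periodic Hann window of length n; copies shifted by n/2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]

def overlap_add(signal, marks, period, new_period):
    """Re-space two-period windowed segments from `period` to `new_period`.

    marks: analysis pitch marks (sample indices), assumed `period` apart.
    """
    win = hann(2 * period)
    out = [0.0] * (marks[0] + (len(marks) - 1) * new_period + period)
    for k, m in enumerate(marks):
        center = marks[0] + k * new_period  # synthesis pitch mark
        for i in range(2 * period):
            src = m - period + i
            dst = center - period + i
            if 0 <= src < len(signal) and 0 <= dst < len(out):
                out[dst] += win[i] * signal[src]
    return out

period = 20
signal = [math.sin(2.0 * math.pi * n / period) for n in range(200)]
marks = list(range(20, 181, period))

same = overlap_add(signal, marks, period, period)  # unchanged spacing
higher = overlap_add(signal, marks, period, 16)    # shorter period, higher pitch
```

With unchanged spacing the windowed segments sum back to the original signal between the first and last pitch marks. In this sketch a shorter spacing also shortens the output, because every segment is used exactly once; a fuller implementation would repeat or omit segments to control duration independently of pitch.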

Natural Speech Technology (NST) is an EPSRC Programme Grant, involving the Universities of Edinburgh, Cambridge and Sheffield. Its objective is to significantly advance the state-of-the-art in speech technology by making it more natural, approaching human levels of reliability, adaptability and conversational richness. NST starts in May 2011, and has a duration of 5 years.

It is not an easy task to place different synthesis methods into unique classes. Some of the common "labels" are often used to characterize a complete system rather than the model it stands for. A rule-based system using waveform coding is a perfectly possible combination, as is speech coding using a terminal analog or a rule-based diphone system using an articulatory model. In the following pages, synthesis models will be described from two different perspectives: the sound-generating part and the control part of the system.

technology. There are a number of new ideas at all levels of the problem and also a more general sense that a methodology similar to the one that has worked so well in speech recognition research will also raise speech synthesis quality to a new level.


The fundamental frequency, or intonation, contour over the sentence is important for correct prosody and natural-sounding speech. The different contours are usually analyzed from natural speech in specific situations and with specific speaker characteristics, and then expressed as rules that generate the synthetic speech. The fundamental frequency contour can be viewed as the composite set of hierarchical patterns shown in Figure 3.4. The overall contour is generated by the superposition of these patterns (Sagisaka, 1990). Methods for controlling the fundamental frequency contours are described later.
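
The superposition idea can be illustrated with a small sketch in which a sentence-level declination line and local accent bumps are summed to give the overall contour. The linear declination, the triangular accent shape, the addition in Hz, and all numeric values are assumptions chosen for illustration; some models superpose their components on a logarithmic frequency scale instead.

```python
def declination(t, start_hz=120.0, slope_hz_per_s=-10.0):
    """Sentence-level component: F0 falls linearly over time (assumed shape)."""
    return start_hz + slope_hz_per_s * t

def accent(t, center, width, height_hz):
    """Local component: triangular bump of the given half-width around center."""
    d = abs(t - center)
    return height_hz * (1.0 - d / width) if d < width else 0.0

def f0_contour(times, accents):
    """Overall contour as the sum of the declination and all accent bumps."""
    return [declination(t) + sum(accent(t, c, w, h) for c, w, h in accents)
            for t in times]

times = [0.0, 0.5, 1.0]         # seconds
accents = [(0.5, 0.2, 30.0)]    # (center s, half-width s, height Hz)
contour = f0_contour(times, accents)
```

Further levels of the hierarchy, such as phrase-level components, could be added as extra terms in the same sum.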

The traditional source model for the voiced segments has been a simple or double impulse. This is one reason why text-to-speech systems from the 1980s have had serious problems, especially when different voices are modeled. While the male voice has sometimes been regarded as generally acceptable, an improved glottal source will open the way to more realistic synthesis of child and female voices and also to more naturalness and variation in male voices.
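
To make the contrast concrete, the sketch below generates one simple impulse per pitch period and, for comparison, a smooth pulse with a gradual opening phase and a faster closing phase. The smooth shape follows the style of a trigonometric glottal pulse and is chosen only for illustration; it is not the source model of any system discussed here, and the function names and the open and close fractions are assumptions.

```python
import math

def impulse_train(period, n_samples):
    """Traditional voiced source: a single unit impulse per pitch period."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def smooth_pulse(period, open_frac=0.4, close_frac=0.16):
    """One period of a smooth pulse: raised-cosine rise, quarter-cosine fall."""
    n_open = int(round(open_frac * period))
    n_close = int(round(close_frac * period))
    pulse = []
    for n in range(period):
        if n < n_open:
            # Opening phase: rises smoothly from 0 towards 1.
            pulse.append(0.5 * (1.0 - math.cos(math.pi * n / n_open)))
        elif n < n_open + n_close:
            # Closing phase: falls from 1 back towards 0.
            pulse.append(math.cos(0.5 * math.pi * (n - n_open) / n_close))
        else:
            # Closed phase: no flow.
            pulse.append(0.0)
    return pulse

def pulse_train(period, n_samples):
    """Repeat the smooth pulse every pitch period."""
    one = smooth_pulse(period)
    return [one[n % period] for n in range(n_samples)]

period = 100  # samples, e.g. 100 Hz at an assumed 10 kHz sampling rate
impulses = impulse_train(period, 300)
glottal = pulse_train(period, 300)
```

Varying the open and close fractions changes the pulse shape from period to period or from voice to voice, which a bare impulse cannot express.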