The results were presented in Rabiner's 1964 MIT dissertation and also described in a 1968 article. This system used a technique of type 5 (rule-generated time functions of acoustic parameters), with a tinge of type 6 (rule-generated articulatory kinematics) in the control of fundamental frequency, based on a concept of subglottal pressure as the crucial variable. Its input was of type 6, consisting of a string of phonemic symbols with stress indications and marks for word boundaries and pauses; thus, it accomplished "speech synthesis proper," with no text analysis component.

A similar process has characterized the improvements in speech synthesis proper, the production of sound from a given phonological string. Today's systems are still based on the same general strategy of phonological units sequenced in time. However, the inventory of units is much larger, each unit typically involving two, three, or more phonetic segments, either as distinguishing context for the unit or as part of the unit itself. Often, the internal structure of each unit is much more elaborate, sometimes including an entire stretch of fully specified speech. The timing rules distinguish many more cases, and the procedures for selecting units, combining them, and establishing their time patterns are often quite complex. Between larger tables of units and more complex combination rules, today's systems simply incorporate much more information than Rabiner's system did. Measured in terms of the size in bits of the programs and tables, today's systems are probably two to three orders of magnitude larger.
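As a rough illustration of context-dependent units sequenced in time, the following Python sketch looks up each phone with as much surrounding context as a toy inventory provides, falls back to less context when necessary, and lays the chosen units end to end. The inventory, the phone symbols, the unit names, and the durations are all hypothetical; the sketch is not a description of any particular system.

```python
from typing import Dict, List, Optional, Tuple

# A unit is keyed by (left context, phone, right context); None means "any".
Key = Tuple[Optional[str], str, Optional[str]]

# Hypothetical toy inventory: unit name and duration in milliseconds.
INVENTORY: Dict[Key, Tuple[str, int]] = {
    ("k", "ae", "t"): ("k-ae+t", 120),   # phone with both neighbours as context
    (None, "ae", "t"): ("ae+t", 110),    # phone with right context only
    (None, "k", None): ("k", 60),        # context-free fallbacks
    (None, "ae", None): ("ae", 100),
    (None, "t", None): ("t", 70),
}


def select_unit(left: Optional[str], phone: str, right: Optional[str]) -> Tuple[str, int]:
    """Prefer the most specific context, then back off to less context."""
    candidates = (
        (left, phone, right),
        (None, phone, right),
        (left, phone, None),
        (None, phone, None),
    )
    for key in candidates:
        if key in INVENTORY:
            return INVENTORY[key]
    raise KeyError(f"no unit for phone {phone!r}")


def plan(phones: List[str]) -> List[Tuple[str, int, int]]:
    """Return (unit name, start ms, end ms) for each phone, in sequence."""
    timeline = []
    t = 0
    for i, phone in enumerate(phones):
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i + 1 < len(phones) else None
        name, duration = select_unit(left, phone, right)
        timeline.append((name, t, t + duration))
        t += duration
    return timeline


if __name__ == "__main__":
    for entry in plan(["k", "ae", "t"]):
        print(entry)
```

A larger inventory or richer timing rules would change only the table and the back-off order here, which is one way to picture how the information content of such systems can grow without changing the general strategy.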

As this research goes forward, it faces some pointed questions. What will it take to make synthetic speech that sounds entirely natural, or at least better than word concatenation voice response systems for restricted phrase types such as name and address sequences? Will progress come by a scientific route, through better modeling of human speech production, or by an engineering route, through larger inventories of prerecorded elements with optimal automatic selection and combination methods? How far can we push current ideas about text analysis algorithms? How can we produce more natural-sounding modulation of pitch, amplitude, and timing, and how important are such prosodic improvements relative to segmental improvements?

In Finnish, text preprocessing is in general easier but also presents some specific difficulties. The expansion of numerals and ordinals in particular may be even more difficult than in other languages, because Finnish has several grammatical cases formed with different suffixes. The first two ordinals must be expanded differently in some cases, and with larger numbers the expansion may become rather complex. Digits, Roman numerals, dates, and abbreviations pose the same kinds of difficulties as in other languages. For example, for the Roman numerals I and III there are at least three possible conversions. Some examples of the most difficult abbreviations are given in Table 4.1. In most cases, the correct conversion can be inferred from the type of the accompanying characters or from other accompanying information. To avoid misconversions, however, some abbreviations must be spelled letter by letter.

Text preprocessing is usually a very complex task and includes several language-dependent problems (Sproat 1996). Digits and numerals must be expanded into full words. For example, in English the numeral 243 would be expanded as "two hundred and forty-three" and 1750 as "seventeen fifty" (if a year) or "one thousand seven hundred and fifty" (if a measure). Related cases include the distinction between and . Fractions and dates are also problematic: 5/16 can be expanded as "five sixteenths" (if a fraction) or "May sixteenth" (if a date). The expansion of ordinal numbers has also been found problematic. The first three ordinals must be expanded differently from the others: 1st as "first", 2nd as "second", and 3rd as "third". The same kinds of contextual problems arise with Roman numerals. Chapter III should be expanded as "Chapter three" and Henry III as "Henry the third", and I may be either a pronoun or a number. Roman numerals may also be confused with some common abbreviations, such as MCM. Numbers may also have special forms of expression, such as 22 spoken as "double two" in telephone numbers, or 1-0 in sports scores.
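The context dependence described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete normalizer: the function names, the context labels ("year", "measure", "fraction", "date"), the small list of heading words, and the choice of British-style "and" in cardinals are all assumptions made for the example. Deciding which context applies, which is the hard part in practice, is left to the caller.

```python
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()
MONTHS = ("January February March April May June July August September "
          "October November December").split()
IRREGULAR = {"one": "first", "two": "second", "three": "third",
             "five": "fifth", "eight": "eighth", "nine": "ninth",
             "twelve": "twelfth"}
HEADINGS = {"chapter", "section", "part", "volume"}  # assumed heading words
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def cardinal(n):
    """Spell out 0 <= n < 10000, with 'and' after hundreds (an assumed style)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" and " + cardinal(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    out = ONES[thousands] + " thousand"
    if rest == 0:
        return out
    return out + (" and " if rest < 100 else " ") + cardinal(rest)


def ordinal(n):
    """Turn the last word of the cardinal into its ordinal form."""
    words = cardinal(n)
    cut = max(words.rfind(" "), words.rfind("-")) + 1
    head, last = words[:cut], words[cut:]
    if last in IRREGULAR:
        last = IRREGULAR[last]
    elif last.endswith("y"):
        last = last[:-1] + "ieth"
    else:
        last += "th"
    return head + last


def expand_number(token, kind):
    """Expand a four-digit token as a 'year' (digit pairs) or a 'measure'."""
    n = int(token)
    if kind != "year":
        return cardinal(n)
    hi, lo = divmod(n, 100)
    if lo == 0:
        return cardinal(n) if hi % 10 == 0 else cardinal(hi) + " hundred"
    if lo < 10:
        return cardinal(hi) + " oh " + ONES[lo]
    return cardinal(hi) + " " + cardinal(lo)


def expand_slash(token, kind):
    """Expand 'a/b' as a 'fraction' or as a month/day 'date'."""
    a, b = map(int, token.split("/"))
    if kind == "date":
        return MONTHS[a - 1] + " " + ordinal(b)
    # Halves and quarters would need special cases that are omitted here.
    return cardinal(a) + " " + ordinal(b) + ("" if a == 1 else "s")


def roman_to_int(s):
    """Convert a Roman numeral using the usual subtractive rule."""
    total = 0
    for ch, nxt in zip(s, s[1:] + " "):
        value = ROMAN[ch]
        total += -value if nxt in ROMAN and ROMAN[nxt] > value else value
    return total


def expand_roman(prev_word, numeral):
    """After a heading word read a cardinal; otherwise assume a regnal ordinal."""
    n = roman_to_int(numeral)
    if prev_word.lower() in HEADINGS:
        return prev_word + " " + cardinal(n)
    return prev_word + " the " + ordinal(n)


if __name__ == "__main__":
    print(expand_number("1750", "year"), "|", expand_number("1750", "measure"))
    print(expand_slash("5/16", "fraction"), "|", expand_slash("5/16", "date"))
    print(expand_roman("Chapter", "III"), "|", expand_roman("Henry", "III"))
```

The sketch makes the point of the paragraph concrete: the same token maps to different word strings, and the expansion routines themselves are simple once the context label is known. Cases such as a sentence-initial I, or MCM as an abbreviation, would need a further disambiguation step that is not attempted here.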

Most commercial applications so far have been of type 1 or 2. Classical "text-to-speech" systems are of type 4 and/or 6. Ultimate human-computer interaction systems are likely to be of type 5, with a bit of 4. Many of the people closely involved in applying speech synthesis technology think that the most promising current opportunities are of type 3. Note that choosing such restricted-domain applications has been crucial to the success of computer speech recognition.