gnuspeech project history

Brief gnuspeech History

The current state and a little history of the project are as follows. The descriptions also provide a reference for the original system components that are available in the NeXT source archive in the project SVN repository, and help to convey the scope of gnuspeech.

Monet:
Monet is the interactive language database creation and testing tool used to create the original databases for English text-to-speech conversion using the new articulatory model of the vocal tract (the TRM). The ported version of Monet translates symbolic or punctuated text input strings into the digital waveform needed to “speak” the input text (since it now also uses the Parser, shown only as part of the TextToSpeech Server in the overview figure on the project Home Page). It also provides interactive facilities to create and modify the databases (rules, posture specifications, and the like) that are used by the TextToSpeech Server. Finally, it provides facilities for experimenting with and controlling the parameters that manage the intonation, and allows the intonation contour to be edited. Although the code and interfaces for the database editing and intonation experiments exist, this prime function of Monet is not yet fully operational; completing it is the most important outstanding task, since it supports Monet’s real purpose. The work mostly involves adding the file writing. The speech output and parameter displays, as well as the text parsing and the application of the rhythm and intonation rules, are complete. Monet was originally designed and developed by Craig Schock to create the databases needed to support David Hill’s “event-based” approach to speech synthesis, with testing and suggestions for improvements by David Hill and Leonard Manzara. It was the essential in-house tool for the original Trillium development effort. A valuable enhancement of Monet would be a meta-layer that allows TRM posture data to be specified in terms of linguistically oriented parameters such as lip opening, jaw rotation, and tongue height, and that then translates these high-level parameters into the current “low-level” TRM section radius data.
The Tube Resonance Model (TRM) (and associated TRM Control Model):
This is a ‘C’ implementation of the tube model that forms the core of the synthesis system. It was created by Leonard Manzara, who also ported it to the DSP56001 signal processor and made it run in real time on the NeXT. It is based on work by Perry Cook and Julius Smith at the Stanford University Center for Computer Research in Music and Acoustics (CCRMA). The TRM and the associated TRM Control Model are both fully ported and working, and support Monet and the TextToSpeech Server. The TRM without the TRM Control Model is used by TRAcT. The TRM comprises just ten oral tube sections, with a splitting junction at the velum, and five nasal sections. The TRM Control Model configures the ten oral sections into the eight regions of different length required by the DRM (Distinctive Region Model). This is an approximation, but a close and very effective one. Thirty regions (requiring considerably more computation) could be grouped into an exact rendition of the DRM. A separate description of the TRM is available.
TRAcT (originally named Synthesizer):
This is not a text-to-speech synthesiser! It is a GUI application that allows a user, usually a language developer or someone interested in the behaviour of the TRM, to interact directly with the TRM, listen to the output under different static conditions, and analyse the output. It was an important tool in developing the databases for Trillium’s original English text-to-speech system because it allowed the tube configurations needed to define the speech “postures” (of the vocal tract) to be explored and finalised. Although it has built-in analysis and display features, it was also used in conjunction with a Kay Sona-Graph spectrum analyser in order to compare the spectral analyses of sounds produced by proposed “postures” with what was seen in natural speech. The Sona-Graph was also used to check the synthetic speech output from Monet against the same utterances in natural speech. TRAcT has been mostly ported to the Mac under OS X by David Hill. Not all the display functions are working and some clean-up is needed, but it is basically functional. The original version of TRAcT (Synthesizer) was created for the NeXT by Leonard Manzara.
PrEditor:
PrEditor is an application to allow users to create and maintain their own dictionaries. The original TextToSpeech Kit looks up several dictionaries in the order User → Application → Main. PrEditor allows the User and Application dictionaries to be created and maintained by the user or application developer respectively. An initial port was begun by Eric Zoerner but is not yet functional. The original PrEditor on the NeXT was written by Vince DeMarco and David Marwood, documented by Leonard Manzara and later upgraded by Michael Forbes.
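The User → Application → Main search order described above can be sketched as a simple fall-through lookup. This is only an illustration: the function name `look_up`, the in-memory dictionaries, and the placeholder pronunciation strings are all hypothetical and do not reflect the actual TextToSpeech Kit API or dictionary file format.

```python
from typing import Mapping, Optional, Sequence, Tuple

# Hypothetical stand-ins for the three dictionaries; the real entries
# and their pronunciation notation are not reproduced here.
DictionaryChain = Sequence[Tuple[str, Mapping[str, str]]]


def look_up(word: str, chain: DictionaryChain) -> Optional[Tuple[str, str]]:
    """Return (dictionary name, pronunciation) from the first dictionary
    in the chain that contains the word, or None if no dictionary does."""
    key = word.lower()
    for name, entries in chain:
        pronunciation = entries.get(key)
        if pronunciation is not None:
            return name, pronunciation
    return None


# Search order as described in the text: User, then Application, then Main.
chain = [
    ("User", {"monet": "user-pronunciation"}),
    ("Application", {"monet": "app-pronunciation", "tube": "app-tube"}),
    ("Main", {"monet": "main-pronunciation", "speech": "main-speech"}),
]
```

With this ordering, an entry in the User dictionary overrides the same word in the Application and Main dictionaries, which is presumably why PrEditor only needs to maintain the first two.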
The Main Dictionary:
This has not really changed since the original NeXT implementation and is used by the Parser (and hence is also used indirectly by Monet). Its pronunciation is a compromise between British (RP) English (mainly the vowels and related features) and General American (especially the rhotic “r” sound). It includes more than 70,000 words, plus facilities for creating and checking derivatives such as plurals and adverbs, plus information concerning word stress and part of speech. The part-of-speech information is not used, but is included to allow future development of grammatical processing. The main dictionary was compiled mainly by David Hill, after a preliminary version plus creation tools were set up by Craig Schock. It would be well worth creating two versions, one for British English and the other for American English, though the current combined version is surprisingly acceptable, especially given the excellent rhythm and intonation.
BigMouth:
This is an application that allows text-to-speech to be tried out without reference to any particular application on the NeXT and also drives the “Speech Service”. There is a speech service on the OS X version, which has to be installed and is activated through the OS X “Services Preferences …” menu item under the [Application Name] → Services menu in the top menu bar. It uses the TextToSpeech Server. The original source for BigMouth was created by Leonard Manzara. There was another applet with a similar name on the NeXT that has nothing whatever to do with gnuspeech.
The TextToSpeech Server (TTS Server):
The original NeXT-based TextToSpeech Kit came in three versions: the User Kit, which simply provided speech output as a service available to any application; the Developer Kit, which provided the means to incorporate speech into applications directly; and the Experimenter Kit, which allowed full access to all the tools used by Trillium in developing language databases, including dictionaries. All of these used the TextToSpeech Server for the actual conversion of text to speech output. This real-time task was made possible on the NeXT, which is relatively slow by today’s standards, by using the built-in DSP (a Motorola DSP56001). In the Mac OS X and GNU/Linux GNUStep versions of Monet and Synthesizer, the host computer performs all the computation, as CPU speeds are at least two orders of magnitude faster than those of the old NeXT. The use of the DSP on the NeXT gave a clear separation between the text-to-speech tasks associated with creating the event framework for synthesis and those associated with transforming the event framework into the digital speech waveform and outputting it; the latter tasks were carried out by the DSP-based version of the TRM and TRM Control Model. Thus the TRM ran on the DSP in real time and communicated by DMA access. There was also a ‘C’ version of the tube model, which could not run in real time, that we called the Software Synthesiser. It was useful for producing a slightly higher quality of speech, since it did not have to be squeezed into the DSP and rigorously optimised because of the marginal ability (even on the DSP) to run in real time. This ‘C’ version of the TRM forms the basis of the current ports, which is possible now because of the greatly increased processor speeds over the last 20 years. The TextToSpeech Server is complete.

No database creation and manipulation components or interactive interfaces are provided for the TextToSpeech Server itself. Those are only appropriate for Monet and other applications that use it. However, provision is made to set the parameters that control static aspects of the synthesis (tube length, mean pitch, and so on: the so-called “utterance-rate parameters”). These static parameters are normally held in a system library as a “defaults database”. This refinement is not yet included in the ports but is a function of ServerTest (see below). The TextToSpeech Server computes the event framework from the input text via the intermediate input syntax produced by the Parser. This pre-processing includes dictionary look-up to get the correct pronunciation. There is no significant parsing in terms of normal English grammar, and no attempt is made to determine meaning (which would allow different pronunciations of words with the same spelling to be disambiguated, and would allow slightly more accurate rhythm and intonation to be generated). Such abilities should eventually be added. The word stress information from the dictionary is used to help determine the rhythmic framework according to the Jones/Abercrombie/Halliday “tendency-towards-isochrony” theory of British English speech, by placing “foot” boundaries before the word stress in words that have word-stressed syllables. “Isochrony” implies rhythmic units of equal length but, in practice, there is considerable difference in their length, depending on the stress and the number of postures involved. However, longer stressed rhythmic units are found to be somewhat shorter than would be expected from their component postures. This is what is meant by the “isochrony effect”. The normal punctuation is also used in this process, which allows a distinction to be made between statements, emphatic statements, questions, and questions expecting a yes/no answer, for the purpose of selecting different intonation contours.
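As a rough illustration of the foot-boundary placement described above, the following sketch groups a sequence of syllables into feet by starting a new foot at every word-stressed syllable. It is a minimal sketch under stated assumptions, not the TextToSpeech Server's actual algorithm: the `Syllable` representation, the function name `split_into_feet`, and the treatment of leading unstressed syllables as an initial incomplete foot are all hypothetical choices made for the example.

```python
from typing import List, Tuple

# Hypothetical representation: (syllable text, carries word stress?)
Syllable = Tuple[str, bool]


def split_into_feet(syllables: List[Syllable]) -> List[List[str]]:
    """Group syllables into feet, placing a boundary before each
    word-stressed syllable. Unstressed syllables before the first
    stress are collected into an initial incomplete foot (an
    assumption made for this sketch)."""
    feet: List[List[str]] = []
    for text, stressed in syllables:
        if stressed or not feet:
            feet.append([])  # open a new foot
        feet[-1].append(text)
    return feet


# "the synthesis of speech", with stress marked on "syn" and "speech"
example = [
    ("the", False),
    ("syn", True), ("the", False), ("sis", False),
    ("of", False),
    ("speech", True),
]
feet = split_into_feet(example)
# feet == [["the"], ["syn", "the", "sis", "of"], ["speech"]]
```

The sketch only decides where the boundaries fall. The "isochrony effect" itself, in which longer feet come out somewhat shorter than the sum of their postures would suggest, would be applied afterwards when durations are assigned and is not modelled here.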
The rhythm and intonation models are very effective, producing speech that is pleasant to listen to. However, without using knowledge of meaning, it is hard to decide on the placement of the tonic (information point) of the phrase or sentence, or to disambiguate words which are spelled the same but pronounced differently. As a default, the tonic is placed in phrase/sentence-final position. This causes some undesirable compromise of the speech rhythm and intonation when it is wrong. Some measure of grammatical analysis and understanding would be a most effective refinement of the current system. Work on the rhythm and intonation was carried out by a number of contributors, including David Hill, Ian Witten, Neal Reid, and Wiktor Jassem, based on the published work of others, as well as on in-house experiments.
There's a diagram of the relationships between the various TTS components of the complete system on the project Home Page.
ServerTest and ServerTestPlus:
These applications were designed to allow the TextToSpeech Server to be tested and, in the case of ServerTestPlus, to provide certain “hidden” methods that were restricted to Trillium's “in-house” use. Now that the whole system is available under the GPL, the restricted “ServerTest” version is obsolete and the name ServerTest will refer to a reimplementation of ServerTestPlus. For example, one of the 18 originally hidden methods allowed plain text to be converted into the intermediate Monet input syntax. That method was originally hidden to keep the main dictionary material proprietary, as it could have been used to completely decode the encoded dictionary. The ported version of gnuspeech incorporates use of the Parser in various accessible roles.
WhosOnFirst:
WhosOnFirst was the first publicly available software associated with the Trillium TextToSpeech system and was designed as a bit of a teaser. As issued, it provided an indication, on the NeXT console, of remote logins. It also told users that, if they had the Trillium TextToSpeech system, they could get voice alerts not only for remote logins but also for other system activity such as application launches. WhosOnFirst was written by Craig Schock and was instrumental in catching and identifying a hacker trying to break into our system soon after it was set up. WhosOnFirst has not yet been ported.
say:
A command line interface to the TextToSpeech Server that can be used from a terminal or in shell scripts. It was written by Craig Schock and has not been ported yet, though there is a similar facility for the GNU/Linux GNUStep version.
SpeechManager:
The SpeechManager was provided to allow the TextToSpeech Server operating system parameters to be optimised for different systems, because no particular setting of priorities, initial silence fill, and so on, could be right for all systems. In particular, in networked systems, or systems with a high compute load from other tasks, the speech would sometimes crackle due to interference from other tasks. The SpeechManager, which could only be run as root, allowed the TextToSpeech Server to be restarted, and the various parameters controlling priority and so on to be set to new values to avoid crackling whilst minimising the use of system resources. These functions are almost certainly obsolete these days, given the increased compute-power available. Some functions (such as reporting the version of the main dictionary in use, or restarting the TextToSpeech Server) may still be required in some form. The SpeechManager was written by Craig Schock. It has not been ported.
SpeechRegistrar:
An applet that was provided to allow any of the TextToSpeech Kits to be registered, using a password, and was run under the root account. The original function is now obsolete, but may be useful, in revised form, as a way of building user groups for the ported system. It was written by Craig Schock. It has not been ported.
TrilliumSoundEditor:
The TrilliumSoundEditor is a speech editor and analysis program intended to provide a more versatile replacement for the publicly available Sonagram program written by Hiroshi Momose. Although TrilliumSoundEditor was never completely finished, it provided the basic spectrographic analysis functionality required for speech development and could be finished, upgraded, or ported at some point in the future. The program was written by Craig Schock. None of the TrilliumSoundEditor has yet been ported, but the source is available. With the advent of Praat (see the Monet manual in the initial distribution of gnuspeech), the TrilliumSoundEditor is probably redundant.

In summary, much of the core software has been, and some is being, ported to the Mac under OS X and to GNU/Linux under GNUStep. All sources and builds for the current work are in the Git repository, with older material in the SVN repository under three branches (for the NeXT, Mac OS X, and GNU/Linux under GNUStep versions; see below). Speech may be produced from input text. The development facilities for managing and creating new language databases, or for modifying the existing English database for text-to-speech, mainly lack the file-writing components. The gnuspeech facilities also provide the tools needed for psychophysical and linguistic experiments. TRAcT, which gives direct access to the tube model, is functional, although a few of the logarithmic data displays remain to be finished and clean-up is needed. Some accessory tools are available. As well as the acknowledgements above, Greg Casamento, Adam Fedor, and the Savannah Hackers provided valuable support in getting the gnuspeech project established, as well as initial work that facilitated the port, including making ubiquitous and tedious changes to the entire NeXT source code to bring it up to OpenStep standards. This work and support is gratefully acknowledged. It involved a lot of effort, is largely invisible to all but the developers involved, and made the actual port to OS X and GNUStep much less painful.

Last modified: Sun Oct 18 20:31:41 PDT 2015