

SpeechHub structure overview

SpeechHub is a set of components that implement both client / server and self-contained speech functionality. For brevity, SpeechHub is abbreviated to SH.

SH receives text from its client or host and, after optional processing, sends it to a synthesizer for conversion to audio. By default, SH expects the synthesizer to send the audio data back, so that SH can optionally process it to improve response time before passing it to the computer's audio system. Alternatively, SH can be configured to let the synthesizer send the audio directly to the computer's audio system. As a further option, SH can return the audio received from the synthesizer to its client or host, which then handles delivery to the computer's audio system.
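The three audio delivery options described above can be summarised in a minimal Python sketch. The names `AudioRoute` and `audio_path` are hypothetical and chosen only for illustration; they are not taken from the SH source code or its configuration.

```python
from enum import Enum
from typing import List


class AudioRoute(Enum):
    """Hypothetical labels for the three audio delivery options."""
    VIA_SH = "via_sh"              # synthesizer -> SH -> audio system (default)
    SYNTH_DIRECT = "synth_direct"  # synthesizer -> audio system
    VIA_CLIENT = "via_client"      # synthesizer -> SH -> client or host


def audio_path(route: AudioRoute) -> List[str]:
    """Return the hops the audio takes for the given route."""
    if route is AudioRoute.VIA_SH:
        return ["synthesizer", "SH", "audio system"]
    if route is AudioRoute.SYNTH_DIRECT:
        return ["synthesizer", "audio system"]
    # Remaining case: the client or host delivers the audio itself.
    return ["synthesizer", "SH", "client or host", "audio system"]
```

Only in the first and third routes does the audio pass through SH, so only there can SH apply its own audio processing.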

In a client / server implementation, a client such as a screen reader or a self-voicing application communicates with SH over one of several optional transports, such as TCP/IP, stdio or Unix sockets. Client / server communication uses what we term Enhanced SSIP (ESSIP). SSIP (Speech Synthesis Interface Protocol) is the protocol used by the SpeechDispatcher speech server in Linux. ESSIP differs from the original in a number of ways: it uses only the subset of SSIP commands we found necessary for efficient use of SH, it extends some SSIP commands to cover options not offered by the original protocol, and it adds a range of new commands to provide more functionality.
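As background, the sketch below shows how the original SSIP frames data on the wire: each command line ends with CR LF, and the message text that follows a SPEAK command is closed by a line containing a single dot, with an extra dot prepended to any text line that starts with a dot. Whether ESSIP keeps this exact framing, or the SPEAK command itself, is an assumption made for illustration only; the function names are hypothetical.

```python
def frame_command(command: str) -> bytes:
    """Terminate a single command line with CR LF, as SSIP does."""
    return command.encode("utf-8") + b"\r\n"


def frame_message_text(text: str) -> bytes:
    """Encode the text block that follows a SPEAK command in SSIP.

    Lines starting with a dot get an extra dot prepended, and the
    block is closed by a line containing only a dot.
    """
    lines = text.splitlines()
    escaped = ["." + line if line.startswith(".") else line for line in lines]
    return "\r\n".join(escaped).encode("utf-8") + b"\r\n.\r\n"
```

Because the functions only produce bytes, the same framing can be written to a TCP socket, a Unix socket or a stdio pipe without change.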

For simplicity, the internal operation of SH can be likened to a series of short pipes, called modules, which are joined together to form a long pipe called a module stack. Each module provides a different function. Text envelopes are inserted at one end of the module stack and come out at the other end fully processed. A stack is assembled from individual modules according to the functionality required, which is determined by the client's requirements, the synthesizer's capabilities and global settings.
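The module stack idea can be illustrated with a minimal Python sketch. The `Envelope` class, the two example modules and the `build_stack` and `run_stack` helpers are all hypothetical; the real SH modules, their names and their interfaces may differ.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple


@dataclass
class Envelope:
    """Hypothetical text envelope travelling through the stack."""
    text: str
    notes: List[str] = field(default_factory=list)


Module = Callable[[Envelope], Envelope]


def normalize_whitespace(env: Envelope) -> Envelope:
    """Example module: collapse runs of whitespace."""
    env.text = " ".join(env.text.split())
    env.notes.append("normalize_whitespace")
    return env


def expand_abbreviations(env: Envelope) -> Envelope:
    """Example module: spell out one abbreviation."""
    env.text = env.text.replace("TTS", "text to speech")
    env.notes.append("expand_abbreviations")
    return env


# Modules in the fixed order in which they may appear in a stack.
AVAILABLE: List[Tuple[str, Module]] = [
    ("normalize_whitespace", normalize_whitespace),
    ("expand_abbreviations", expand_abbreviations),
]


def build_stack(required: Set[str]) -> List[Module]:
    """Assemble a stack from only the modules that are required."""
    return [module for name, module in AVAILABLE if name in required]


def run_stack(stack: List[Module], env: Envelope) -> Envelope:
    """Push an envelope in at one end and collect it at the other."""
    for module in stack:
        env = module(env)
    return env
```

In this sketch the `required` set stands in for the combination of client requirements, synthesizer capabilities and global settings that decides which modules end up in the stack.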

SH uses the concept of an engine. An engine is a text-to-speech (TTS) system. An engine can connect to a single synthesizer, as in the case of MaryTTS, or to a number of synthesizers handled by the same system, as in the case of Microsoft Speech API version 5 (SAPI5). The module stack described above connects to an engine which, in turn, uses a connector to reach the actual synthesizer through a driver. Drivers are completely separate programs from SH and can be written in any programming language.
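The chain from engine to connector to driver can be sketched as follows. This is only an illustration under stated assumptions: the `Engine` and `Connector` classes, the idea of passing the synthesizer name as a command-line argument, and the use of stdin and stdout to exchange data with the driver are all hypothetical and do not describe the actual SH driver interface. The stand-in driver is a one-line Python program that echoes its input, used so the example runs without a real synthesizer.

```python
import subprocess
import sys
from typing import List


class Connector:
    """Hypothetical connector that runs a driver as a separate program."""

    def __init__(self, driver_cmd: List[str]):
        self.driver_cmd = driver_cmd

    def synthesize(self, synthesizer: str, text: str) -> bytes:
        # Assumption: text goes in on stdin, audio bytes come out on stdout.
        result = subprocess.run(
            self.driver_cmd + [synthesizer],
            input=text.encode("utf-8"),
            stdout=subprocess.PIPE,
            check=True,
        )
        return result.stdout


class Engine:
    """Hypothetical engine exposing one or more synthesizers."""

    def __init__(self, name: str, connector: Connector, synthesizers: List[str]):
        self.name = name
        self.connector = connector
        self.synthesizers = synthesizers

    def speak(self, synthesizer: str, text: str) -> bytes:
        if synthesizer not in self.synthesizers:
            raise ValueError(f"{self.name} has no synthesizer {synthesizer!r}")
        return self.connector.synthesize(synthesizer, text)


# Stand-in driver: prefixes its input with the synthesizer name.
FAKE_DRIVER = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write("
    "sys.argv[1].encode() + b':' + sys.stdin.buffer.read())",
]
```

Because the connector only launches a command and exchanges bytes with it, the driver behind it could be written in any language, which matches the separation described above.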

