Speech synthesis programs convert written input into spoken output
by generating synthetic speech. This process is commonly called
text-to-speech (TTS) conversion.
There are several ways to perform speech synthesis:
1. Record the voice of a person saying the required phrases.
2. Use algorithms that split speech into smaller pieces. Often
speech is split into an inventory of 35-50 phonemes (the smallest
linguistic units). Quality suffers, however, because it is difficult
to join the pieces back into a fluent speech pattern.
3. Use diphones, the most developed method. Diphones split speech
not at the transitions between phonemes but at the centers of the
phonemes, which leaves the transitions intact. This results in about
400 separate usable elements and a better-quality product.
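The difference between cutting at phoneme boundaries and cutting at
phoneme centers can be illustrated with a short Python sketch. The
phoneme symbols, the "_" silence marker, and the function name
to_diphones are assumptions made for illustration and do not come
from any particular TTS product.

```python
def to_diphones(phonemes, silence="_"):
    """Pair each phoneme with its successor, padding with silence at both ends.

    Each pair stands for a unit that runs from the center of one phoneme
    to the center of the next, so the transition lies inside the unit.
    """
    padded = [silence] + list(phonemes) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# Hypothetical phoneme transcription of the word "cat".
units = to_diphones(["k", "ae", "t"])
print(units)
```

A sequence of N phonemes yields N + 1 units once silence is added at
both ends.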
The methods above are known as concatenative processes. Concatenative
TTS joins recorded human speech, stored as wave files, into the
requested utterance. These systems can be large and require a lot of
drive space, but they offer more natural-sounding output.
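A minimal sketch of the concatenative idea, assuming each unit has
already been recorded and stored as a list of audio samples. The unit
names are arbitrary labels, and the inventory contents, sample values,
and the name concatenate_units are hypothetical; a real system would
load wave files and smooth the joins between units.

```python
def concatenate_units(unit_names, inventory):
    """Join the stored samples for each named unit into one waveform."""
    samples = []
    for name in unit_names:
        if name not in inventory:
            raise KeyError(f"no recording for unit {name!r}")
        samples.extend(inventory[name])
    return samples


# Hypothetical inventory: unit name -> recorded samples (made-up values).
inventory = {
    "_-k": [0, 3, 5],
    "k-ae": [7, 9],
    "ae-t": [4, 2],
    "t-_": [1, 0],
}

waveform = concatenate_units(["_-k", "k-ae", "ae-t", "t-_"], inventory)
print(waveform)
```

Storage grows with the number and length of the recorded units, which
is where the drive-space cost of concatenative systems comes from.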
Another method, synthesized TTS, creates speech by generating the
sounds digitally rather than playing back recordings. The output
sounds more like a computer than a human, but the system can run in
just a few megabytes of space.
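A generated sound needs no stored recordings at all. The Python sketch
below computes a short tone by summing a few sine waves. The
frequencies, the sample rate, and the name synthesize_tone are
illustrative assumptions, and the result is far simpler than what a
real synthesizer produces.

```python
import math


def synthesize_tone(frequencies, duration_s=0.01, sample_rate=8000):
    """Generate samples by summing sine waves at the given frequencies (Hz)."""
    n_samples = round(duration_s * sample_rate)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate
        value = sum(math.sin(2 * math.pi * f * t) for f in frequencies)
        samples.append(value / len(frequencies))  # keep values within [-1, 1]
    return samples


# Illustrative frequencies only; not measured from real speech.
tone = synthesize_tone([700, 1200, 2600])
print(len(tone))
```

Because every sample is computed from a formula, the only storage
needed is the program and its parameters.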
Products, whether concatenative or synthesized, are usually measured
by their intelligibility, naturalness, and text preprocessing
capabilities (the ability to convert acronyms into normal speech).
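Acronym conversion can be sketched as a lookup step that runs before
synthesis. The acronym table, the rule that any run of two or more
capital letters is an acronym, and the name expand_acronyms are all
hypothetical; real products use much larger rule sets.

```python
import re

# Hypothetical table of acronyms and their spoken forms.
ACRONYMS = {
    "TTS": "text to speech",
    "MB": "megabytes",
}


def expand_acronyms(text, table=ACRONYMS):
    """Replace known acronyms with spoken forms; spell out unknown ones."""
    def replace(match):
        word = match.group(0)
        if word in table:
            return table[word]
        return " ".join(word)  # e.g. "XYZ" becomes "X Y Z"

    # Treat any run of two or more capital letters as an acronym.
    return re.sub(r"\b[A-Z]{2,}\b", replace, text)


print(expand_acronyms("A TTS engine needs 5 MB and an XYZ card."))
```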