What can we do if we want to grant the power of speech to our machines and tools?
"Speech is the mirror of the soul; as a man speaks, so is he" - Publilius Syrus, Roman author, 1st century B.C.
We know there is something special about speech. Our voices are not just a means of communicating, superb though they are at that; they also give a deep impression of who we are. They can betray our upbringing, our emotional state and our state of health. They can be used to persuade and convince, to calm and to excite. So, what can we do if a person loses the power of speech? What can we do if we want to grant the power of speech to our machines and tools? The answers lie with speech synthesis technology.
Speech synthesis has progressed enormously since the trademark Stephen Hawking voice, which was based on synthesis developed in the mid-eighties. Good-quality synthetic voices (along with some startlingly poor ones!) now come free with Apple Leopard and Microsoft Vista. These voices have a neutral speaking style and are examples of unit selection, or concatenative, synthesis. In simple terms, the synthetic speech is made by taking lots of small pieces of speech from recordings of a human voice and sticking them together to create the required series of sounds, intonation and voice quality for a new message. Such synthesis systems have four main components: a large database of recordings, on the order of 3-5 hours of speech; a set of features that describe a new phrase or sentence; a search algorithm that finds the pieces of speech in the database that best match these features; and a method to smoothly glue these pieces together to produce the new phrase.
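The four components described above can be illustrated with a toy sketch. It is not how any commercial system is implemented: the `Unit` class, the single pitch feature, the two cost functions and the crossfade length are all assumptions made up for this example. The search is a standard dynamic-programming pass that picks, for each wanted sound, the recorded piece that both matches the requested pitch and sits well next to its neighbours.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    phone: str      # sound label carried by this recorded piece
    pitch: float    # average pitch in Hz (the only feature in this toy)
    samples: list   # audio samples for this piece (toy values)


def target_cost(unit, wanted_pitch):
    # How far the recorded piece is from what the new sentence asks for.
    return abs(unit.pitch - wanted_pitch)


def join_cost(left, right):
    # How audible the seam between two neighbouring pieces is likely to be.
    return abs(left.pitch - right.pitch)


def select_units(database, targets):
    """targets is a list of (phone, wanted_pitch); returns the cheapest unit sequence."""
    if not targets:
        return []
    # One column of candidate units per target position.
    columns = [[u for u in database if u.phone == phone] for phone, _ in targets]
    if any(not col for col in columns):
        raise ValueError("database has no unit for some target phone")
    # best[i][j] = (cost of the cheapest path ending at columns[i][j], back-pointer)
    best = [[(target_cost(u, targets[0][1]), None) for u in columns[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in columns[i]:
            prev_cost, prev_j = min(
                (best[i - 1][j][0] + join_cost(p, u), j)
                for j, p in enumerate(columns[i - 1])
            )
            row.append((prev_cost + target_cost(u, targets[i][1]), prev_j))
        best.append(row)
    # Trace the cheapest path back from the last column.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(columns[i][j])
        j = best[i][j][1]
    path.reverse()
    return path


def crossfade_join(units, overlap=2):
    """Glue units together, linearly blending `overlap` samples at each seam."""
    if not units:
        return []
    out = list(units[0].samples)
    for u in units[1:]:
        n = min(overlap, len(out), len(u.samples))
        for k in range(n):
            w = (k + 1) / (n + 1)  # weight of the incoming unit rises across the seam
            idx = len(out) - n + k
            out[idx] = (1 - w) * out[idx] + w * u.samples[k]
        out.extend(u.samples[n:])
    return out


if __name__ == "__main__":
    database = [
        Unit("h", 100.0, [1.0, 1.0, 1.0, 1.0]),
        Unit("h", 180.0, [0.5, 0.5, 0.5, 0.5]),
        Unit("i", 110.0, [0.0, 0.0, 0.0, 0.0]),
        Unit("i", 200.0, [0.2, 0.2, 0.2, 0.2]),
    ]
    chosen = select_units(database, [("h", 105.0), ("i", 115.0)])
    print([u.pitch for u in chosen])
    print(crossfade_join(chosen))
```

A real system would describe each piece with many more features (duration, stress, neighbouring sounds, spectral shape) and would use more careful signal processing at the joins, but the division of labour between database, features, search and joining is the same.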
Edinburgh, Scotland, has long been a centre for research in speech synthesis. The Centre for Speech Technology Research (CSTR) at the University of Edinburgh has worked in this area for over 25 years. Two commercial companies, Rhetorical Systems and CereProc, were founded at the University and went on to commercialise and develop much of this research. The aim of researchers in this field is to produce synthetic speech that cannot be distinguished from natural speech: synthetic speech that is as powerful an instrument as our natural voices, able to convey nuance and emotion.
The popularity of mobile devices and the advent of pervasive computing have intensified interest in synthesis techniques. If a computer needs to communicate with someone, and it can't use text, it has to generate speech. Many new applications require high quality speech synthesis. Inevitably this means modelling the natural variation present in our voices and, in some cases, the emotional variation as well. Such requirements are even more important if the synthesised voice is being used to give the power of speech back to a person who has lost it.
Roger Ebert, arguably America's most famous film critic, lost the ability to speak after a thyroid cancer operation. Although he used speech synthesis available on his Apple Mac to communicate, he was frustrated because the voice did not sound like him. CereProc Ltd stepped in to help him. Using hours of Roger's commentaries from DVDs, they were able to create a voice that mimics his original speaking style. Traditional unit selection speech synthesis will always create a voice that mimics the speaker used to record the database. However, in this case, it was also necessary to smooth recordings that were made years apart from each other so that they could be joined together with almost no noticeable discontinuity. These techniques were also used by CereProc to create a satirical 'Bush-o-matic' website which mimicked the speech of George W. Bush.
Although CereProc offers synthetic voices which can synthesise a limited range of emotions, the voice quality required to do this has to be recorded in advance. So, while Roger can manipulate the synthesis to alter pronunciation, pitch and intonation, his voice will reflect the same speaking style as the original commentaries.
Researchers at CSTR are investigating the use of a statistical model of a speaker to generate speech, rather than recombining pieces of speech from a database. Such an approach allows the modification of voice quality and the morphing of one person's voice into another. Even more exciting is the prospect of using much less audio to duplicate a person's voice. This could be of tremendous benefit for those who have lost the power of speech but who, unlike Roger Ebert, do not possess extensive archive recordings of their voices.
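To see why a statistical model makes morphing possible, consider a deliberately simplified sketch. It assumes, purely for illustration, that each speaker is summarised by a handful of average acoustic parameters; the parameter names and values below are invented, and this is not a description of the CSTR models. Once a voice is a set of numbers rather than a pile of recordings, a voice partway between two speakers is just a weighted blend of those numbers.

```python
def morph(speaker_a, speaker_b, alpha):
    """Blend two speaker models; alpha=0 gives speaker_a, alpha=1 gives speaker_b.

    Each model is a dict mapping a parameter name to its average value
    (a toy stand-in for a full statistical model of a voice).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    if speaker_a.keys() != speaker_b.keys():
        raise ValueError("both models must describe the same parameters")
    return {
        name: (1.0 - alpha) * speaker_a[name] + alpha * speaker_b[name]
        for name in speaker_a
    }


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to make the example concrete.
    voice_a = {"pitch_hz": 120.0, "speaking_rate": 4.0}
    voice_b = {"pitch_hz": 210.0, "speaking_rate": 5.0}
    print(morph(voice_a, voice_b, 0.5))
```

The same idea hints at why less audio might be needed: adjusting a compact set of model parameters towards a new speaker takes far less data than recording a full database of that speaker.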
The ability to give a natural-sounding voice to animated characters, virtual agents and robots using speech synthesis is a reality. The scope for delivering information using synthesis is immense. However, just as our own power of speech reflects our humanity, so speech synthesis can add a touch of humanity to our machines and tools. In the end, this sensation of seeing ourselves in our machines is perhaps the strangest and most fascinating aspect of speech synthesis technology. (Matthew Aylett, University of Edinburgh, www.atomiumculture.eu)