A System for Creating Personalized Synthetic Voices

Debra Yarrington, Chris Pennington, John Gray, H. Timothy Bunnell · 2005 · Proceedings of the 7th International ACM SIGACCESS Conference on Computers and Accessibility (Assets '05) · doi:10.1145/1090785.1090827

Summary

This paper presents the ModelTalker Voice Creation System, a tool that lets individuals create personalized synthetic voices with unrestricted vocabulary for use in augmentative and alternative communication (AAC) devices. The system addresses a significant problem in AAC: approximately 2 million people in the United States have limited ability to speak due to conditions including cerebral palsy, stroke, head trauma, cancer, multiple sclerosis, muscular dystrophy, and ALS, yet the communication devices available at the time offered only a handful of generic synthetic voices. As a result, an eight-year-old child might communicate with the same voice as a 45-year-old man, and several students in one classroom might share an identical voice. ModelTalker is particularly valuable for people with ALS and other progressive conditions who know they will lose the ability to speak: they can record their own voice while they still can and later use a synthetic version that preserves their vocal identity. The system uses data-based (concatenative) synthesis rather than older rule-based formant synthesis, which produces more natural and intelligible speech while keeping the recording burden manageable.

Key findings

The ModelTalker system consists of three components. InvTool guides users through recording up to 1,650 phrases, including 87 phrases from the Generic Message List for AAC users with ALS, function words in various contexts, short and long phrases, nonsense sentences, and common content words. During recording it monitors pitch, amplitude, and pronunciation consistency and uses speech recognition to detect gross mispronunciations. BCC, the database creation tool, refines phoneme boundary locations based on the speaker's specific acoustic characteristics, then judges each phoneme on duration, amplitude, voicing percentage, and other features relative to the speaker's own distribution and prunes those it finds unacceptable. The ModelTalker TTS engine synthesizes speech by converting text to phoneme strings and searching for the best-matching recorded phoneme pairs; it prefers continuous segments from the original recordings for natural quality and falls back to acoustically similar pairs when needed. The entire system runs on a home PC and outputs SAPI-compliant voices that are compatible with any standard speech application. Users can also include custom words and phrases they expect to use frequently, so that those are reproduced with recorded-speech quality.
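
The pruning step attributed to BCC above can be illustrated with a minimal sketch. This is not the paper's algorithm: the summary only says that phonemes are judged relative to the speaker's distribution, so the z-score rule, the 2.5 threshold, the function name `prune_tokens`, and the record field names below are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

# Hypothetical field names. The summary lists duration, amplitude, and voicing
# percentage among the features; this record layout is an assumption.
FEATURES = ("duration_ms", "amplitude_db", "voiced_pct")


def prune_tokens(tokens, max_z=2.5):
    """Split the recorded tokens of ONE phoneme from ONE speaker into (kept, pruned).

    Assumed rule: a token is pruned when any feature lies more than `max_z`
    sample standard deviations from that speaker's mean for the feature.
    """
    if len(tokens) < 3:
        # Too few samples to estimate a distribution; keep everything.
        return list(tokens), []

    # Per-feature mean and sample standard deviation for this speaker.
    stats = {}
    for name in FEATURES:
        values = [t[name] for t in tokens]
        stats[name] = (mean(values), stdev(values))

    kept, pruned = [], []
    for t in tokens:
        is_outlier = False
        for name in FEATURES:
            mu, sigma = stats[name]
            # A feature with zero spread cannot flag an outlier.
            if sigma > 0 and abs(t[name] - mu) / sigma > max_z:
                is_outlier = True
                break
        (pruned if is_outlier else kept).append(t)
    return kept, pruned


# Usage: eleven typical tokens plus one with an implausibly long duration.
tokens = [
    {"duration_ms": float(d), "amplitude_db": 60.0, "voiced_pct": 95.0}
    for d in range(78, 89)
]
tokens.append({"duration_ms": 300.0, "amplitude_db": 60.0, "voiced_pct": 95.0})
kept, pruned = prune_tokens(tokens)
```

Because the statistics are computed per speaker, a naturally slow or quiet talker is not penalized against a population norm; only tokens that are unusual for that individual are dropped. A production system would likely need a more robust spread estimate, since a single extreme token inflates the standard deviation it is measured against.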

Relevance

ModelTalker represents a pioneering approach to voice preservation that has become increasingly relevant as awareness grows about the importance of personal identity in AAC. The ability to bank one's own voice before losing the ability to speak addresses both an emotional and a practical need: synthetic speech is not just about being understood, but about sounding like yourself. For accessibility practitioners, this work highlights that voice output in AAC is not a solved problem simply because text-to-speech exists; the quality, naturalness, and personal identity of the voice matter deeply to users. The system is downloadable for home or clinical use, SAPI-compliant for broad device compatibility, and designed to minimize recording burden for people with progressive conditions, all of which shows careful attention to real-world deployment constraints. This early work laid important groundwork for the voice banking services that are now more widely available to people facing speech loss.

Tags: speech synthesis · voice banking · AAC · amyotrophic lateral sclerosis · text-to-speech · personalized voice · assistive technology · concatenative synthesis

Standards referenced: SAPI