Loudmouth: Modifying Text-to-Speech Synthesis in Noise

Rupal Patel, Michael Everett, Eldar Sadikov · 2006 · Proceedings of the 8th International ACM SIGACCESS Conference on Computers and Accessibility (Assets '06) · doi:10.1145/1168987.1169028

Summary

This short paper from Northeastern University presents Loudmouth, a modified text-to-speech (TTS) synthesizer that emulates the Lombard effect, the natural way humans adjust their speech in noisy environments, to improve the intelligibility of synthesized speech in noise. Standard TTS systems become difficult to understand in everyday noisy situations, and simply increasing volume distorts the signal and degrades intelligibility. The researchers modified FreeTTS, an open-source concatenative synthesizer, by adding a Text Analyzer that uses a part-of-speech tagger to separate semantically salient content words (such as nouns and verbs) from non-salient function words. Based on empirical data about how human speakers modify their speech in noise, Loudmouth applies three acoustic modifications differentially to salient and non-salient words: increased duration (salient words averaged 345 ms longer, a 98% increase), raised fundamental frequency (f0 shifted approximately 20 Hz higher for salient words), and amplified intensity (salient words averaged 4 dB higher).
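The salience-based pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: a small hypothetical function-word list stands in for the part-of-speech tagger, the names `plan_prosody`, `WordProsody`, and `FUNCTION_WORDS` are invented for this example, and the salient-word figures from the summary are used as illustrative targets. Non-salient words are left unmodified only because the summary gives no figures for them.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stand-in for a part-of-speech tagger: a small closed list of
# English function words. The system described above used a real POS tagger.
FUNCTION_WORDS = {
    "a", "an", "the", "of", "to", "in", "on", "at", "for", "with", "by",
    "and", "or", "but", "is", "are", "was", "were", "be", "it", "this", "that",
}

# Salient-word targets taken from the summary (illustrative use only).
SALIENT_DURATION_SCALE = 1.98  # a 98% increase in duration
SALIENT_F0_SHIFT_HZ = 20.0     # f0 raised by roughly 20 Hz
SALIENT_GAIN_DB = 4.0          # intensity raised by 4 dB


@dataclass
class WordProsody:
    word: str
    salient: bool
    duration_scale: float   # multiplier on the word's original duration
    f0_shift_hz: float      # additive shift to fundamental frequency
    amplitude_scale: float  # linear multiplier on waveform amplitude


def db_to_amplitude(gain_db: float) -> float:
    """Convert a gain in decibels to a linear amplitude multiplier."""
    return 10.0 ** (gain_db / 20.0)


def is_salient(word: str) -> bool:
    """Treat any word outside the function-word list as salient."""
    return word.lower().strip(".,!?;:") not in FUNCTION_WORDS


def plan_prosody(sentence: str) -> List[WordProsody]:
    """Assign per-word prosody targets; non-salient words are left unchanged
    here, which is an assumption made for this sketch."""
    plan = []
    for word in sentence.split():
        if is_salient(word):
            plan.append(WordProsody(word, True, SALIENT_DURATION_SCALE,
                                    SALIENT_F0_SHIFT_HZ,
                                    db_to_amplitude(SALIENT_GAIN_DB)))
        else:
            plan.append(WordProsody(word, False, 1.0, 0.0, 1.0))
    return plan


if __name__ == "__main__":
    for entry in plan_prosody("The dog chased the ball."):
        print(entry)
```

Note that a 4 dB intensity increase corresponds to an amplitude multiplier of about 1.58, well short of doubling the waveform amplitude (which would be roughly 6 dB).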

Key findings

A perceptual experiment with ten adult monolingual English speakers (5 male, 5 female, mean age 21.3) compared Loudmouth against unmodified FreeTTS in silence and in 80 dB multi-talker noise. In silence, both synthesizers achieved nearly 100% correct word recognition, indicating that Loudmouth's modifications did not degrade baseline speech quality. In noise, Loudmouth was 7% more intelligible than the standard synthesizer, a statistically significant improvement. This result suggests that linguistically informed acoustic modifications based on the Lombard effect are a viable way to improve TTS intelligibility in real-world noise, going beyond simple volume amplification.
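For readers unfamiliar with word-recognition scores, one common way to compute a percent-correct figure is sketched below. The summary does not describe the study's actual scoring procedure, so the function `percent_words_correct`, its matching rule, and the example sentences are all assumptions made for illustration.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and strip surrounding punctuation from each word."""
    cleaned = (w.strip(".,!?;:").lower() for w in text.split())
    return [w for w in cleaned if w]


def percent_words_correct(target: str, response: str) -> float:
    """Percentage of target words found in the listener's response.

    Each response word can match at most one target word, so a repeated
    target word must also be repeated in the response to count twice.
    """
    target_words = tokenize(target)
    if not target_words:
        raise ValueError("target sentence contains no words")
    remaining = tokenize(response)
    hits = 0
    for word in target_words:
        if word in remaining:
            remaining.remove(word)
            hits += 1
    return 100.0 * hits / len(target_words)


if __name__ == "__main__":
    # Hypothetical target sentence and listener transcription.
    print(percent_words_correct("The dog chased the ball.",
                                "the dog chased a ball"))
```

With the hypothetical pair shown, four of the five target words are matched, giving a score of 80.0.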

Relevance

This research is directly relevant to users of augmentative and alternative communication (AAC) who rely on speech synthesis devices as their primary means of communication. In real-world environments such as restaurants, classrooms, and public transit, background noise severely limits the effectiveness of synthetic speech output, isolating AAC users from conversation precisely when communication matters most. Loudmouth's approach of mimicking natural human speech-in-noise adaptations, rather than simply boosting volume, is a more sophisticated and effective strategy. Its linguistically aware modification, which enhances salient content words more than function words, mirrors how human listeners parse speech in noise by focusing processing resources on the most informative parts of an utterance. While the 7% improvement may seem modest, in practical AAC use even small gains in intelligibility can mean the difference between being understood and being asked to repeat. The principle of adapting synthetic speech output to environmental conditions remains relevant to modern TTS systems used in voice assistants, public announcements, and communication aids.

Tags: text-to-speech · speech synthesis · AAC · Lombard effect · speech intelligibility · noise · prosody · assistive technology