On the Intelligibility of Fast Synthesized Speech for Individuals with Early-Onset Blindness
Amanda Stent, Ann Syrdal, Taniya Mishra · 2011 · Proceedings of the 13th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2011) · doi:10.1145/2049536.2049574
Summary
This paper reports on a pilot experiment comparing the intelligibility of fast synthesized speech across different text-to-speech (TTS) systems for individuals with early-onset blindness (onset before age seven). People who are blind increasingly use TTS as their primary computer output modality, typically listening at several times real-time speed, yet no systematic comparison of TTS system performance had been conducted for this user population. The researchers tested four synthesis systems representing two major approaches: two formant-based synthesizers and two concatenative unit-selection synthesizers. Thirty-six participants with early-onset blindness completed a web-based open-response recall task using semantically unpredictable sentences (SUS), which are grammatically correct but meaningless sentences designed to prevent participants from using context to guess words. Each synthesizer produced speech at six speeds ranging from 300 to 550 words per minute (roughly 1.5 to 3 times real time), using both male and female American English voices. Transcription accuracy was measured using cosine similarity between reference and participant transcriptions. The experiment faced practical challenges, including screen reader interference with audio playback and the cognitive load of a one-hour web-based task.
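As a rough illustration of the accuracy measure described above, the following is a minimal sketch of cosine similarity between a reference sentence and a participant transcription, treating each as a bag of words. The tokenization (lowercasing, keeping only letters and apostrophes), the function names, and the example sentences are assumptions made for illustration; the summary does not specify how the authors normalized transcriptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens (an assumed normalization)."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(reference, transcription):
    """Cosine similarity between the word-count vectors of two strings.

    Returns 1.0 for identical word counts and 0.0 when the strings share
    no words or when either string is empty.
    """
    ref_counts = Counter(tokenize(reference))
    hyp_counts = Counter(tokenize(transcription))

    # Dot product over the words that appear in the reference.
    dot = sum(count * hyp_counts[word] for word, count in ref_counts.items())

    ref_norm = math.sqrt(sum(c * c for c in ref_counts.values()))
    hyp_norm = math.sqrt(sum(c * c for c in hyp_counts.values()))
    if ref_norm == 0 or hyp_norm == 0:
        return 0.0
    return dot / (ref_norm * hyp_norm)


# Hypothetical SUS-style reference and a transcription with one wrong word.
reference = "The table walked through the blue truth"
transcription = "the table walked through the blue tooth"
score = cosine_similarity(reference, transcription)
```

Because this measure ignores word order, a transcription containing the right words in the wrong order still scores 1.0. That is a property of the bag-of-words formulation used in this sketch, not a claim about the authors' exact procedure.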
Key findings
The study found a significant main effect of speaking rate on intelligibility (F(4,5760) = 59.9759, p < .001), with transcription accuracy decreasing as speed increased. There was a trend towards significance for synthesizer type (F(1,5760) = 3.2563, p = .08), but no significant effect for voice gender. One formant-based synthesizer (FTTS2) maintained transcription accuracy above 0.8 for its male voice across all speaking rates, substantially outperforming the others. Even at 500 words per minute (2.5x real time), all synthesizers maintained accuracy at or above 0.5. Post-hoc analyses revealed important participant-related factors: younger participants (under 25) had the highest accuracy and the most gradual decline with speed, while those over 51 had the lowest baseline accuracy and the steepest decline. Expert TTS users achieved higher accuracy (0.83 at 350 wpm) than non-expert users (0.76 at 350 wpm). Familiarity with a specific synthesizer appeared to negate or retard the negative impact of increasing speed. Native English speakers achieved higher accuracy at every rate, with the gap widening at higher speeds. The researchers also explored alternative automated metrics for measuring transcription accuracy, finding that Cosine-plain and Dice-plain correlated highly (r = .982) with the manually processed Cosine metric.
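To give a sense of what a Dice-style transcription metric computes, here is a minimal sketch of the Dice coefficient over the sets of distinct words in a reference and a transcription. The set-based formulation, the tokenization, and the names below are assumptions for illustration; the summary does not define how the paper's Dice-plain metric was calculated.

```python
import re


def word_set(text):
    """Set of distinct lowercase word tokens (an assumed normalization)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def dice_coefficient(reference, transcription):
    """Dice coefficient 2*|A & B| / (|A| + |B|) over distinct words.

    Returns 0.0 when both strings are empty.
    """
    ref_words = word_set(reference)
    hyp_words = word_set(transcription)
    total = len(ref_words) + len(hyp_words)
    if total == 0:
        return 0.0
    return 2 * len(ref_words & hyp_words) / total


# Hypothetical SUS-style reference and a transcription with one wrong word.
dice_score = dice_coefficient(
    "The table walked through the blue truth",
    "the table walked through the blue tooth",
)
```

In this example each sentence has six distinct words and five are shared, so the coefficient is 10/12. A metric of this kind needs no manual correction of transcriptions, which is the practical appeal of the automated alternatives the authors examined.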
Relevance
This research addresses a critical gap in understanding how people who are blind actually experience the TTS systems they depend on daily. For accessibility practitioners and TTS developers, the findings have direct implications: synthesizer choice matters, and the best-performing engine maintained high intelligibility even at very fast speeds. The fact that familiarity with a synthesizer can offset speed-related intelligibility loss suggests that consistency in TTS engine selection is important for screen reader users, since switching engines may temporarily reduce comprehension. The age-related findings are particularly relevant for an aging population of screen reader users, indicating that older users may need different speed defaults or more gradual speed increases. The study also highlights practical challenges in conducting web-based research with screen reader users, including interference between screen reader speech and experimental audio, and so offers methodological lessons for future accessibility research. The automated metrics comparison is useful for researchers seeking to evaluate TTS quality without extensive manual post-processing.
Tags: text-to-speech · screen readers · blindness · speech technology · speech intelligibility · assistive technology