Faster Text-to-Speeches: Enhancing Blind People's Information Scanning with Faster Concurrent Speech

João Guerreiro, Daniel Gonçalves · 2015 · ASSETS '15: Proceedings of the 17th International ACM SIGACCESS Conference on Computers & Accessibility · doi:10.1145/2700648.2809840

Summary

This paper investigates how blind users can scan digital information more efficiently by comparing two approaches: increasing speech rate (the traditional method) and using concurrent speech (multiple simultaneous voices). The research leverages the "Cocktail Party Effect", the human ability to focus attention on one voice among several competing conversations while still detecting relevant content in the background.

The researchers conducted an experiment with 30 visually impaired participants (12 fully blind, 18 low vision; ages 23-64) who performed a relevance scanning task: listening to six news snippets and identifying which ones related to a specific topic (sports, politics, or entertainment). The experiment used a "Text-to-Speeches" framework that positioned pre-recorded audio in 3D space using Head-Related Transfer Functions for spatial localization.

To enable fair comparison, the study introduced the concept of "Information Bandwidth" (IB): how many times faster content is delivered compared to the default speech rate. An IB of 3, for example, could be achieved by one voice at 3x speed, two voices at 1.5x speed each, or three voices at 1x speed. The experiment tested IB conditions from 2 to 6, with one, two, and three concurrent voices. Voices were spatially separated (180° apart for two voices, 90° apart for three) and used different genders (one female, two male) to enhance perceptual segregation.
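The Information Bandwidth arithmetic described above can be illustrated with a short sketch. It assumes, consistent with the IB 3 example in the summary, that all concurrent voices speak at the same rate, so that IB is the number of voices multiplied by the rate multiplier. The function names are hypothetical and are not taken from the paper or its framework.

```python
def information_bandwidth(voices, rate):
    """IB as voices x rate multiplier (assumes every voice uses the same rate)."""
    return voices * rate


def configurations(ib, max_voices=3):
    """Per-voice rate multiplier needed to reach a target IB with 1..max_voices voices."""
    return {v: ib / v for v in range(1, max_voices + 1)}


if __name__ == "__main__":
    # The summary's own example: an IB of 3 via one, two, or three voices.
    for voices, rate in configurations(3).items():
        print(f"{voices} voice(s) at {rate:g}x default rate")
    # Two voices at 1.75x default rate correspond to IB 3.5.
    print(information_bandwidth(2, 1.75))
```

Holding IB fixed while varying the number of voices is what allows the one-, two-, and three-voice conditions to be compared on equal terms.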

Key findings

Results revealed a dramatic performance difference between approaches. One-Voice performance degraded consistently as Information Bandwidth increased, with F-Scores dropping from 0.949 (IB 2) to 0.289 (IB 6). In contrast, Two-Voices and Three-Voices maintained high performance across conditions. At IB 3.5, One-Voice achieved only 0.461 F-Score while Two-Voices reached 0.918 and Three-Voices 0.820.

The optimal configuration was Two-Voices at 1.75x default rate (~278 WPM), which enabled efficient scanning while maintaining basic comprehension of all sentences. With Two-Voices, participants achieved F-Scores of 0.913 and 0.826 for the IB 3.5 and 4 conditions respectively. Several participants correctly identified all relevant news in these conditions.

User characteristics affected performance in predictable ways. Age correlated negatively with performance when speech rates exceeded 2.5x default rate, supporting prior research on speech perception decline with aging. Working memory (measured via digit span) correlated positively with Three-Voices performance, suggesting that managing three concurrent streams requires cognitive resources to filter distracting information.

Subjective ratings confirmed the objective findings. Participants rated One-Voice as significantly more difficult at higher IB conditions (median easiness dropping from 6 to 1.5 on a 7-point scale), while Two-Voices ratings remained stable (median 5-6) until the highest IB conditions. Participants reported that concurrent speech allowed them to "focus on a particular voice and also to switch attention to another" whereas very fast single voices only allowed capturing "a few keywords" without deeper understanding.
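For readers unfamiliar with the metric, the following is a minimal sketch of how an F-Score can be computed for a single relevance scanning trial. The summary does not say which F-Score variant the authors used, so the balanced F1 default (`beta=1.0`) is an assumption, and the snippet indices in the example are invented for illustration.

```python
def f_score(selected, relevant, beta=1.0):
    """F-score of a participant's selections against the truly relevant snippets.

    beta=1.0 gives the balanced F1 score (an assumption; the variant used
    in the study is not stated in this summary).
    """
    true_positives = len(selected & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(selected)
    recall = true_positives / len(relevant)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    relevant = {1, 4}  # hypothetical: snippets 1 and 4 match the target topic
    print(f_score({1, 4}, relevant))  # both relevant snippets found, nothing extra
    print(f_score({1}, relevant))     # one relevant snippet missed
    print(f_score({1, 2}, relevant))  # one miss and one false alarm
```

The score penalises both missed relevant snippets and false alarms, so a participant cannot score well simply by selecting every snippet.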

Relevance

This research has direct implications for screen reader design and auditory interface development. The finding that concurrent speech outperforms very fast single-voice speech challenges the current paradigm where users simply increase speech rate to scan information faster. Screen readers could implement concurrent speech modes for scanning tasks like reviewing email lists, search results, or social media feeds, contexts where users need to quickly identify relevant items without fully processing each one.

The spatial audio and voice differentiation techniques used in this study provide a blueprint for implementation. Using different voice genders and spatial positions enhanced stream segregation without requiring frequency manipulation that could degrade voice quality. The Two-Voices configuration balanced efficiency with accessibility: even participants with lower working memory could benefit, unlike Three-Voices, which correlated with cognitive resources.

For accessibility practitioners, the study highlights the importance of matching interface design to task type. Concurrent speech suits scanning and relevance detection tasks but may not be appropriate for scenarios requiring full comprehension of all content. The age-related performance decline at very high speech rates also underscores the need for configurable interfaces that accommodate diverse user characteristics.

The research opens avenues for future work on concurrent speech in web browsing, social media, and notification handling, all contexts where blind users must process large amounts of information to find relevant content.
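As a rough illustration of the spatial layout mentioned above, the sketch below derives azimuth angles from the separations reported in the study (180° apart for two voices, 90° apart for three). Centring the voices symmetrically around the listener's front (0°) is an assumption, as is the function name; the summary does not give the exact positions used.

```python
# Angular separation between adjacent voices, as reported in the summary.
SEPARATION_DEGREES = {1: 0.0, 2: 180.0, 3: 90.0}


def azimuths(voices):
    """Azimuths in degrees (negative = left, positive = right), centred on the front.

    The symmetric placement around 0 degrees is an assumption for illustration.
    """
    spacing = SEPARATION_DEGREES[voices]
    start = -spacing * (voices - 1) / 2
    return [start + i * spacing for i in range(voices)]


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, azimuths(n))
```

A screen reader prototype could pass angles like these to whatever spatial audio renderer it uses; the renderer itself is outside the scope of this sketch.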

Tags: blindness · screen readers · text-to-speech · speech rate · concurrent speech · auditory perception · cocktail party effect · information scanning