Scanning for Digital Content: How Blind and Sighted People Perceive Concurrent Speech
João Guerreiro, Daniel Gonçalves · 2016 · ACM Transactions on Accessible Computing (TACCESS) · doi:10.1145/2822910
Summary
This paper investigates whether blind and sighted people can use concurrent speech (multiple audio streams playing simultaneously) to scan for and identify relevant digital content more efficiently, exploiting the well-known Cocktail Party Effect. Screen readers currently present information sequentially through a single audio channel, forcing blind users to listen through potentially long lists of items one by one to find relevant content. Sighted users also increasingly consume information through audio (podcasts, audiobooks, text-to-speech) and face the same sequential bottleneck. The authors conducted a controlled experiment with 46 participants (23 blind, 23 sighted) who listened to news snippets presented through two, three, or four concurrent, spatially separated speech sources. Each trial contained one relevant sentence (matching a given topic keyword) among irrelevant sentences; participants had to identify which source carried the relevant content and then report its meaning. The speech sources were positioned using head-related transfer functions (HRTFs), which create the sensation of sound arriving from distinct locations around the listener: left, right, center, and back. The experiment varied both the number of concurrent sources and whether the sources used the same voice or different voices. Performance was measured by identification accuracy (which source was relevant), intelligibility (comprehension of the content), and subjective ratings.
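The idea of spatially separating concurrent sources can be illustrated with a minimal sketch. It is not the authors' HRTF rendering: it swaps in plain constant-power stereo panning, which only approximates left-to-right separation and cannot place a source behind the listener. The function names (`spread_azimuths`, `pan_gains`, `mix_concurrent`) and the even spacing across the frontal arc are assumptions made for illustration.

```python
import math

def spread_azimuths(n_sources):
    """Evenly space sources from -90 (hard left) to +90 (hard right) degrees.

    Assumption for illustration only; the study used HRTF-based positions.
    """
    if n_sources < 1:
        raise ValueError("need at least one source")
    if n_sources == 1:
        return [0.0]
    step = 180.0 / (n_sources - 1)
    return [-90.0 + i * step for i in range(n_sources)]

def pan_gains(azimuth_deg):
    """Constant-power (left, right) gains for an azimuth clamped to [-90, 90]."""
    azimuth_deg = max(-90.0, min(90.0, azimuth_deg))
    pan = (azimuth_deg + 90.0) / 180.0   # 0.0 = left, 1.0 = right
    angle = pan * math.pi / 2.0
    return math.cos(angle), math.sin(angle)

def mix_concurrent(sources):
    """Mix mono sample lists into a list of (left, right) frames.

    Each source gets its own azimuth; output length is the shortest source.
    """
    azimuths = spread_azimuths(len(sources))
    gains = [pan_gains(az) for az in azimuths]
    n_frames = min(len(samples) for samples in sources)
    frames = []
    for i in range(n_frames):
        left = sum(g[0] * s[i] for g, s in zip(gains, sources))
        right = sum(g[1] * s[i] for g, s in zip(gains, sources))
        frames.append((left, right))
    return frames
```

With three sources this places one hard left, one in the center, and one hard right. A faithful reproduction of the study's setup would instead convolve each stream with measured head-related impulse responses.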
Key findings
Blind and sighted participants performed comparably across all conditions, with no significant differences between groups. This is a key finding: it suggests that concurrent speech solutions can follow a Design for All approach rather than requiring separate interfaces. With two concurrent sources, 39 of 46 participants achieved 100% identification accuracy, making two-source presentation highly viable for scanning tasks. With three sources, 36 of 46 participants still identified the relevant source in at least 5 of 6 trials. Four sources proved challenging, with average identification accuracy around 50%, though some participants still performed well. Sound source location was by far the most important cue for identification: participants preferred and relied on spatial separation over voice differences. Surprisingly, using a different voice for each source did not significantly improve identification or intelligibility, contradicting expectations from prior literature on smaller speech signals. Participants nevertheless strongly preferred different voices (33 vs 2), reporting greater confidence and ease. Working memory, measured by Digit Span scores, was significantly correlated with blind participants' ability to recall content, suggesting that it could be used to calibrate the number of concurrent sources offered to individual users. The authors propose four usage scenarios: scanning information items (search results, news feeds), scanning within documents (paragraphs read simultaneously), secondary audio notifications alongside primary listening, and multitouch-to-multisound interfaces on tablets.
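To put the reported counts on a common scale, the short sketch below converts them to percentages of the 46 participants. The counts come from the summary above; the variable names are illustrative, and treating the 33 vs 2 voice preference as drawn from the same 46 participants is an assumption.

```python
TOTAL_PARTICIPANTS = 46

def pct(count, total=TOTAL_PARTICIPANTS):
    """Percentage of `total`, rounded to one decimal place."""
    return round(100.0 * count / total, 1)

stats = {
    # 39 of 46 identified the relevant source in every two-source trial.
    "two_sources_perfect_pct": pct(39),
    # 36 of 46 identified it in at least 5 of 6 three-source trials.
    "three_sources_5_of_6_pct": pct(36),
    # 5 of 6 trials expressed as a per-participant accuracy.
    "five_of_six_accuracy_pct": pct(5, 6),
    # Voice preference (assumed to be out of the same 46 participants).
    "prefer_different_voices_pct": pct(33),
    "prefer_same_voice_pct": pct(2),
    # Participants not covered by the 33 vs 2 split.
    "preference_not_reported": TOTAL_PARTICIPANTS - 33 - 2,
}

for name, value in stats.items():
    print(f"{name}: {value}")
```

Under these assumptions, roughly 85% of participants were perfect with two sources and roughly 78% reached at least 83% accuracy with three. The summary does not break down the 11 participants outside the 33 vs 2 preference split.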
Relevance
This research opens a fundamentally new interaction paradigm for non-visual information access. Instead of speeding up a single sequential audio stream (the current approach of screen reader power users, who listen at 2 to 3 times normal speed), concurrent speech offers parallel information delivery that could substantially reduce the time blind users spend scanning content. The finding that blind and sighted users perform equally well is significant for universal design: interfaces leveraging concurrent speech could serve everyone who consumes audio information, from screen reader users to podcast listeners to drivers using voice interfaces. For accessibility practitioners, the practical guidelines are clear: two concurrent sources are reliable for most users and tasks; three sources work for identification tasks with lower intelligibility demands; spatial separation is the primary design lever; and working memory capacity should inform individual calibration. The connection to WAI-ARIA is also notable: concurrent speech could serve as a secondary notification channel for dynamic content updates that currently interrupt the primary audio stream.
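The practitioner guidelines above could be encoded along the following lines. This is a hypothetical sketch, not something the paper specifies: the function name `suggest_source_count`, the `needs_full_comprehension` flag, and the optional `user_cap` (a per-user limit that some calibration step, such as a working-memory measure, might supply) are all assumptions. No numeric working-memory threshold is given in the summary, so none is invented here.

```python
from typing import Optional

def suggest_source_count(needs_full_comprehension: bool,
                         user_cap: Optional[int] = None) -> int:
    """Suggest how many concurrent speech sources to present.

    Two sources when the listener must understand the content, three when
    the task is only to identify which source is relevant. Four is never
    suggested, since identification averaged around 50% in that condition.
    `user_cap` is a hypothetical per-user limit from individual calibration.
    """
    base = 2 if needs_full_comprehension else 3
    if user_cap is not None:
        # Never exceed the user's calibrated limit, and never drop below 1.
        base = max(1, min(base, user_cap))
    return base

# Example: a scanning task where the user only needs to spot the relevant item.
print(suggest_source_count(needs_full_comprehension=False))
```

Keeping the per-user limit as a separate input leaves room for whatever calibration procedure an interface adopts, without hard-coding a threshold the study does not provide.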
Tags: visual impairment · concurrent speech · auditory interface · screen reader · cocktail party effect · spatial audio · information scanning · universal design
Standards referenced: WAI-ARIA