Sasayaki: Augmented Voice Web Browsing Experience

Daisuke Sato, Shaojian Zhu, Masatomo Kobayashi, Hironobu Takagi, Chieko Asakawa · 2011 · Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '11) · doi:10.1145/1978942.1979353

Summary

This paper introduces Sasayaki (Japanese for 'whisper'), a prototype that augments the standard screen-reader voice with a second, physically separated synthesised voice that whispers contextually relevant hints, for example 'entering main content', 'skipped the main', 'the price is $120', or 'close to main'. The authors argue that although synthesised speech has opened the web to an estimated billion people with visual or reading limitations, a single sequential speech channel loses structural cues that sighted users rely on (headings, landmarks, visual grouping, price-vs-button layout) and leaves blind users unable to form an overview of a page. Existing strategies (transcoding, semantic re-authoring, heading-jump shortcuts) help but still operate through one voice. Sasayaki instead runs alongside the primary screen reader as an IBM aiBrowser plug-in, drawing on an Accessibility Commons metadata server plus on-page text analysis to generate: (1) spatial cues about the current cursor position, (2) social cues aggregated from volunteer-authored metadata, (3) analytical cues such as sentiment summaries of product reviews, and (4) 'jump' shortcuts to role-tagged regions. The primary voice plays through the laptop speaker while the Sasayaki voice plays through a separate USB speaker placed next to it, mimicking a sighted friend whispering over the user's shoulder. The authors ran a within-participants study with nine blind Japanese users across four conditions (whisper on or off, crossed with jump on or off) and five tasks on Asahi, Nikkei, Amazon, Yahoo, and Amazon search-results pages.
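To make the two-voice design concrete, here is a minimal sketch of how spatial cues, jump shortcuts, and channel routing might fit together. It is not the authors' implementation: the Region class, the function names, the character-offset model of the page, and the "primary"/"whisper" channel labels are all assumptions made for illustration. Only the whispered phrases are taken from the summary above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Region:
    """Hypothetical role-tagged page region, as [start, end) reading offsets."""
    role: str
    start: int
    end: int


def spatial_cue(prev: int, cur: int, main: Region) -> Optional[str]:
    """Return a whispered hint when the cursor moves relative to the main region."""
    was_inside = main.start <= prev < main.end
    is_inside = main.start <= cur < main.end
    if is_inside and not was_inside:
        return "entering main content"
    if prev < main.start and cur >= main.end:
        return "skipped the main"
    return None


def jump_to(role: str, regions: List[Region]) -> Optional[int]:
    """Return the start offset of the first region with the given role, if any."""
    for region in regions:
        if region.role == role:
            return region.start
    return None


def speak(element_text: str, cue: Optional[str]) -> List[Tuple[str, str]]:
    """Route page text to the primary voice and any cue to the whisper voice.

    The channel labels stand in for the two physical speakers (assumed names).
    """
    utterances = [("primary", element_text)]
    if cue is not None:
        utterances.append(("whisper", cue))
    return utterances


if __name__ == "__main__":
    regions = [
        Region("navigation", 0, 10),
        Region("main", 10, 30),
        Region("footer", 30, 35),
    ]
    main = regions[1]
    cue = spatial_cue(prev=9, cur=10, main=main)
    for channel, text in speak("Today's top story", cue):
        print(f"[{channel}] {text}")
    print("jump target:", jump_to("main", regions))
```

Keeping the cue logic separate from the routing step mirrors the idea that the whisper stream is optional: the primary voice output is unchanged whether or not a cue is produced.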

Key findings

The Sasayaki jump function significantly reduced task-completion time (F = 65.23, p < .001) and keystroke counts (F = 80.09, p < .001). Average completion times fell from 112 s (no-whisper, no-jump) and 126 s (whisper only) to 71 s (jump only) and 65 s (both). Keystroke counts dropped from 194 to 69. Task failures also fell sharply: seven of nine participants failed the Amazon task within the time limit without Sasayaki, while none failed with both whisper and jump. The whisper function on its own produced no statistically significant speed gain, but it substantially improved subjective ratings for confidence, pleasantness, and 'feeling sure I would finish', and reduced the number of backtracking events visible in navigation traces. Participants did not report confusion from the two simultaneous voices; several explicitly said the voices were 'sufficiently distinguishable'. Some whispers were misunderstood (e.g. 'close to main' was interpreted too literally by two participants), suggesting that the whispered vocabulary needs training or personalisation. Overall, Sasayaki delivered more information per unit of time by adding a second audio channel, and participants did not report that the extra voice made browsing harder.
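The size of these effects is easier to compare as relative changes. The short calculation below uses only the averages quoted above; the function name and dictionary keys are illustrative, and because the summary does not state which conditions the keystroke figures 194 and 69 belong to, that comparison is computed only as "before versus after".

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is lower than `before` (negative means an increase)."""
    return (before - after) / before


# Average completion times in seconds, as quoted in the summary.
completion_s = {
    "baseline": 112,      # no whisper, no jump
    "whisper_only": 126,
    "jump_only": 71,
    "both": 65,
}

if __name__ == "__main__":
    base = completion_s["baseline"]
    for condition, seconds in completion_s.items():
        if condition == "baseline":
            continue
        change = relative_reduction(base, seconds)
        print(f"{condition}: {change:+.1%} time saved versus baseline")

    # Keystroke figures as quoted; the conditions they refer to are not stated.
    print(f"keystrokes: {relative_reduction(194, 69):.1%} fewer")
```

On these figures, the jump-only and combined conditions each cut completion time by roughly 40% relative to the baseline, while whisper only is slower than the baseline, which is consistent with the finding that whispering alone gave no speed gain.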

Relevance

Sasayaki reframes screen-reader design around the idea that a second, low-priority audio stream can carry structural and contextual information in parallel with primary speech, much as a sighted reader benefits from peripheral vision. The results are relevant for modern screen readers and browser extensions: they suggest that ambient cues about landmarks, skipped content, and page structure can improve user confidence, while shortcuts to role-tagged regions account for the gains in task speed. The concept may generalise to in-car voice assistants, mobile eyes-busy scenarios, and accessible telephony. Limitations include a small all-blind, all-Japanese, all-aiBrowser sample, a constrained set of pre-tagged websites, and reliance on the Accessibility Commons metadata infrastructure. The paper also stops short of examining how two simultaneous voices affect users who are more sensitive to cognitive load, or how the whispered content should be authored, personalised, or internationalised. Still, it remains an influential early demonstration that multi-voice audio interfaces are usable and welcome for blind web users.

Tags: screen readers · auditory interface · voice browser · web accessibility · blindness and low vision · non-visual interaction · assistive technology · multimodal · speech technology

Standards referenced: ARIA