Capti-Speak: A Speech-Enabled Web Screen Reader

Vikas Ashok, Yevgen Borodin, Yury Puzis, I. V. Ramakrishnan · 2015 · Proceedings of the 12th International Web for All Conference (W4A) · doi:10.1145/2745555.2746660

Summary

This paper presents Capti-Speak, a speech-augmented screen reader for web browsing that allows blind users to combine natural language voice commands with traditional keyboard shortcuts. Built as an extension to the Capti Narrator screen reader, Capti-Speak addresses a fundamental frustration with conventional screen readers: users must memorize extensive keyboard shortcuts and navigate sequentially through page content to find what they need, which is especially difficult when content labels are missing or pages are unfamiliar. Capti-Speak translates spoken utterances into browsing actions using a custom dialog act model designed specifically for non-visual web access. The system architecture includes five components: an automatic speech recognizer (Google ASR), an input interpreter that classifies utterances into dialog acts (Command-Task, Command-Navigation, Information-Task) and extracts target objects, a task manager that locates matching DOM elements and executes actions, a dialog manager that maintains conversational context, and a response generator that provides text-to-speech feedback. The input interpreter uses a decision-tree-based dialog act recognizer trained on a corpus from an earlier Wizard-of-Oz study examining how blind people naturally use speech for web browsing. A key technical innovation is the target extractor, which uses part-of-speech tagging to identify the type and descriptors of intended targets from natural language, then performs approximate matching against DOM nodes using Levenshtein distance to compensate for speech recognition errors.
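The approximate-matching step described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the summary does not say which DOM attributes are compared, how distances are normalized, or what threshold is used, so the `Node` class, the `best_match` function, and the `max_ratio` cutoff are all hypothetical. Part-of-speech tagging is omitted; the sketch assumes the target type and descriptor have already been extracted from the utterance.

```python
from dataclasses import dataclass
from typing import List, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element."""
    kind: str   # element type, e.g. "link" or "button"
    label: str  # visible text or accessible name


def best_match(descriptor: str, kind: str, nodes: List[Node],
               max_ratio: float = 0.34) -> Optional[Node]:
    """Return the node of the requested kind whose label is closest to the
    descriptor, or None if even the closest one is too far away.
    max_ratio is an assumed cutoff, not a value from the paper."""
    best, best_score = None, None
    query = descriptor.lower()
    for node in nodes:
        if node.kind != kind:
            continue
        label = node.label.lower()
        # Normalize by the longer string so long labels are not penalized.
        score = levenshtein(query, label) / max(len(query), len(label), 1)
        if best_score is None or score < best_score:
            best, best_score = node, score
    if best_score is not None and best_score <= max_ratio:
        return best
    return None


page = [
    Node("link", "Stanford Admissions"),
    Node("link", "Stanford Athletics"),
    Node("button", "Search"),
]
# A misrecognized descriptor still resolves to the intended link.
match = best_match("Stafford Admission", "link", page)
```

Normalizing the distance by string length is one possible design choice; it lets a single cutoff work for both short labels such as "Search" and longer link texts. A real system would also need to decide which node attributes (text content, alt text, form labels) feed into `label`.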

Key findings

A user study with 20 blind participants (10 men, 10 women; mean age 47; 40% expert screen reader users, 60% beginners) compared Capti-Speak to keyboard-only screen reading across four real-world tasks on Amazon, Stanford University, Craigslist, and Gmail. Capti-Speak was significantly faster overall (mean 191.9 seconds vs 284.3 seconds, p=0.0002), with the speed advantage consistent across all demographic subgroups. SUS usability scores were substantially higher for Capti-Speak (mean 83.5 vs 47.0, p<0.001). On a modified SUS questionnaire, 90% of participants preferred Capti-Speak, 85% found it easiest to use, and 90% thought people would learn it quickest. Despite a 30% speech recognition error rate from the Google ASR, the dialog act recognizer achieved 84% accuracy because it was resilient to recognition errors. For example, it correctly interpreted "Go to Stafford Admission Link" (misrecognized from "Stanford") as a navigation command, and the system still found the correct target via approximate string matching. Beginners benefited especially, as they typically used only five keyboard shortcuts (arrow keys, Tab, Spacebar, Enter) whereas experts used ten. Notably, no participant chose to use speech for form filling; all preferred the keyboard for data entry because the ASR was unreliable with names and addresses. Participants strongly agreed that they would like to use speech utterances for web tasks (mean 4.3/5).

Relevance

This research was prescient in demonstrating that voice-augmented screen reading, which combines speech commands with keyboard shortcuts rather than replacing them, significantly improves web accessibility for blind users. The finding that a multimodal approach (voice plus keyboard) outperforms keyboard-only screen reading has become increasingly relevant as voice assistants have matured. The system's resilience to a 30% speech recognition error rate, achieved through dialog act classification and approximate string matching, offers practical lessons for building robust voice interfaces for assistive technology. The study also reinforces well-known accessibility problems: improperly labeled form fields on major websites like Amazon directly impacted task completion, and even with speech input, these labeling failures remained barriers. For practitioners, the key insight is that voice interfaces for screen reader users should support natural, context-aware commands ("go to search box" rather than requiring exact element names) and should complement rather than replace keyboard interaction. The work anticipates the current trend toward AI-powered accessibility tools that interpret user intent rather than requiring precise commands.

Tags: screen readers · speech recognition · voice interface · web accessibility · blind users · natural language processing · assistive technology