A Flexible VXML Interpreter for Non-Visual Web Access
Yevgen Borodin · 2006 · Proceedings of the 8th International ACM SIGACCESS Conference on Computers and Accessibility (Assets '06) · doi:10.1145/1168987.1169066
Summary
This doctoral consortium paper from Stony Brook University presents VXMLSurf, an open-source VoiceXML interpreter being developed as part of the HearSay project for non-visual web browsing. VoiceXML is the W3C standard for specifying interactive voice dialogs, widely used in telephone systems but also applicable to converting web content into voice-navigable dialogs for blind users. Existing screen readers like JAWS and IBM's Home Page Reader can speak web content but provide limited interactivity and rigid dialog management. The HearSay system takes a different approach: it uses Mozilla's rendering engine to parse web pages into a frame tree, then a dialog generator creates multiple layers of VoiceXML dialogs that are processed by VXMLSurf. The three dialog layers are basic screen reading, BFS/DFS navigation (breadth-first or depth-first traversal of the page structure), and domain-specific dialogs. VXMLSurf is written in Java with three separate threads for input, output, and VoiceXML processing; it uses FreeTTS for speech synthesis and is designed to incorporate CMU Sphinx for speech recognition. The project was developed in collaboration with the Helen Keller School for the Blind.
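The three-thread split described above can be illustrated with a minimal Java sketch. It is not VXMLSurf's code: the class name, the queue-based hand-off, the STOP sentinel, and the "You said" response are all assumptions made for illustration, and plain strings stand in for recognized input and synthesized speech.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical pipeline: input thread -> dialog-processing thread -> output thread. */
class ThreeThreadSketch {
    static final String STOP = "<stop>"; // sentinel that shuts the pipeline down

    /** Feeds the commands through the three threads and returns what was "spoken". */
    static List<String> run(List<String> commands) {
        BlockingQueue<String> inputEvents = new LinkedBlockingQueue<>();
        BlockingQueue<String> utterances = new LinkedBlockingQueue<>();
        // Written only by the output thread and read only after join(), so no extra locking.
        List<String> spoken = new ArrayList<>();

        // Input thread: stands in for keyboard or speech-recognition input.
        Thread input = new Thread(() -> {
            for (String command : commands) {
                inputEvents.add(command);
            }
            inputEvents.add(STOP);
        });

        // Processing thread: stands in for the VoiceXML interpretation loop.
        Thread processor = new Thread(() -> {
            try {
                while (true) {
                    String event = inputEvents.take();
                    if (event.equals(STOP)) {
                        utterances.add(STOP);
                        return;
                    }
                    utterances.add("You said: " + event);
                }
            } catch (InterruptedException e) {
                utterances.add(STOP); // let the output thread finish
                Thread.currentThread().interrupt();
            }
        });

        // Output thread: stands in for the speech synthesizer.
        Thread output = new Thread(() -> {
            try {
                while (true) {
                    String utterance = utterances.take();
                    if (utterance.equals(STOP)) {
                        return;
                    }
                    spoken.add(utterance);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        input.start();
        processor.start();
        output.start();
        try {
            input.join();
            processor.join();
            output.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
        return spoken;
    }
}
```

Keeping input and output on their own threads is one plausible way to let a user interrupt speech while it is playing, since the output thread never blocks the thread that listens for commands.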
Key findings
VXMLSurf extends standard VoiceXML processing with several features specifically designed for web browsing by blind users. Advanced voice controls allow users to pause, resume, and restart utterances, and to change the pitch, voice, rate, and intensity of speech. Additional shortcuts enable skipping content at various levels (sentence, paragraph, section) and are treated as events with predefined handlers. Key controls were borrowed from JAWS to maintain familiarity. The interpreter's event handling allows defining new events and overriding the default handlers, which makes it highly customizable. A significant design insight was that while VoiceXML is powerful enough for dialog management, using it for all navigational controls led to excessive dialog size. The solution was to implement navigation controls within the interpreter itself, reducing VoiceXML dialog complexity. The system was designed to be extensible toward adaptive dialogs and, at the time of writing, was to be evaluated progressively by students at the Helen Keller School for the Blind.
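The idea of handling skip shortcuts as events inside the interpreter, with default handlers that can be overridden, might look roughly like the Java sketch below. Everything in it is hypothetical: the class names, the event names ("skip.sentence", "skip.paragraph"), and the representation of a page as paragraphs of sentences are illustrative assumptions, not VXMLSurf's actual API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical reading position over a page split into non-empty paragraphs of sentences. */
class ReadingCursor {
    final List<List<String>> paragraphs;
    int paragraph = 0;
    int sentence = 0;

    ReadingCursor(List<List<String>> paragraphs) {
        this.paragraphs = paragraphs;
    }

    String current() {
        return paragraphs.get(paragraph).get(sentence);
    }

    /** Moves to the next sentence, spilling into the next paragraph; stays put at the end. */
    void nextSentence() {
        if (sentence + 1 < paragraphs.get(paragraph).size()) {
            sentence++;
        } else if (paragraph + 1 < paragraphs.size()) {
            paragraph++;
            sentence = 0;
        }
    }

    /** Moves to the first sentence of the next paragraph; stays put at the last paragraph. */
    void nextParagraph() {
        if (paragraph + 1 < paragraphs.size()) {
            paragraph++;
            sentence = 0;
        }
    }
}

/** Hypothetical event table: built-in handlers that a dialog or user may override. */
class EventTable {
    private final Map<String, Consumer<ReadingCursor>> defaults = new HashMap<>();
    private final Map<String, Consumer<ReadingCursor>> overrides = new HashMap<>();

    EventTable() {
        // Event names are invented for this sketch.
        defaults.put("skip.sentence", ReadingCursor::nextSentence);
        defaults.put("skip.paragraph", ReadingCursor::nextParagraph);
    }

    /** Registers a new event or replaces the handler of an existing one. */
    void define(String event, Consumer<ReadingCursor> handler) {
        overrides.put(event, handler);
    }

    /** Runs the override if present, else the default; returns false for unknown events. */
    boolean fire(String event, ReadingCursor cursor) {
        Consumer<ReadingCursor> handler = overrides.getOrDefault(event, defaults.get(event));
        if (handler == null) {
            return false;
        }
        handler.accept(cursor);
        return true;
    }
}
```

Because the skip logic lives in the handlers rather than in each generated dialog, the dialogs themselves need no per-page navigation markup, which is the size reduction the paragraph above describes.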
Relevance
This paper represents an early step in rethinking how blind users interact with the web beyond traditional screen readers. While screen readers overlay an auditory interface on visual web pages, the HearSay/VXMLSurf approach converts web content into structured voice dialogs that can support more natural, mixed-initiative interaction: the user can interrupt, ask questions, and navigate through speech rather than simply listening to linearized page content. The open-source, modular design was forward-thinking, enabling researchers to experiment with different dialog strategies and navigation approaches. Although VoiceXML-based web browsing did not become the dominant paradigm, the underlying concept of converting web content into interactive voice dialogs anticipated modern voice assistant interactions with web content. The collaboration with the Helen Keller School for the Blind exemplifies the importance of involving target users in the development of assistive technology from the design phase onward.
Tags: VoiceXML · non-visual browsing · voice browser · blind users · screen readers · speech recognition · web accessibility · open source
Standards referenced: VoiceXML 2.0