VoxBoox: A System for Automatic Generation of Interactive Talking Books

Aanchal Jain, Gopal Gupta · 2006 · Proceedings of the 8th International ACM SIGACCESS Conference on Computers and Accessibility (Assets '06) · doi:10.1145/1168987.1169052

Summary

This poster paper from the University of Texas at Dallas presents VoxBoox, a system that automatically converts digital books encoded in HTML into interactive talking books accessible by telephone. The system translates HTML pages into VoiceXML, the W3C standard markup language for voice-driven interfaces, and then enhances the VoiceXML output with navigation controls and interactive features. A user dials a toll-free number to connect to a voice browser, selects a book by speaking its title, and then listens to the content while controlling navigation through spoken commands. The architecture consists of four components: a CGI gateway that handles user requests, a transcoder that converts HTML to VoiceXML, an enhancer that adds navigation controls (skip, pause, forward, backward) and voice anchor bookmarks, and a voice browser that delivers the audio to the user. Because the system builds on existing web infrastructure, any book published in HTML on the web can be made accessible without additional work from the publisher.
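The transcoder and enhancer stages can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes one VoiceXML form per HTML paragraph, and the function names (`transcode`, `enhance`), the form ids (`p0`, `p1`, ...), and the keypad assignments are hypothetical. The paper describes spoken commands; keypad (DTMF) links are used here only to keep the example free of speech grammars.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class ParagraphExtractor(HTMLParser):
    """Collect the text of each <p> element in an HTML page."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while outside a <p>

    def _flush(self):
        # Store the buffered paragraph with whitespace normalized.
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
        self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # tolerate an unclosed previous <p>
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # keep a final <p> that was never closed


def transcode(html):
    """Transcoder step (assumed mapping): one <form> per paragraph."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    vxml = ET.Element(
        "vxml", {"version": "2.0", "xmlns": "http://www.w3.org/2001/vxml"}
    )
    for i, text in enumerate(parser.paragraphs):
        form = ET.SubElement(vxml, "form", {"id": f"p{i}"})
        ET.SubElement(form, "block").text = text
    return vxml


def enhance(vxml):
    """Enhancer step: add listener-controlled navigation to every form."""
    forms = vxml.findall("form")
    for i, form in enumerate(forms):
        if i + 1 < len(forms):
            # Hypothetical key 6: skip forward to the next paragraph.
            ET.SubElement(form, "link", {"next": f"#p{i + 1}", "dtmf": "6"})
            # Continue to the next paragraph once this one has been read.
            ET.SubElement(form.find("block"), "goto", {"next": f"#p{i + 1}"})
        if i > 0:
            # Hypothetical key 4: go back to the previous paragraph.
            ET.SubElement(form, "link", {"next": f"#p{i - 1}", "dtmf": "4"})
    return vxml
```

Calling `ET.tostring(enhance(transcode(page)), encoding="unicode")` on an HTML string `page` yields the document that a voice browser would render. Keeping the two steps separate mirrors the architecture described above: the transcoder only changes the markup, while the enhancer is where control over navigation passes from the page author to the listener.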

Key findings

The prototype was operational and addressed several key limitations of both existing talking book formats and standard VoiceXML. Unlike audio cassettes and CDs, which offer limited navigation, VoxBoox provides dynamic controls including skip, repeat, pause, keyword search, and voice anchors (bookmarks) that let users mark and return to specific passages, much as a reader flips between pages of a physical book. Unlike DAISY digital talking books, which require specialized players or software, VoxBoox is accessible over any telephone, making it usable even when a computer is unavailable or cannot be operated (such as while driving). The system also overcomes an inherent limitation of VoiceXML, namely that page navigation is controlled by the page author rather than the listener, by automatically inserting navigation tags and controls during the enhancement phase. The voice anchor feature allows users to place speech-labeled bookmarks on any paragraph and navigate between them, and keyword search lets users jump to VoiceXML forms containing a specific word.
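The navigation features described above can be modeled as a small piece of session state. The sketch below is an illustration only, not the paper's design: the class and method names are hypothetical, each paragraph stands in for one VoiceXML form, and keyword search is simplified to a case-insensitive substring match that wraps around the book.

```python
class ReadingSession:
    """Hypothetical model of one listener's position in a talking book."""

    def __init__(self, paragraphs):
        self.paragraphs = list(paragraphs)
        if not self.paragraphs:
            raise ValueError("a book needs at least one paragraph")
        self.position = 0  # index of the paragraph being read
        self.anchors = {}  # spoken label -> paragraph index

    def skip(self, step=1):
        """Move forward (or backward for a negative step), clamped to the book."""
        last = len(self.paragraphs) - 1
        self.position = max(0, min(last, self.position + step))
        return self.paragraphs[self.position]

    def place_anchor(self, label):
        """Bookmark the current paragraph under a spoken label."""
        self.anchors[label.lower()] = self.position

    def go_to_anchor(self, label):
        """Return to a bookmarked paragraph; raises KeyError if unknown."""
        self.position = self.anchors[label.lower()]
        return self.paragraphs[self.position]

    def search(self, keyword):
        """Jump to the next paragraph containing keyword, wrapping around.

        Returns the paragraph text, or None (position unchanged) if
        no paragraph matches.
        """
        needle = keyword.lower()
        count = len(self.paragraphs)
        for offset in range(1, count + 1):
            index = (self.position + offset) % count
            if needle in self.paragraphs[index].lower():
                self.position = index
                return self.paragraphs[index]
        return None
```

A listener who says "mark whale" on the second paragraph and later says "go to whale" would correspond to `place_anchor("whale")` followed by `go_to_anchor("whale")`. Storing anchors as label-to-index pairs keeps them independent of the audio itself, which is what lets a bookmark be placed on any paragraph.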

Relevance

VoxBoox represents an early attempt to solve a problem that persists today: making the vast amount of digital text on the web accessible through audio with meaningful navigation, not just linear reading. While screen readers and modern TTS have advanced considerably since 2006, the core insight still holds that aural navigation of long-form content requires different interaction paradigms than visual browsing. The system's telephone-based access model was innovative for its time, providing access without requiring a computer or specialized software. The approach of automatically transforming existing HTML content, rather than requiring publishers to create separate accessible versions, reflects the principle that accessibility should scale with existing content rather than depend on individual author effort. Although VoiceXML and telephone-based interfaces have largely been superseded by smartphone apps and smart speakers, VoxBoox's navigational concepts (bookmarking, keyword search, section skipping) are features that modern accessible reading applications continue to implement.

Tags: talking books · VoiceXML · aural navigation · visual impairment · blind users · HTML transformation · text-to-speech · accessible publishing

Standards referenced: VoiceXML 2.0 · HTML