Dialog Generation for Voice Browsing

Zan Sun, Amanda Stent, I. V. Ramakrishnan · 2006 · Proceedings of the 2006 International Cross-Disciplinary Workshop on Web Accessibility (W4A): Building the Mobile Web: Rediscovering Accessibility? · doi:10.1145/1133219.1133228

Summary

This paper presents HearSay, a voice browser developed at Stony Brook University that provides speech-driven web access for people with visual disabilities. Conventional screen readers force users to arrow through a linearized, single-column presentation of all page content, including navigation links and advertisements. HearSay instead segments web pages automatically into semantically related content blocks using structural and content analysis of the DOM tree. The system's Content Analyzer uses a pattern mining algorithm that works bottom-up on the DOM tree, identifying repeating structural patterns to group related content, for example clustering all headline news items together. The result is a "partition tree" that represents the logical organization of page content. HearSay's Interface Manager then automatically generates a VoiceXML dialog interface to this partitioned content, allowing users to navigate the page hierarchy using speech commands. A key innovation in this version is the move from domain-specific ontologies and hand-built templates (used in the original HearSay) to a general-purpose system that can handle any web page. The system uses machine learning classifiers trained on human-labeled data to select appropriate presentation strategies at runtime, rather than relying on predefined rules for specific website types.
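
The idea of grouping repeating structural patterns bottom-up can be illustrated with a minimal Python sketch. This is not the paper's pattern mining algorithm: it is a simplified stand-in that only groups adjacent siblings with identical subtree shape. The names Node, signature, and partition are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from itertools import groupby
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element or a partition-tree node."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def signature(node: Node) -> tuple:
    """Structural shape of a subtree: its tag plus the shapes of its children."""
    return (node.tag, tuple(signature(c) for c in node.children))


def partition(node: Node) -> Node:
    """Bottom-up pass: partition the children first, then wrap each run of
    two or more adjacent siblings with identical shape in a 'group' node."""
    kids = [partition(c) for c in node.children]
    grouped = []
    for _, run in groupby(kids, key=signature):
        run = list(run)
        if len(run) >= 2:
            grouped.append(Node("group", children=run))
        else:
            grouped.extend(run)
    return Node(node.tag, node.text, grouped)


# Toy page: a navigation bar, a heading, and two story paragraphs.
page = Node("body", children=[
    Node("div", children=[Node("a", "Home"), Node("a", "Sports"), Node("a", "Weather")]),
    Node("h1", "Top stories"),
    Node("p", "Story one"),
    Node("p", "Story two"),
])
tree = partition(page)
```

In this toy page the three links end up in one group inside the div, and the two story paragraphs end up in another, so a dialog layer could present each group as a single choice instead of reading every element in sequence.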

Key findings

HearSay introduces two complementary navigation strategies. Breadth-first navigation (BFN) presents all child partitions of a section so users can choose among them. Depth-first navigation (DFN) presents children one at a time with yes/no choices, which matters when a partition has many children; the paper cites Miller's finding that humans can hold roughly seven items in working memory. Users can switch between strategies and adjust verbosity across three levels, from minimal type information to full structural and content summaries. The system uses trained classifiers to make three decisions at runtime: whether a partition is a "browsing" partition (content to be read aloud) or a "searching" partition (requiring a summary for navigation); whether a searching partition should receive a structural or a content-based summary; and which sentences within content-based partitions are important enough to include in summaries. Using support vector machines with 44 structural features, the classifiers achieved 86% accuracy for browsing/searching classification and 88% for summary type classification. Decision tree classifiers for identifying important sentences achieved 89.8% accuracy using the J48 algorithm (Weka's implementation of C4.5), compared to a 79.8% baseline of simply selecting the first sentence. The system exploits visual formatting cues in the original HTML, such as font size and position, to identify important content, effectively reverse-engineering the visual emphasis that sighted users would perceive.
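
The difference between the two navigation strategies can be made concrete with a short Python sketch. The prompt wording, the Partition class, and the function names are hypothetical. The automatic switch at seven children is an assumption for illustration only: in the summary above, users choose the strategy themselves.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: threshold borrowed from the "roughly seven items" figure.
MAX_SPOKEN_CHOICES = 7


@dataclass
class Partition:
    """Hypothetical node of a partition tree."""
    label: str
    children: List["Partition"] = field(default_factory=list)


def bfn_prompt(p: Partition) -> str:
    """Breadth-first: list every child at once and let the user pick one."""
    names = ", ".join(c.label for c in p.children)
    return f"{p.label} has {len(p.children)} sections: {names}. Which one?"


def dfn_prompts(p: Partition) -> List[str]:
    """Depth-first: offer children one at a time as yes/no questions."""
    return [f"Go to {c.label}? Say yes or no." for c in p.children]


def suggest_strategy(p: Partition) -> str:
    """Suggest DFN once a full list would exceed the spoken-choice limit."""
    return "BFN" if len(p.children) <= MAX_SPOKEN_CHOICES else "DFN"


news = Partition("Front page", [
    Partition("Headlines"), Partition("Weather"), Partition("Sports"),
])
links = Partition("Links", [Partition(f"Link {i}") for i in range(12)])

print(bfn_prompt(news))
print(dfn_prompts(news)[0])
print(suggest_strategy(news), suggest_strategy(links))
```

A three-section front page fits comfortably in one breadth-first prompt, while a twelve-item link list is better served by one yes/no question per item.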

Relevance

HearSay represents an important early attempt to solve a problem that persists in screen reader usage today: the cognitive overload of navigating complex web pages through linear audio presentation. The system's approach of automatically segmenting pages and generating hierarchical dialog interfaces anticipated features that later appeared in modern screen readers, such as landmark navigation and heading-based browsing. The use of machine learning to classify page regions and select presentation strategies was ahead of its time, foreshadowing current AI-powered accessibility tools. For practitioners, the paper's analysis of browsing versus searching behavior within a single page remains relevant — it highlights why semantic structure and meaningful headings matter so much for non-visual users. The verbosity control system also illustrates an important accessibility principle: different users need different levels of detail, and interfaces should adapt to expertise level rather than imposing a one-size-fits-all approach.

Tags: voice browsing · screen readers · visual impairment · web page segmentation · content summarization · machine learning · VoiceXML · dialog systems · assistive technology

Standards referenced: VoiceXML 2.0