Extracting content from accessible web pages

Suhit Gupta, Gail Kaiser · 2005 · Proceedings of the 2005 International Cross-Disciplinary Workshop on Web Accessibility (W4A) · doi:10.1145/1061811.1061816

Summary

This paper from Columbia University presents Crunch, a web proxy that applies heuristic filters to extract the core content of web pages by removing clutter such as advertisements, navigation menus, spacer elements, and extraneous links. Crunch parses HTML into a Document Object Model (DOM) tree and then applies a series of tunable filters that prune non-content nodes: link clusters that likely represent navigation, table cells used for layout spacing, and images that can be replaced with their alt text. The system uses a multi-pass architecture in which the result of each filtering stage is evaluated; if too much or too little content has been removed, Crunch automatically adjusts its heuristic settings and retries. The tool also includes a genre classification system that clusters websites (news, shopping, government, education) and applies pre-tuned filter settings appropriate to each genre, since optimal extraction parameters differ significantly across site types. The authors initially hypothesized that Crunch would have little effect on WAI-compliant websites, assuming that accessible sites would already be free of clutter. This assumption proved wrong: even technically accessible pages contained significant non-content elements that screen reader users had to navigate through before reaching the actual content.
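
The filter-and-retry loop described in the summary can be sketched in a few lines of Python. This is an illustration only, not Crunch's actual code: the function names, the container tag set, the link-ratio heuristic, and every threshold are assumptions chosen for the example, and the input is assumed to be well-formed XHTML so that the standard library parser can build the tree.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical set of container tags considered for pruning.
CONTAINER_TAGS = {"div", "ul", "ol", "table", "tr", "td"}

SAMPLE_PAGE = (
    '<html><body>'
    '<div id="nav"><a href="/">Home</a> <a href="/news">News</a> '
    '<a href="/about">About</a></div>'
    '<div id="story"><img src="x.png" alt="Rocket launch"/>'
    '<p>The launch took place on schedule and the payload reached orbit. '
    '<a href="/more">Details</a></p></div>'
    '</body></html>'
)

def page_text(el):
    """All text under el, concatenated in document order."""
    return "".join(el.itertext())

def text_len(el):
    return len(page_text(el).strip())

def link_text_len(el):
    """Characters of text that sit inside <a> elements under el."""
    return sum(text_len(a) for a in el.iter("a"))

def replace_images_with_alt(root):
    """Swap every <img> for a <span> holding its alt text."""
    for parent in list(root.iter()):
        for i, child in enumerate(list(parent)):
            if child.tag == "img":
                span = ET.Element("span")
                span.text = child.get("alt", "")
                span.tail = child.tail
                parent[i] = span

def prune_link_clusters(root, max_link_ratio):
    """Drop containers whose text is mostly link text (likely navigation).

    Note: in this sketch, tail text following a removed element is dropped too.
    """
    for parent in list(root.iter()):
        for child in list(parent):
            total = text_len(child)
            if (child.tag in CONTAINER_TAGS and total > 0
                    and link_text_len(child) / total > max_link_ratio):
                parent.remove(child)

def extract(xhtml, max_link_ratio=0.5, min_keep=0.3, max_keep=0.9,
            step=0.1, max_passes=5):
    """Filter, evaluate how much text survived, adjust the setting, retry."""
    original = ET.fromstring(xhtml)
    replace_images_with_alt(original)
    original_len = text_len(original)
    tree, used = original, max_link_ratio
    for _ in range(max_passes):
        tree, used = copy.deepcopy(original), max_link_ratio
        prune_link_clusters(tree, used)
        kept = text_len(tree) / original_len if original_len else 1.0
        if kept < min_keep:      # removed too much: make the filter more lenient
            max_link_ratio = min(1.0, max_link_ratio + step)
        elif kept > max_keep:    # removed too little: make the filter stricter
            max_link_ratio = max(0.0, max_link_ratio - step)
        else:
            break
    return tree, used
```

Calling `extract(SAMPLE_PAGE)` returns the filtered tree together with the link-ratio setting that produced it. A single scalar threshold keeps the retry logic easy to follow; a genre-specific preset, as the paper describes, would amount to choosing different starting values per site type.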

Key findings

The central finding is that WCAG compliance does not eliminate web page clutter. Even sites that pass automated accessibility checks still force users, particularly screen reader users, to wade through navigation menus, banner images (with alt text), and other non-content before reaching the material they actually want. Testing on sites such as NASA Research showed that Crunch could meaningfully reduce page complexity even on accessible pages, moving the main content closer to the top. The authors also found that many sites offering "text-only" versions still had accessibility problems, most commonly non-descriptive link text ("click here", "more") and improperly labelled form fields. The paper demonstrates that accessibility and usability are distinct concerns: a page can be technically accessible while still being cluttered and difficult to use. The authors suggest that accessibility guidelines should include markup that explicitly brackets the core content of each page, a proposal that anticipates the later introduction of HTML5 semantic elements such as main, article, and nav.

Relevance

This paper highlights a distinction that remains important today: passing automated accessibility checks is not the same as providing a good user experience for people with disabilities. The clutter problem the authors identified in 2005 has been partially addressed by HTML5 landmark elements (main, nav, aside, article) and ARIA landmarks, which allow screen reader users to jump directly to content. However, many modern sites still lack proper landmark structure, so the core insight still applies: content extraction and prioritization matter for accessibility beyond mere guideline compliance. The paper also anticipated the reader mode features now built into browsers such as Firefox and Safari, which perform similar DOM-based content extraction. For practitioners, the takeaway is that semantic HTML structure that distinguishes content from chrome is not just good practice but essential for assistive technology users.
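
To make the landmark point concrete, the sketch below shows how little work is needed to reach the core content once a page marks it with a main element or role="main". It is a minimal illustration, not how any particular screen reader or reader mode works; the class and function names are hypothetical, and only the first main region on the page is considered.

```python
from html.parser import HTMLParser

class MainRegionText(HTMLParser):
    """Collects text found inside <main> or an element with role="main"."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # nesting depth inside the region; 0 means outside
        self.region_tag = None  # tag name that opened the region
        self.done = False       # True once the first region has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Count nested elements with the same tag so the matching
            # close of the region can be recognised.
            if tag == self.region_tag:
                self.depth += 1
        elif not self.done and (tag == "main" or dict(attrs).get("role") == "main"):
            self.region_tag = tag
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.region_tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def main_text(html):
    """Return whitespace-normalised text of the main region, or None if absent."""
    parser = MainRegionText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return text or None
```

For a page without any main landmark, `main_text` returns `None`, which is exactly the case where heuristic extraction of the kind Crunch performs is still needed.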

Tags: content extraction · screen readers · web clutter · DOM · web proxy · visual impairment · web accessibility

Standards referenced: WCAG 2.0