Extracting content from accessible web pages
Suhit Gupta, Gail Kaiser · 2005 · Proceedings of the 2005 International Cross-Disciplinary Workshop on Web Accessibility (W4A)
This paper from Columbia University presents Crunch, a web proxy tool that applies heuristic-based filters to extract core content from web pages by removing clutter such as advertisements, navigation menus, spacer elements, and extraneous links. Crunch works by parsing HTML…
content extraction · screen readers · web clutter · DOM · web proxy
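
To make the filtering idea in the summary concrete, the following is a minimal sketch of heuristic clutter removal. It is not Crunch's actual implementation. The tag list, the id/class hint words, and the names ClutterFilter and extract_text are all assumptions chosen for illustration, and the sketch assumes reasonably well-formed markup in which non-void elements are explicitly closed.

```python
from html.parser import HTMLParser

# Hypothetical heuristics for illustration only; not the paper's filter set.
DROP_TAGS = {"script", "style", "iframe", "nav"}
CLUTTER_HINTS = {"ad", "ads", "banner", "nav", "menu", "sponsor"}
# Void elements never have an end tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClutterFilter(HTMLParser):
    """Collects text while skipping subtrees that look like clutter."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []      # text fragments kept as core content
        self.skip_depth = 0   # > 0 while inside a dropped subtree

    def _is_clutter(self, tag, attrs):
        if tag in DROP_TAGS:
            return True
        for name, value in attrs:
            if name in ("id", "class") and value:
                # Split "nav-menu" or "ad_banner" into separate tokens.
                tokens = value.lower().replace("-", " ").replace("_", " ").split()
                if any(tok in CLUTTER_HINTS for tok in tokens):
                    return True
        return False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside a dropped subtree, every nested element deepens it.
        if self.skip_depth or self._is_clutter(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def extract_text(html):
    """Return the text left after dropping clutter-like subtrees."""
    parser = ClutterFilter()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    '<html><body>'
    '<div class="nav-menu"><a href="/">Home</a><img src="x.gif"></div>'
    '<script>var a = 1;</script>'
    '<p>Main <b>story</b> text.</p>'
    '<div id="ad_banner"><p>Buy now</p></div>'
    '</body></html>'
)
print(extract_text(page))  # prints: Main story text.
```

Matching on whole tokens rather than substrings avoids dropping elements such as class="header" merely because the string contains "ad". A proxy-style tool would apply a filter like this to each response before passing it to the browser or screen reader; that wiring is omitted here.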