Content Extraction

Also known as: Web Content Extraction, Text Extraction

The process of separating the meaningful content of a web page from its surrounding structural markup, navigation elements, and boilerplate text. Content extraction is valuable for assistive technology users because it lets them reach the substantive information on a page without first navigating past repeated headers, menus, advertisements, and other non-content elements. Techniques include identifying and removing page templates, distinguishing content that has changed from content that stays static, and presenting only the primary content area. Modern equivalents include reader mode in web browsers and the use of semantic HTML landmarks, such as the main element or the ARIA main role, to identify main content regions.
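
As an illustration of the landmark-based approach, the following is a minimal sketch in Python that keeps only the text inside the main landmark and drops boilerplate elements nested within it. It uses only the standard library's html.parser module. The names MainContentExtractor, extract_main_text, SKIP_TAGS, and VOID_TAGS are hypothetical, the list of boilerplate elements is an assumption made for this example, and the depth tracking assumes well-formed markup in which every non-void element has an explicit closing tag. It is not how any particular browser reader mode or screen reader works.

```python
from html.parser import HTMLParser

# Elements treated as boilerplate in this sketch (an assumption, not a standard list).
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}
# Void elements have no closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class MainContentExtractor(HTMLParser):
    """Collects text found inside <main> or an element with role="main"."""

    def __init__(self):
        super().__init__()
        self.main_depth = 0  # > 0 while inside the main landmark
        self.skip_depth = 0  # > 0 while inside a boilerplate element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.main_depth:
            # Already inside the landmark: track nesting of every element.
            self.main_depth += 1
            if self.skip_depth or tag in SKIP_TAGS:
                self.skip_depth += 1
        elif tag == "main" or dict(attrs).get("role") == "main":
            self.main_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.main_depth:
            return
        self.main_depth -= 1
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.main_depth and not self.skip_depth:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.chunks.append(text)


def extract_main_text(html):
    """Return the text of the main content region, one text run per line."""
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up sample page for demonstration purposes.
page = """
<html><body>
<header><h1>Site name</h1></header>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<main>
  <h2>Article title</h2>
  <p>First paragraph of the article.</p>
  <aside>Advertisement</aside>
  <p>Second paragraph.</p>
</main>
<footer>Copyright notice</footer>
</body></html>
"""

print(extract_main_text(page))
```

For the sample page, the site header, navigation links, advertisement, and footer are omitted, and only the article title and its two paragraphs are printed. Pages that lack a main landmark produce no output with this sketch; handling them would require heuristic approaches such as the template detection mentioned above.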

Category: Web Accessibility · Assistive Technology · Information Accessibility · Content

Related: Web Navigation · Screen Reader · Speech-Based Navigation · Semantic HTML

Sources

https://doi.org/10.1145/354324.354343