Computer vision-based analysis of web page structure for assistive interfaces

Michael Cormier · 2016 · Proceedings of the 13th International Web for All Conference (W4A) · doi:10.1145/2899475.2899506

Summary

This doctoral consortium paper proposes a novel approach to understanding web page structure: applying computer vision to rendered page images, rather than relying on the DOM tree or source code as most existing web page segmentation methods do. The author argues that the visual rendering is a better representation of a page's semantic structure than its underlying code, because designers create the visual layout to convey structure to users, while the source code is merely an intermediate representation intended to produce the correct rendering. Critically, images, Flash objects, embedded PDFs, and other non-HTML content appear as "black boxes" in the DOM tree, but their internal structure is plainly visible in the rendered image. The proposed system has two components: a segmentation algorithm that recursively divides the page image into a tree of axis-parallel rectangular regions, and a classification algorithm that labels each region with its semantic role using ARIA roles. Both take a principled Bayesian approach. The system is intended to power front-end assistive applications, such as more selective screen reader output for visually impaired users and clutter reduction or content zooming for users with cognitive disabilities.
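The output of the segmentation step described above can be pictured as a tree of rectangles. The following is a minimal sketch of such a data structure, not the author's implementation: the `Region` class, its field names, and the `is_valid_tree` check are hypothetical and exist only to illustrate the constraint that child regions are axis-parallel rectangles that tile their parent exactly. The role strings are standard ARIA landmark roles; the pixel coordinates are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Region:
    """One axis-parallel rectangle of the page image, in pixels (hypothetical type)."""
    x: int
    y: int
    width: int
    height: int
    role: Optional[str] = None  # ARIA role, to be filled in by a classifier
    children: List["Region"] = field(default_factory=list)

    @property
    def area(self) -> int:
        return self.width * self.height

    def contains(self, other: "Region") -> bool:
        return (self.x <= other.x and self.y <= other.y
                and other.x + other.width <= self.x + self.width
                and other.y + other.height <= self.y + self.height)

    def overlaps(self, other: "Region") -> bool:
        # Rectangles that only share an edge do not count as overlapping.
        return (self.x < other.x + other.width and other.x < self.x + self.width
                and self.y < other.y + other.height and other.y < self.y + self.height)


def is_valid_partition(parent: Region) -> bool:
    """True if the children tile the parent: all inside, none overlapping, no gaps."""
    kids = parent.children
    if not kids:
        return True
    if any(k.width <= 0 or k.height <= 0 or not parent.contains(k) for k in kids):
        return False
    for i, a in enumerate(kids):
        if any(a.overlaps(b) for b in kids[i + 1:]):
            return False
    # Disjoint children inside the parent cover it exactly when the areas add up.
    return sum(k.area for k in kids) == parent.area


def is_valid_tree(root: Region) -> bool:
    """Check the tiling constraint at every level of the segmentation tree."""
    return is_valid_partition(root) and all(is_valid_tree(c) for c in root.children)


# Invented example layout: a header above a main column and a sidebar.
page = Region(0, 0, 1000, 800, children=[
    Region(0, 0, 1000, 100, role="banner"),
    Region(0, 100, 700, 700, role="main"),
    Region(700, 100, 300, 700, role="complementary"),
])
print(is_valid_tree(page))
```

Because every level must tile its parent, a segmenter only has to choose where to split each rectangle, which is one way to see why the search space is far smaller than for natural images.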

Key findings

The segmentation algorithm exploits two properties specific to web page images: regions are axis-parallel rectangles, and child regions fully cover their parent without overlapping. These constraints dramatically reduce the search space compared to natural image segmentation. The algorithm uses Bayesian probability to identify edges that "stand out" from their surroundings, which addresses a key challenge: a subtle change in background color can be structurally significant, while strong edges within a paragraph of text are irrelevant. The classification algorithm uses a hidden Markov tree (HMT), a probabilistic graphical model structured according to the segmentation tree, to infer the most likely ARIA role labels for all regions simultaneously from visual features (position, size, aspect ratio, text/image proportions) and inter-region relationships. Pearl's belief propagation algorithm makes this inference efficient. At the time of publication, the first version of the segmentation algorithm had been tested with good results. The author planned a user study in which elderly users would work with a front-end interface that suppresses non-focus regions to assist with cognitive challenges. The approach is also implementation-independent: it works whether content is HTML, embedded images, Flash, or PDF, and it is unaffected by changes in implementation languages such as the transition to HTML5.
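To make the joint labeling idea concrete, here is a minimal sketch of most-likely-labeling inference on a hidden Markov tree. It uses the max-product form of belief propagation on a tree (one upward pass, then a downward backtracking pass); which variant the paper uses is an assumption. The `Node` class, the three-role label set, and every probability below are invented for illustration, and the emission tables stand in for whatever likelihood the real system derives from visual features.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List

ROLES = ["document", "banner", "main"]  # tiny illustrative subset of ARIA roles


@dataclass
class Node:
    """A region in the segmentation tree (hypothetical type)."""
    name: str
    emission: Dict[str, float]  # P(observed features | role); must be > 0
    children: List["Node"] = field(default_factory=list)


def map_labels(root: Node,
               prior: Dict[str, float],
               transition: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Return the jointly most likely role for every node.

    transition[p][c] is P(child role = c | parent role = p). All
    probabilities must be strictly positive because we work in log space.
    """
    best_child: Dict[tuple, str] = {}  # (child id, parent role) -> best child role

    def upward(node: Node) -> Dict[str, float]:
        # score[r] = best log-probability of this subtree if node has role r
        score = {r: math.log(node.emission[r]) for r in ROLES}
        for child in node.children:
            child_score = upward(child)
            for r in ROLES:
                options = {c: math.log(transition[r][c]) + child_score[c]
                           for c in ROLES}
                winner = max(options, key=options.get)
                best_child[(id(child), r)] = winner
                score[r] += options[winner]
        return score

    root_score = upward(root)
    root_total = {r: math.log(prior[r]) + root_score[r] for r in ROLES}

    labels: Dict[str, str] = {}

    def downward(node: Node, role: str) -> None:
        # Follow the choices recorded during the upward pass.
        labels[node.name] = role
        for child in node.children:
            downward(child, best_child[(id(child), role)])

    downward(root, max(root_total, key=root_total.get))
    return labels


# Invented numbers: a page whose two children look like a header and a body.
prior = {"document": 0.8, "banner": 0.1, "main": 0.1}
transition = {
    "document": {"document": 0.10, "banner": 0.45, "main": 0.45},
    "banner":   {"document": 0.10, "banner": 0.80, "main": 0.10},
    "main":     {"document": 0.10, "banner": 0.10, "main": 0.80},
}
page = Node("page", {"document": 0.6, "banner": 0.2, "main": 0.2}, children=[
    Node("header", {"document": 0.1, "banner": 0.7, "main": 0.2}),
    Node("body",   {"document": 0.1, "banner": 0.2, "main": 0.7}),
])
labels = map_labels(page, prior, transition)
print(labels)
```

Each node is visited once per pass, so the cost grows linearly with the number of regions (times the square of the number of roles), which is the sense in which tree-structured belief propagation keeps joint inference efficient.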

Relevance

This research addresses a genuine gap in assistive technology: screen readers and other assistive tools depend on well-structured HTML and ARIA markup to work effectively. Because much web content lacks proper semantic markup, and embedded objects such as images, PDFs, and interactive media are opaque to DOM-based tools, a vision-based approach could provide accessibility information where none currently exists. The potential applications include automatically inferring ARIA roles for pages that lack them, enabling screen reader navigation through embedded content, and reducing visual clutter for users with cognitive disabilities. The approach is particularly relevant for complex educational web pages with embedded slides and interactive elements. However, as a doctoral proposal, the work was at an early stage: only the segmentation algorithm had been tested, the classification system was under development, and no user study had yet been conducted. The fundamental trade-off is that analyzing rendered images loses access to the text content itself, so a practical system would likely need to combine vision-based structural analysis with text extraction from the DOM.
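One way the combination mentioned in the last sentence might look is sketched below. This is purely illustrative and not described in the paper: it assumes the browser can report a bounding box for each piece of DOM text, and it attaches each piece to the deepest visually derived region that fully contains it. The `Box`, `VisualRegion`, and `TextNode` types and all coordinates are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Box:
    x: int
    y: int
    width: int
    height: int

    def contains(self, other: "Box") -> bool:
        return (self.x <= other.x and self.y <= other.y
                and other.x + other.width <= self.x + self.width
                and other.y + other.height <= self.y + self.height)


@dataclass
class VisualRegion:
    """A region from the vision-based segmentation, with its inferred role."""
    box: Box
    role: str
    children: List["VisualRegion"] = field(default_factory=list)


@dataclass
class TextNode:
    """Text taken from the DOM, with its on-screen bounding box (assumed available)."""
    box: Box
    text: str


def deepest_region(region: VisualRegion, box: Box) -> Optional[VisualRegion]:
    """Return the smallest region in the tree that fully contains the box."""
    if not region.box.contains(box):
        return None
    for child in region.children:
        hit = deepest_region(child, box)
        if hit is not None:
            return hit
    return region


def text_by_role(root: VisualRegion, nodes: List[TextNode]) -> Dict[str, List[str]]:
    """Group DOM text under the role of the visual region it falls in."""
    grouped: Dict[str, List[str]] = {}
    for node in nodes:
        region = deepest_region(root, node.box)
        if region is not None:  # text outside the page image is skipped
            grouped.setdefault(region.role, []).append(node.text)
    return grouped


# Invented layout and text positions.
layout = VisualRegion(Box(0, 0, 1000, 800), "document", children=[
    VisualRegion(Box(0, 0, 1000, 100), "banner"),
    VisualRegion(Box(0, 100, 1000, 700), "main"),
])
texts = [
    TextNode(Box(10, 10, 200, 30), "Site title"),
    TextNode(Box(10, 150, 600, 400), "Article body"),
    TextNode(Box(10, 90, 100, 20), "Straddles two regions"),
]
grouped = text_by_role(layout, texts)
print(grouped)
```

A screen reader front end could then read only the text grouped under a chosen role, which is the kind of selective output the paper has in mind.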

Tags: computer vision · web page segmentation · screen reader accessibility · cognitive accessibility · machine learning · assistive technology · ARIA

Standards referenced: WAI-ARIA