DOM block clustering for enhanced sampling and evaluation

Simon Harper, Anwar Ahmad Moon, Markel Vigo, Giorgio Brajnik, Yeliz Yesilada · 2015 · Proceedings of the 12th International Web for All Conference (W4A) · doi:10.1145/2745555.2746649

Summary

This paper addresses a fundamental challenge in web accessibility evaluation: websites with thousands or tens of thousands of pages cannot practically be evaluated in full, yet current sampling methods (random, best-guess, or convenience samples) cannot be trusted because they take no account of the website's underlying structure. The authors introduce the concept of "website demographics" (quantifiable statistics about a site's population of pages) and propose a technique based on Document Object Model (DOM) block-level similarity to cluster pages that share the same template structure. The method works in four stages: crawling the website, building a filtered DOM representation of each page (stripping content and retaining only structural elements), comparing the filtered DOMs to identify structurally identical pages, and clustering those pages into groups. Representative pages are then selected from each cluster by stratified sampling, on the assumption that pages sharing the same DOM structure will exhibit the same types of accessibility issues. The tool was tested on five websites from different Alexa categories (arts, health, news, sports, teens), each crawled up to approximately 20,000 pages. The source code was made openly available in line with the Science Code Manifesto.
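
The filtering, comparison, and clustering stages described above can be illustrated with a minimal Python sketch. It is not the authors' tool: the `BLOCK_TAGS` set, the function names, and the use of the standard-library `html.parser` are all assumptions made for illustration, and the paper's actual element profiles are not reproduced here. Each page is reduced to a sequence of retained structural tags, and pages with identical sequences fall into the same cluster.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumed set of structural tags to keep; chosen for illustration only,
# not taken from the paper's demographic profiles.
BLOCK_TAGS = frozenset({
    "html", "body", "div", "section", "article", "header", "footer", "nav",
    "main", "aside", "ul", "ol", "li", "table", "tr", "td", "th", "form",
    "p", "h1", "h2", "h3", "h4", "h5", "h6",
})


class StructureParser(HTMLParser):
    """Collects open/close tokens for retained tags and ignores all text."""

    def __init__(self, keep):
        super().__init__()
        self.keep = keep
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes are deliberately dropped.
        if tag in self.keep:
            self.tokens.append("<" + tag + ">")

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.tokens.append("</" + tag + ">")


def structural_signature(html, keep=BLOCK_TAGS):
    """Return a string describing only the retained tag structure of a page."""
    parser = StructureParser(keep)
    parser.feed(html)
    parser.close()
    return "".join(parser.tokens)


def cluster_pages(pages, keep=BLOCK_TAGS):
    """Group URLs whose signatures are identical; largest clusters first.

    `pages` maps a URL to its HTML source (crawling is out of scope here).
    """
    clusters = defaultdict(list)
    for url, html in pages.items():
        clusters[structural_signature(html, keep)].append(url)
    return sorted(clusters.values(), key=len, reverse=True)


pages = {
    "/a": "<html><body><div><p>Hello <b>world</b></p></div></body></html>",
    "/b": "<html><body><div><p>Other <a href='#'>link</a></p></div></body></html>",
    "/c": "<html><body><ul><li>Item</li></ul></body></html>",
}
clusters = cluster_pages(pages)
```

In this toy input, `/a` and `/b` differ only in text and inline elements, so they share a signature, while `/c` forms its own cluster. Adding or removing tags from the retained set changes how many clusters appear, which is the kind of effect the different element-inclusion profiles would have.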

Key findings

The DOM block clustering technique achieved 80% site coverage by evaluating only approximately 0.1-4% of pages across the test sites. For example, healthlinks.com (7,050 pages) required only 10 representative pages (0.1%) to cover 80% of the site, while ocala.com (19,473 pages) needed just 68 pages (0.3%). The cluster size distribution followed a clear Zipfian pattern: a few clusters contained the majority of pages, while most clusters were small (typically just two pages). This power-law distribution explains why such a dramatic reduction is possible: one representative from each of the few largest clusters already accounts for most of the site. The authors tested four demographic profiles with varying levels of element inclusion. The strictest (d1, excluding all inline elements) produced the fewest clusters (437 for the Manchester CS site), while the most inclusive (d4, retaining anchor and image elements) produced more (504). The technique had limitations, however: lonelyplanet.com/france had 33% unclassified pages and never reached 80% coverage through clustering alone. The strict DOM comparison also split pages with minor structural variations into many small clusters, suggesting that a distance metric with a tolerance threshold would improve results.
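
The coverage arithmetic can be sketched as follows. This is an illustrative calculation, not the paper's code: the function name is hypothetical, the cluster sizes are made-up numbers rather than data from the study, and it assumes one representative page is evaluated per cluster and that unclassified pages belong to no cluster.

```python
def representatives_for_coverage(cluster_sizes, total_pages, target=0.8):
    """Return how many clusters (one representative page each) are needed
    to cover `target` of the site, or None if clustering cannot reach it.

    Clusters are taken largest first, so a skewed (Zipf-like) size
    distribution reaches the target with very few representatives.
    """
    covered = 0
    for count, size in enumerate(sorted(cluster_sizes, reverse=True), start=1):
        covered += size
        if covered / total_pages >= target:
            return count
    return None  # too many pages fall outside the clusters


# Hypothetical site of 100 pages: two large templates dominate.
skewed = representatives_for_coverage([60, 25, 5, 2, 2], total_pages=100)

# Hypothetical site where 40 of 100 pages are unclassified, so clustered
# pages alone can never account for 80% of the site.
unreachable = representatives_for_coverage([40, 20], total_pages=100)
```

With these assumed inputs, the first site reaches the target after two representatives, while the second returns `None`, mirroring the situation where a large share of unclassified pages keeps clustering alone below 80% coverage.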

Relevance

This research directly addresses a practical bottleneck in enterprise accessibility work: the impossibility of manually evaluating every page on a large website. The concept of website demographics provides an objective, reproducible foundation for sampling that removes the subjectivity of current ad-hoc approaches. For accessibility practitioners, the key insight is that most large websites are built on a relatively small number of templates, and fixing accessibility issues in one template propagates benefits across all pages using that template. This aligns with the "readily achievable" prioritization concept: fixing the template used by the largest cluster delivers the greatest accessibility improvement per unit of effort. The technique complements the W3C's Website Accessibility Conformance Evaluation Methodology (WCAG-EM), which requires identifying common page types and essential functionalities but relies on manual identification. While the strict DOM comparison has limitations with highly dynamic or non-templated sites, the underlying approach of structure-based clustering remains a valuable strategy for making large-scale accessibility evaluation feasible.
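
One way to relax strict DOM comparison is to treat two pages as the same template when their tag sequences are similar enough, not only when they are identical. The sketch below is an assumption about how such a tolerance might look, not something the paper implements: the similarity measure (`difflib.SequenceMatcher` ratio), the greedy assignment strategy, the function name, and the 0.9 threshold are all illustrative choices.

```python
from difflib import SequenceMatcher


def cluster_with_tolerance(signatures, threshold=0.9):
    """Greedily group pages whose tag sequences are at least `threshold` similar.

    `signatures` maps a URL to a list of structural tag names. Each page is
    compared with the first member (exemplar) of every existing cluster and
    joins the first one that is similar enough; otherwise it starts a new
    cluster. The result depends on input order, which is a known weakness
    of this simple greedy scheme.
    """
    clusters = []  # list of (exemplar_tokens, member_urls)
    for url, tokens in signatures.items():
        for exemplar, members in clusters:
            ratio = SequenceMatcher(None, exemplar, tokens, autojunk=False).ratio()
            if ratio >= threshold:
                members.append(url)
                break
        else:
            clusters.append((tokens, [url]))
    return [members for _, members in clusters]


# Hypothetical signatures: /a and /b differ by one extra paragraph.
signatures = {
    "/a": ["html", "body", "div"] + ["p"] * 7,
    "/b": ["html", "body", "div"] + ["p"] * 8,
    "/c": ["html", "body", "table", "tr", "td"],
}
tolerant = cluster_with_tolerance(signatures, threshold=0.9)
strict = cluster_with_tolerance(signatures, threshold=1.0)
```

With a threshold of 1.0 the function behaves like exact matching and keeps all three pages apart; at 0.9 the one-paragraph difference between `/a` and `/b` is tolerated and they merge. Choosing the threshold is a trade-off: too low and distinct templates merge, too high and the many-small-clusters problem returns.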

Tags: accessibility evaluation · automated testing · web crawling · sampling methodology · DOM analysis · website demographics · WCAG compliance · clustering

Standards referenced: WCAG 2.0 · WCAG 1.0 · WCAG-EM