Effects of sampling methods on web accessibility evaluations

Giorgio Brajnik, Andrea Mulas, Claudia Pitton · 2007 · Proceedings of the 9th International ACM SIGACCESS Conference on Computers and Accessibility (Assets '07) · doi:10.1145/1296843.1296855

Summary

Brajnik, Mulas, and Pitton's Assets '07 paper is a large, careful empirical investigation of the sampling step in web-accessibility evaluation: the step at which an evaluator picks which pages of a site to actually test. For any site above a trivial size, testing every page is rarely feasible, yet the paper argues (correctly) that the wider accessibility-evaluation literature had largely overlooked how much the choice of sampling method biases the resulting accessibility score. The authors enumerate the main families of sampling methods in use: ad hoc selection (home page, site map, contact page, one page per subsection, as recommended by W3C/WAI and UWEM); uniform random sampling; two random-walk methods from the literature (including the European Internet Accessibility Observatory's approach); and stratified sampling based on the 'error profile' of each page (a vector summarising how each automatically-testable checkpoint fares on that page), with pages clustered using PAM or CLARA. The experiment pairs 13 sampling methods with 7 sample sizes (1, 2, 5, 10, 20, 50, 100) and three accessibility metrics: straight WCAG 1.0 AA conformance, the WAQM quantitative metric, and UWEM 1.0. Thirty-two web sites across six genres (newspapers, universities, portals, institutions, local government, operational commerce) were crawled to 1,000 pages each using HTTrack, giving 32,000 pages of data, and every page was analysed with the LIFT accessibility testing tool. For every combination of method, size, metric, and site, the authors drew 30 samples, computed the metric on each sample, and measured inaccuracy against the metric value on the full 1,000-page set. Pairwise differences were tested with two-tailed z-tests at alpha = 0.01.
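The resampling procedure described above can be illustrated with a minimal sketch. It is not the authors' code: the synthetic site, the function names, and the simplified conformance-style measure (the share of site-wide violated checkpoints that a sample fails to reveal) are all assumptions made for illustration, and only uniform random sampling is shown.

```python
import random
from statistics import mean

SIZES = [1, 2, 5, 10, 20, 50, 100]  # sample sizes listed in the summary


def violated_checkpoints(pages):
    """Union of checkpoints violated on at least one of the given pages."""
    found = set()
    for page in pages:
        found |= page
    return found


def conformance_inaccuracy(sample, site):
    """Fraction of the site's violated checkpoints that the sample misses.

    Assumption: a stand-in for the paper's WCAG inaccuracy measure.
    """
    truth = violated_checkpoints(site)
    if not truth:
        return 0.0
    return 1.0 - len(violated_checkpoints(sample) & truth) / len(truth)


def make_synthetic_site(n_pages=1000, n_checkpoints=20, seed=0):
    """Hypothetical data: each page is the set of checkpoints it violates."""
    rng = random.Random(seed)
    # Each checkpoint gets its own failure probability; some are rare.
    probs = [rng.uniform(0.005, 0.6) for _ in range(n_checkpoints)]
    return [
        frozenset(c for c, p in enumerate(probs) if rng.random() < p)
        for _ in range(n_pages)
    ]


def mean_inaccuracy(site, size, repetitions=30, seed=1):
    """Draw `repetitions` uniform random samples and average the inaccuracy."""
    rng = random.Random(seed)
    return mean(
        conformance_inaccuracy(rng.sample(site, size), site)
        for _ in range(repetitions)
    )


site = make_synthetic_site()
results = {size: mean_inaccuracy(site, size) for size in SIZES}
for size, inaccuracy in results.items():
    print(f"{size:>3} pages: mean inaccuracy {inaccuracy:.3f}")
```

On synthetic data like this, rare violations are what small samples tend to miss, which is the mechanism behind the sample-size sensitivity of conformance-style metrics.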

Key findings

The choice of sampling method matters, but how much depends heavily on the metric: the effect is dramatic for WCAG conformance and small for WAQM and UWEM. With WCAG conformance, inaccuracy ranged from 1.2% (best case, PassFail-cos with a 100-page sample) to over 50% (worst case, with a single-page sample), meaning that more than half of the checkpoints showing a violation could go undetected under a small ad hoc sample. The worst single method overall was ad hoc selection (37.7% mean inaccuracy), which is, unfortunately, the method most commonly used in practice and recommended by W3C/WAI. The best methods for WCAG were stratified sampling using FailRate or PassFail error profiles with cosine distance. For WAQM and UWEM, by contrast, differences between methods were small: inaccuracy ranged from about 0.5% to only about 4-7% across all methods and sizes, and a surprising finding, flagged in the abstract, was that a *single-page* sample can yield WAQM error below 3.9% and UWEM error below 5%. Sample size has a much larger effect on conformance accuracy than on WAQM or UWEM: 50-page samples are needed to bring WCAG inaccuracy down to 10% and 100 pages to reach 5%, whereas 10 pages suffice for 2% WAQM/UWEM error. Structural features of the site graph (average degree, clustering index, pages-per-level) showed only limited correlation with sampling accuracy (at most Pearson r ≈ 0.5, between WCAG accuracy and average degree), suggesting that method choice is largely independent of site topology.
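To make the best-performing family more concrete, here is a rough sketch of stratified sampling over per-page error profiles using cosine distance. Several things are assumptions rather than details taken from the paper: the profiles are hypothetical per-checkpoint failure rates, the clustering is a simplified alternating k-medoids rather than the full PAM or CLARA procedures, and the round-robin allocation across clusters is just one plausible way to draw from the strata.

```python
import math
import random


def cosine_distance(u, v):
    """1 - cosine similarity; two all-zero profiles count as identical."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 and nv == 0.0:
        return 0.0
    if nu == 0.0 or nv == 0.0:
        return 1.0
    return 1.0 - dot / (nu * nv)


def k_medoids(profiles, k, rng, max_iter=20):
    """Simplified alternating k-medoids (not full PAM); returns one label per page."""
    n = len(profiles)
    medoids = rng.sample(range(n), k)
    labels = [0] * n
    for _ in range(max_iter):
        # Assignment step: each page joins its nearest medoid.
        labels = [
            min(range(k),
                key=lambda c: cosine_distance(profiles[i], profiles[medoids[c]]))
            for i in range(n)
        ]
        # Update step: the new medoid minimises total distance within its cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the old medoid for an empty cluster
                continue
            new_medoids.append(min(
                members,
                key=lambda m: sum(cosine_distance(profiles[m], profiles[j])
                                  for j in members),
            ))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels


def stratified_sample(labels, size, rng):
    """Pick page indices round-robin across clusters, largest cluster first."""
    clusters = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)
    for members in clusters.values():
        rng.shuffle(members)
    order = sorted(clusters.values(), key=len, reverse=True)
    picked = []
    while len(picked) < size:
        progressed = False
        for members in order:
            if members and len(picked) < size:
                picked.append(members.pop())
                progressed = True
        if not progressed:  # fewer pages than requested
            break
    return picked


rng = random.Random(42)
# Hypothetical failure-rate profiles: three page templates plus small jitter.
templates = [[0.9, 0.1, 0.0, 0.0], [0.0, 0.8, 0.7, 0.0], [0.0, 0.0, 0.1, 0.9]]
profiles = [
    [min(1.0, max(0.0, x + rng.uniform(-0.05, 0.05))) for x in templates[i % 3]]
    for i in range(90)
]
labels = k_medoids(profiles, k=3, rng=rng)
sample = stratified_sample(labels, size=6, rng=rng)
print(sorted(sample))
```

The intuition is that pages built from the same template tend to fail the same checkpoints, so drawing from every cluster is more likely to surface each distinct kind of violation than drawing the same number of pages uniformly.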

Relevance

For anyone running automated or semi-automated accessibility audits on large sites — accessibility-observatory operators, government compliance units, commercial auditing firms, search engines that weight by accessibility — this paper is one of the earliest rigorous warnings that the sampling method you choose is itself a first-class design decision that can change your reported accessibility score by tens of percentage points. The finding that ad hoc sampling — which remains ubiquitous in compliance audits — is the worst-performing method against conformance metrics is particularly important. The paper's framing of 'inaccuracy' as having both systematic (mean error) and non-systematic (standard deviation) components is a useful formalism for anyone building accessibility dashboards or tracking accessibility over time. Limitations the authors acknowledge: all data from a single tool (LIFT), site sizes capped at 1,000 pages, the assumption that automatically-testable checkpoints share the distribution of manually-testable ones (which is probably optimistic), and a study population skewed toward Italian sites. Methods have evolved since 2007 — WCAG 2.2, AT-Automation rulesets, and ACT Rules changed the underlying checkpoint landscape — but the sampling question and the methodological frame remain directly applicable.
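For readers building dashboards, the two-component view of inaccuracy can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact definitions: the error is taken as the signed difference between a sample's metric value and the full-site value, the systematic part is the mean of those errors, the non-systematic part is their standard deviation, and the data are synthetic. The helper for comparing two methods uses a standard two-sample z statistic, in the spirit of the two-tailed tests at alpha = 0.01 mentioned in the summary.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def inaccuracy_components(sample_values, full_site_value):
    """Return (systematic, non_systematic) parts of the sampling error.

    Assumption: errors are signed differences from the full-site value.
    """
    errors = [v - full_site_value for v in sample_values]
    return mean(errors), stdev(errors)


def two_tailed_z_test(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Two-sample z statistic and two-tailed p-value (assumes non-zero spread)."""
    se = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    z = (mean_a - mean_b) / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


# Synthetic example: 30 sampled metric values for each of two hypothetical methods.
rng = random.Random(7)
full_site_value = 0.40
method_a = [rng.gauss(0.41, 0.02) for _ in range(30)]  # small bias, small spread
method_b = [rng.gauss(0.48, 0.06) for _ in range(30)]  # larger bias, larger spread

bias_a, spread_a = inaccuracy_components(method_a, full_site_value)
bias_b, spread_b = inaccuracy_components(method_b, full_site_value)
z, p = two_tailed_z_test(bias_a, spread_a, 30, bias_b, spread_b, 30)

print(f"method A: systematic {bias_a:+.3f}, non-systematic {spread_a:.3f}")
print(f"method B: systematic {bias_b:+.3f}, non-systematic {spread_b:.3f}")
print(f"z = {z:.2f}, p = {p:.4f}, significant at 0.01: {p < 0.01}")
```

Tracking the two components separately is useful over time: a stable bias can be corrected for, whereas a large spread means a change between two audits may reflect nothing more than which pages happened to be sampled.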

Tags: web accessibility · accessibility evaluation · sampling methods · accessibility metrics · conformance testing · automated testing · WCAG · quality assurance · research methodology · accessibility audit

Standards referenced: WCAG 1.0 · Section 508 · UWEM 1.0