Quantitative Metrics for Measuring Web Accessibility

Markel Vigo, Myriam Arrue, Giorgio Brajnik, Raffaella Lomuscio, Julio Abascal · 2007 · Proceedings of the 2007 International Cross-Disciplinary Conference on Web Accessibility (W4A) · doi:10.1145/1243441.1243465

Summary

This paper addresses a fundamental limitation of WCAG conformance levels (A, AA, AAA): each level is a pass/fail threshold, so the scheme cannot distinguish a site that meets all Priority 1 checkpoints from one that meets all Priority 1 and nearly all Priority 2 checkpoints. Both receive an A rating. The authors propose the Web Accessibility Quantitative Metric (WAQM), a normalized 0-100 score calculated automatically from the output of automated evaluation tools. The metric targets three application scenarios: quality assurance within web engineering (tracking accessibility across development iterations), information retrieval (re-ranking search results by accessibility, as Google's experimental Accessible Search attempted), and accessibility monitoring (tracking how a site's accessibility evolves over time and comparing sites in observatory-style rankings). The WAQM accounts for several factors that simpler metrics miss. It uses relative failure rates (errors divided by tested cases) rather than raw error counts; weights violations by WCAG priority level (0.80 for Priority 1, 0.16 for Priority 2, 0.04 for Priority 3); excludes generic problems that cannot be automatically verified; and applies a hyperbolic transformation that widens the discriminative range at low error-to-tested-cases ratios, where most real-world pages cluster. Results are also broken down by WCAG 2.0's POUR principles (Perceivable, Operable, Understandable, Robust), giving both per-principle scores and an overall weighted average.
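
The ingredients described above (relative failure rates, priority weights, and a curve that spreads out low failure rates) can be illustrated with a minimal Python sketch. Only the three weights come from the summary. The `Tally` type, the `spread` curve, and its constant `k` are hypothetical stand-ins chosen for illustration; the paper's actual hyperbolic transformation and its per-principle breakdown are not reproduced here.

```python
from dataclasses import dataclass

# Priority weights as stated in the summary (WCAG 1.0 Priorities 1-3).
PRIORITY_WEIGHTS = {1: 0.80, 2: 0.16, 3: 0.04}


@dataclass
class Tally:
    errors: int  # failed test cases reported by the tool
    tested: int  # test cases the tool was able to check automatically


def spread(rate: float, k: float = 0.1) -> float:
    """Hypothetical hyperbola-shaped curve mapping a failure rate in [0, 1]
    to a penalty in [0, 1]. It rises steeply for small rates, so pages with
    few errors are pulled apart. This is NOT the paper's function, and the
    value of k is an arbitrary assumption."""
    return rate * (1 + k) / (rate + k)


def score(tallies: dict[int, Tally]) -> float:
    """Return a 0-100 score, where 100 means no detected failures."""
    penalty = 0.0
    for priority, weight in PRIORITY_WEIGHTS.items():
        tally = tallies.get(priority)
        if tally is None or tally.tested == 0:
            continue  # nothing testable at this priority: no penalty (assumption)
        penalty += weight * spread(tally.errors / tally.tested)
    return 100.0 * (1.0 - penalty)


# Same failure rate, different priority: the Priority 1 page scores lower.
page_p1 = score({1: Tally(errors=5, tested=100)})
page_p3 = score({3: Tally(errors=5, tested=100)})
```

Because the score is built from rates and not raw counts, a long page with many tested cases is not penalized merely for its size, which is the property the summary attributes to the WAQM.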

Key findings

The authors validated the metric by evaluating 1,363 web pages across 15 websites (10 universities, 5 newspapers) with two automated tools, EvalAccess and LIFT. Absolute WAQM scores turned out to be tool-dependent: for the same pages, EvalAccess produced a median score of 69 and LIFT a median of 28, a statistically significant difference reported as 36-38 points. The disparity arises because LIFT covers more checkpoints with more granular tests and therefore detects more potential errors. However, Spearman's rank correlation between the two tools was 0.719 (moderate to high), so the tools produce consistent relative rankings even though absolute values differ. At the website level, the correlation was 0.735. The authors also found that template-driven sites (particularly newspapers) showed very low variability in accessibility scores across pages: if the template is accessible, the whole site benefits; if it is flawed, every page suffers equally. The metric was also compared against expert manual rankings of Spanish university websites and showed a positive correlation. The conclusion is that WAQM is reliable for ranking and monitoring across tools, but absolute comparisons require using the same tool consistently.
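
The distinction between absolute scores and relative rankings can be made concrete with a short Python sketch of Spearman's rank correlation, computed as the Pearson correlation of the ranks. The two score lists are invented for illustration and are not data from the paper; the function names are likewise hypothetical.

```python
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up scores for five pages from two hypothetical tools. The second tool
# scores every page much lower, yet orders the pages almost identically.
tool_a = [69.0, 75.0, 52.0, 88.0, 61.0]
tool_b = [28.0, 35.0, 15.0, 30.0, 20.0]
rho = spearman(tool_a, tool_b)  # 0.9 for this toy data
```

In this toy data the two tools disagree by roughly 40 points on every page, but only two pages swap places in the ranking, so the correlation stays high. That is the same pattern the paper reports for EvalAccess and LIFT.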

Relevance

This paper tackled a problem that remains central to accessibility practice: how to move beyond binary conformance verdicts toward nuanced, comparable measurements. The WAQM approach anticipated features now common in modern accessibility monitoring platforms like Deque's axe Monitor, Siteimprove, and Level Access — all of which produce quantitative scores and trend data rather than simple pass/fail reports. The finding that metric results are tool-dependent is critically important for practitioners: organizations cannot meaningfully compare accessibility scores generated by different tools, even using the same metric formula. This remains true today and is a frequent source of confusion when organizations switch testing tools mid-project. The discovery that template-driven sites have uniform accessibility profiles reinforces a practice still relevant today: fixing accessibility at the template or component level is the most efficient strategy for large sites. The paper's priority weighting scheme also foreshadowed discussions around WCAG 2.x conformance about whether all success criteria within a level should carry equal weight.

Tags: accessibility metrics · automated testing · accessibility evaluation · quality assurance · web accessibility monitoring · WCAG conformance · accessibility measurement

Standards referenced: WCAG 1.0 · WCAG 2.0 · Section 508