Benchmarking Web Accessibility Evaluation Tools: Measuring the Harm of Sole Reliance on Automated Tests

Markel Vigo, Justin Brown, Vivienne Conway · 2013 · Proceedings of the 10th International Cross-Disciplinary Conference on Web Accessibility (W4A) · doi:10.1145/2461121.2461124

Summary

This paper provides rigorous empirical evidence on the limitations of automated web accessibility evaluation tools by benchmarking six state-of-the-art tools (AChecker, SortSite, Total Validator, TAW, Deque, and AMP) against expert manual evaluations of three Australian websites: the Prime Minister's site, Vision Australia, and Transperth public transport. Three expert evaluators independently assessed 9 web pages (3 per site) against WCAG 2.0 Level A and AA success criteria, then jointly established a ground truth by majority decision, in consultation with a legally blind expert user. The expert evaluation identified 650 accessibility violations across 26 success criteria. The six automated tools were then run against the same pages and their results compared on three metrics: coverage (the percentage of success criteria the tool reports on at all), completeness (the ratio of true positives to actual violations), and correctness (the ratio of true positives to all issues the tool reports, so lower correctness means more false positives). The most frequently violated success criteria were 1.3.1 Info and Relationships (135 violations), 1.4.3 Contrast (95), 1.1.1 Non-text Content (95), 1.4.4 Resize Text (83), and 2.4.4 Link Purpose (40).
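The three metrics can be made concrete with a small sketch. It assumes each issue is keyed by page, success criterion, and element, which is an illustrative choice and not the paper's data format. The function name, the sample issues, and the choice of denominator for coverage (a caller-supplied set of criteria in scope) are likewise hypothetical.

```python
from dataclasses import dataclass

# Assumption: an issue is identified by (page, success_criterion, element).
Issue = tuple[str, str, str]


@dataclass(frozen=True)
class ToolScores:
    coverage: float      # share of in-scope criteria the tool reports on at all
    completeness: float  # true positives / actual violations
    correctness: float   # true positives / all issues the tool reported


def score_tool(actual: set[Issue], reported: set[Issue],
               criteria_in_scope: set[str]) -> ToolScores:
    """Score one tool's report against an expert ground truth."""
    true_positives = actual & reported
    covered = {criterion for _, criterion, _ in reported} & criteria_in_scope
    return ToolScores(
        coverage=len(covered) / len(criteria_in_scope),
        # Guard against empty sets; 0.0 is an arbitrary convention here.
        completeness=len(true_positives) / len(actual) if actual else 0.0,
        correctness=len(true_positives) / len(reported) if reported else 0.0,
    )


# Made-up ground truth: four violations across four criteria.
actual = {
    ("home", "1.1.1", "img#logo"),
    ("home", "1.3.1", "table#fares"),
    ("home", "1.4.3", "p.intro"),
    ("contact", "2.4.4", "a.more"),
}
# Made-up tool report: two real violations plus one false positive.
reported = {
    ("home", "1.1.1", "img#logo"),
    ("home", "1.4.3", "p.intro"),
    ("home", "1.4.3", "h1"),
}
criteria_in_scope = {"1.1.1", "1.3.1", "1.4.3", "2.4.4"}

scores = score_tool(actual, reported, criteria_in_scope)
print(scores)
```

In this toy case the tool reports on two of four criteria (coverage 0.5), finds two of four violations (completeness 0.5), and two of its three reported issues are real (correctness about 0.67, a false positive rate of about 0.33).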

Key findings

Coverage was alarmingly narrow: the best tool (TAW) covered only 50% of WCAG 2.0 success criteria, while the worst performer (AMP) covered just 23%. Completeness ranged from 14% (AChecker) to 38% (TAW), meaning even the best tool caught fewer than 4 in 10 actual violations. Critically, the tools with the highest completeness scores, TAW (38%) and Total Validator (32%), exhibited lower correctness, with false positive rates of 29% and 34% respectively. In contrast, tools with high correctness such as SortSite (95%) and Deque (96%) had modest completeness (30% and 28%). The paper found that tools behave more similarly on less accessible websites and more divergently on more accessible ones, because tools are better designed to catch stereotypical, frequent accessibility issues (Perceivable principle violations accounted for 72% of all findings), while subtler violations on more accessible sites go undetected. The authors demonstrated that using the best-performing tool for each success criterion could raise completeness to 55%, a gain of 17 percentage points over the best single tool (TAW, 38%), suggesting that strategic multi-tool approaches can partially mitigate individual tool weaknesses. Tool similarity analysis (Cronbach's alpha = 0.96) showed that the tools are remarkably similar in what they catch, but all remain far from optimal performance.
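The per-criterion combination idea can be sketched as follows. This is a minimal illustration and not the authors' procedure: it assumes the selection rule is simply "most true positives for that criterion", ignores correctness, and uses made-up tool names and counts.

```python
def pick_best_tools(actual_counts: dict[str, int],
                    true_positives: dict[str, dict[str, int]]) -> dict[str, str]:
    """For each criterion, pick the tool with the most true positives.

    Ties are broken alphabetically by tool name (an arbitrary choice).
    """
    tools = sorted(true_positives)
    return {
        criterion: max(tools, key=lambda t: true_positives[t].get(criterion, 0))
        for criterion in actual_counts
    }


def combined_completeness(actual_counts: dict[str, int],
                          true_positives: dict[str, dict[str, int]]) -> float:
    """Completeness when each criterion is checked by its best tool."""
    found = sum(
        max(per_tool.get(criterion, 0) for per_tool in true_positives.values())
        for criterion in actual_counts
    )
    return found / sum(actual_counts.values())


# Hypothetical ground-truth violation counts per success criterion.
actual_counts = {"1.1.1": 10, "1.3.1": 20, "1.4.3": 10}
# Hypothetical true positives per tool and criterion.
true_positives = {
    "tool_a": {"1.1.1": 6, "1.3.1": 2},
    "tool_b": {"1.1.1": 3, "1.3.1": 8, "1.4.3": 4},
}

print(pick_best_tools(actual_counts, true_positives))
print(combined_completeness(actual_counts, true_positives))
```

With these invented numbers, tool_a alone finds 8 of 40 violations (0.20) and tool_b alone finds 15 of 40 (0.375), while the per-criterion combination finds 18 of 40 (0.45). A selection rule that also weighed correctness would trade some of that gain for fewer false positives.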

Relevance

This is one of the most cited and influential studies quantifying the limitations of automated accessibility testing, and its core findings remain relevant today. For practitioners, the headline numbers are stark: relying on a single automated tool means that at least half of WCAG success criteria will not even be analysed, and that at most about 4 in 10 actual violations will be detected, with a further risk of false positives. The practical implication is clear: automated testing is a useful starting point but must be complemented by expert evaluation and user testing. The study also points to a trade-off between completeness and correctness: in this benchmark, the tools that caught the most violations also generated the most false positives. For organisations selecting accessibility testing tools, the paper demonstrates the value of using multiple tools strategically, applying each where it performs best, rather than relying on a single tool. The finding that tools agree more on less accessible sites (where they catch common, stereotypical issues) but diverge and miss more on more accessible sites (where the remaining issues are subtler) is particularly important for organisations maturing their accessibility practices.

Tags: automated testing · accessibility evaluation · WCAG compliance · evaluation tools · benchmarking · false positives · conformance testing

Standards referenced: WCAG 2.0