Accessibility Metatesting: Comparing Nine Testing Tools
Jonathan Robert Pool · 2023 · Proceedings of the 20th International Web for All Conference (W4A '23) · doi:10.1145/3587281.3587282
Summary
This short paper presents a systematic empirical comparison of nine automated web accessibility testing tools that are suitable for integration into multi-tool testing regimes: free or nearly free, controllable via APIs or NPM packages, and comprehensive in scope. The nine tools are alfa (Siteimprove), axe-core (Deque), Continuum (Level Access), HTML CodeSniffer (Squiz), Equal Access (IBM), Nu Html Checker (W3C), QualWeb (Universidade de Lisboa), Tenon (Tenon.io), and WAVE (WebAIM). The W3C lists 167 accessibility testing tools, but integrating several tools is costly, which motivates the question: is there a small subset that makes the remaining tools redundant? To answer it, the author tested 121 web pages of interest to CVS Health, a judgmental sample drawn from 140 pages that included internally built, vendor-produced, external-supplier, and competitor pages of the kind an enterprise might monitor. Collectively the tools ran 1,327 tests, which the author classified into 245 distinct issues (defects and suspected defects). The approach was deliberately empirical: it analyzed what the tools actually reported rather than what they claim to test, and it took each reported result at face value instead of employing human testers to classify results as true or false positives.
Key findings
The tools differed dramatically in the number of issue instances they reported, from QualWeb (23,715 instances) to Continuum (3,089), roughly an eightfold difference. Sheer volume, however, did not equate to comprehensiveness. In every one of the 72 ordered pairs of tools (9 × 8), each tool found instances on some pages that the other missed, so no two-tool combination made either tool redundant. Adding a second tool increased the issue instance count by 13% to 767%, depending on the pairing, and increased the number of distinct issues discovered by 15 to 51. Each tool had distinct specializations: alfa excelled at font/line sizing and skip-to-content links; axe-core at color contrast and landmark placement; HTML CodeSniffer at heading levels and semantic element use; QualWeb at video alternatives and focus indication; WAVE at label clarity and link purpose. Critically, every tool reported at least 7 issues that no other tool found. For example, only WAVE found instances of pages missing landmarks, and only axe-core found instances of invisible form-control labels. The paper concludes that no single tool, and no smaller subset, can substitute for the full set: testing with all nine was substantially more informative than any smaller combination.
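The pairwise comparison described above can be sketched in a few lines of Python. This is a minimal illustration on invented toy data, not the author's analysis code: the `reports` mapping, its tool names, issue IDs, and counts are all hypothetical, and the rule for combining two tools (take the larger per-issue count) is a simplifying assumption rather than the paper's per-page method.

```python
from itertools import permutations

# Hypothetical toy data: tool -> {issue ID -> number of instances reported}.
reports = {
    "toolA": {"contrast": 40, "heading-level": 10, "skip-link": 5},
    "toolB": {"contrast": 25, "label": 30},
    "toolC": {"heading-level": 12, "landmark": 8},
}

def combined_instances(*tools):
    """Instance count for a combination of tools.

    Assumption: for each issue, count the largest number of instances
    that any tool in the combination reported.
    """
    issues = set().union(*(reports[t] for t in tools))
    return sum(max(reports[t].get(i, 0) for t in tools) for i in issues)

def pair_gain(first, second):
    """Percent increase in instances, and number of newly found issues,
    when `second` is added to `first`."""
    base = sum(reports[first].values())
    both = combined_instances(first, second)
    pct_increase = 100 * (both - base) / base
    new_issues = len(set(reports[second]) - set(reports[first]))
    return pct_increase, new_issues

def unique_issues(tool):
    """Issues that only `tool` reported."""
    others = set().union(*(reports[t] for t in reports if t != tool))
    return set(reports[tool]) - others

# Ordered pairs: n tools give n * (n - 1) pairs, so nine tools give 72.
for first, second in permutations(reports, 2):
    pct, new = pair_gain(first, second)
    print(f"{first} + {second}: +{pct:.0f}% instances, +{new} issues")

for tool in reports:
    print(tool, "alone found:", sorted(unique_issues(tool)))
```

A tool would be redundant in this sketch only if adding it to some other tool produced a zero gain on both measures, and the paper reports that this never happened for any of its 72 ordered pairs.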
Relevance
This research has immediate practical implications for organizations building accessibility testing programs. The finding that every tool contributes discoveries that no other tool makes is a strong argument against relying on a single testing tool, a common practice in organizations that adopt only axe or WAVE. For enterprise accessibility teams, the paper provides a data-driven case for investing in multi-tool integration while acknowledging the real costs: different invocation methods, reporting formats, severity classifications, and issue-location approaches must all be harmonized. The tool specialization list is directly useful for teams that need to prioritize: if an organization cares most about color contrast, axe-core is strongest; for heading structure, HTML CodeSniffer excels. One limitation is that the sample consisted only of pages of interest to CVS Health, so tool performance may differ on other types of web content. In addition, Tenon became unavailable to new subscribers after the research, which illustrates the volatility of the tool landscape.
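The harmonization cost mentioned above can be made concrete with a small sketch. Everything here is hypothetical: `tool_x` and `tool_y`, their raw result shapes, the rule IDs, and the severity scale are invented for illustration and do not reproduce the output format of any of the nine tools or the author's own integration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    """One issue instance in a shared, tool-independent shape."""
    tool: str
    issue: str      # shared issue ID after classification
    severity: int   # common scale: 0 (lowest) to 3 (highest)
    location: str   # selector, XPath, or line reference, as text

# Hypothetical mapping from tool-specific rule IDs to shared issue IDs.
ISSUE_MAP = {
    ("tool_x", "rule-17"): "contrast",
    ("tool_y", "C_CONTRAST"): "contrast",
}

# Hypothetical severity vocabulary used by tool_x.
SEVERITY_X = {"minor": 1, "serious": 2, "critical": 3}

def from_tool_x(raw: dict) -> Finding:
    # tool_x (hypothetical) reports a named impact and a CSS selector.
    return Finding("tool_x", ISSUE_MAP[("tool_x", raw["rule"])],
                   SEVERITY_X[raw["impact"]], raw["selector"])

def from_tool_y(raw: dict) -> Finding:
    # tool_y (hypothetical) reports a certainty flag and a line number.
    return Finding("tool_y", ISSUE_MAP[("tool_y", raw["code"])],
                   3 if raw["certain"] else 1, f"line {raw['line']}")

findings = [
    from_tool_x({"rule": "rule-17", "impact": "serious", "selector": "#nav a"}),
    from_tool_y({"code": "C_CONTRAST", "certain": False, "line": 42}),
]

# Group by shared issue ID to see which tools reported each issue.
by_issue: dict[str, list[str]] = {}
for f in findings:
    by_issue.setdefault(f.issue, []).append(f.tool)
print(by_issue)
```

One adapter per tool keeps the tool-specific quirks in a single place, so the comparison logic only ever sees `Finding` records; the price is that the issue map and severity scale must be maintained by hand as tools change.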
Tags: automated accessibility testing · web accessibility · accessibility testing · accessibility evaluation · WCAG compliance · developer tools · accessibility violations · tools
Standards referenced: WCAG