Testability and Validity of WCAG 2.0: The Expertise Effect

Giorgio Brajnik, Yeliz Yesilada, Simon Harper · 2010 · Proceedings of the 12th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2010) · doi:10.1145/1878803.1878813

Summary

This paper investigates the testability and validity of WCAG 2.0 success criteria through an empirical study with 22 accessibility experts and 27 non-experts (university students with 14 hours of accessibility training). Participants evaluated all 61 WCAG 2.0 success criteria against four complex real-world web pages (Facebook, IMDB, Bloomberg, Scientific American). The W3C defines testability as achieving 80% agreement among knowledgeable human evaluators on whether a criterion passes or fails. The study measured two key properties: reliability (whether different evaluators produce the same results, measured by maximum agreement) and validity (whether evaluations are correct, measured by correctness, sensitivity, and F-measure, with "correct" defined as the majority expert opinion). The expert group comprised published researchers and professional consultants in accessibility, with significantly higher self-rated knowledge of accessibility (median 5/5) and WCAG 2.0 (median 4/5) compared to non-experts (median 2/5 for both). Experts completed evaluations in a mean of 115 minutes versus 370 minutes for non-experts. The study controlled for order effects through randomised success criteria lists and Latin Square page assignment.
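The reliability measure described above can be illustrated with a minimal sketch. It assumes that maximum agreement for one success criterion on one page is the share of evaluators who chose the most common rating, and that the criterion counts as testable when that share reaches 80%. The function names and the example ratings are hypothetical and are not taken from the paper's materials.

```python
from collections import Counter


def max_agreement(ratings):
    """Return the share of evaluators who gave the most common rating.

    `ratings` holds one label per evaluator for a single success
    criterion on a single page (assumed definition of maximum agreement).
    """
    if not ratings:
        raise ValueError("need at least one rating")
    most_common_count = Counter(ratings).most_common(1)[0][1]
    return most_common_count / len(ratings)


def is_testable(ratings, threshold=0.80):
    """Check the agreement share against the 80% threshold."""
    return max_agreement(ratings) >= threshold


# Hypothetical ratings from ten evaluators for one criterion on one page.
split_ratings = ["fail"] * 7 + ["pass"] * 3
clear_ratings = ["pass"] * 8 + ["fail"] * 2

print(max_agreement(split_ratings), is_testable(split_ratings))
print(max_agreement(clear_ratings), is_testable(clear_ratings))
```

Under this reading, seven of ten evaluators agreeing falls short of the threshold, while eight of ten just meets it.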

Key findings

Approximately 50% of WCAG 2.0 success criteria failed to meet the 80% agreement threshold even among experts. Only 19 of 61 success criteria (31%) consistently reached 80% agreement across both pages evaluated by experts. The overall mean maximum agreement for experts was 77%, close to but below the 80% threshold. For validity, experts achieved 80% correctness (20% false positives), 68% sensitivity (missing 32% of true problems), and an F-measure of 0.72. Non-experts performed significantly worse on all measures: agreement dropped by 6-10%, correctness fell to 58% (42% false positives), sensitivity dropped to 51% (missing 49% of true problems), and F-measure fell to 0.51. Even basic Level A criteria such as "1.4.1 Use of Color" and "2.2.1 Timing Adjustable" were not reliably testable, while easily identifiable criteria such as "3.1.1 Language of Page" and "2.1.1 Keyboard" consistently achieved high agreement. Experts took roughly one-third the time of non-experts and reported significantly higher confidence in applying WCAG 2.0. The study also notes that some invited experts refused to participate, arguing that certain WCAG 2.0 criteria are too subjective to evaluate reliably.
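The validity figures above can be read with the help of a small sketch. It assumes that correctness is the share of an evaluator's reported violations that the reference (majority expert opinion) also marks as violations, that sensitivity is the share of reference violations the evaluator found, and that the F-measure is their harmonic mean. The paper's exact formulas are not reproduced in this summary, so these definitions, the function name, and the example criterion sets are illustrative assumptions only.

```python
def validity_measures(reported, reference):
    """Compute (correctness, sensitivity, f_measure) for one evaluator.

    `reported`: set of criterion IDs the evaluator judged as violated.
    `reference`: set of criterion IDs violated according to the
    reference judgement (assumed here to be the majority expert opinion).
    """
    true_positives = len(reported & reference)
    # Share of reported violations that are real (1 - false positive share).
    correctness = true_positives / len(reported) if reported else 0.0
    # Share of real violations that were found (1 - missed share).
    sensitivity = true_positives / len(reference) if reference else 0.0
    total = correctness + sensitivity
    f_measure = 2 * correctness * sensitivity / total if total else 0.0
    return correctness, sensitivity, f_measure


# Hypothetical judgements for one page.
reference = {"1.1.1", "1.4.1", "2.4.4", "3.3.2"}
reported = {"1.1.1", "1.4.1", "2.2.1"}

correctness, sensitivity, f_measure = validity_measures(reported, reference)
print(f"correctness={correctness:.2f} sensitivity={sensitivity:.2f} F={f_measure:.2f}")
```

If the reported F-measures are means of per-evaluator values, the harmonic mean of the mean correctness and mean sensitivity need not reproduce them exactly.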

Relevance

This paper challenges a fundamental assumption underlying web accessibility compliance: that WCAG 2.0 conformance can be reliably determined through human inspection. The finding that even experts agree on only about half of the success criteria at the 80% threshold has profound practical implications. Organisations legally required to achieve WCAG 2.0 conformance face the reality that two independent expert audits may produce substantially different results, as illustrated by the real-world email to the W3C-WAI mailing list cited in the paper. For accessibility practitioners, several findings are directly actionable. First, expertise matters significantly (non-experts produced roughly double the false positives, 42% versus 20%), so investing in training and experienced evaluators has a measurable impact on evaluation quality. Second, even expert evaluations should be treated as probabilistic rather than deterministic, suggesting that multiple evaluators or complementary methods (automated testing, user testing) should be combined. Third, certain success criteria are inherently more subjective than others and require particular care. The study is also methodologically significant: it is one of the largest empirical investigations of WCAG 2.0 testability and provides all stimuli, materials, and datasets publicly for replication.

Tags: WCAG 2.0 · web accessibility · conformance review · evaluator effect · accessibility audit · guidelines · expertise · evaluation methodology

Standards referenced: WCAG 2.0 · Section 508