On the Testability of WCAG 2.0 for Beginners
Fernando Alonso, José Luis Fuertes, Ángel Lucas González, Loïc Martínez · 2010 · Proceedings of the 2010 International Cross Disciplinary Conference on Web Accessibility (W4A) · doi:10.1145/1805986.1806000
Summary
This paper investigates whether WCAG 2.0 success criteria are truly "reliably human testable" — one of the standard's key design goals — when the evaluators are beginners rather than experts. The W3C defines reliably human testable as meaning that at least 80% of knowledgeable evaluators would agree on the finding. The authors conducted an experiment during an intensive one-week web accessibility course held at the Technical University of Madrid as part of the ATHENS European exchange program in March 2009. Seventeen students with no prior accessibility training evaluated the same web page (the university's English-language homepage) against all 25 Level-A success criteria of WCAG 2.0. Students could rate each criterion as pass, fail, partial, not applicable, or unknown. Two expert instructors independently evaluated the same page to establish the "correct" values, merging partial and fail into a single fail category for stricter conformity assessment. The experiment examined both whether students agreed with each other and whether they arrived at the correct result, analyzing the factors that led to unreliable evaluations.
Key findings
Only 8 of the 25 Level-A success criteria (32%) met the 80% agreement threshold for reliable human testability by beginners. Even at a relaxed 64% threshold, 13 success criteria still could not be considered reliably evaluated. The instructors themselves initially agreed on only 13 of 25 criteria, which highlights how difficult the then-new WCAG 2.0 was to apply even for experts. The authors identified three root causes of unreliable evaluations: comprehension problems (difficulty understanding WCAG 2.0 language and concepts such as "video-only content" or "sequence affecting meaning," affecting 7 criteria), knowledge gaps (insufficient technical knowledge to perform the evaluation, affecting 4 criteria), and effort issues (students not spending enough time or not being thorough enough, affecting 7 criteria). Several criteria were consistently problematic in both this study and a comparable study by Brajnik: SC 2.2.1 (timing adjustable), SC 2.2.2 (pause, stop, hide), SC 2.4.1 (bypass blocks), SC 2.4.4 (link purpose in context), and SC 3.3.2 (labels or instructions). Instructors consistently found more accessibility failures than students, suggesting that beginners tend toward more lenient evaluations.
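To make the agreement thresholds concrete, the following is a minimal sketch of how per-criterion agreement could be computed. It assumes agreement is measured as the share of evaluators who chose the most common rating, and that "partial" is folded into "fail" before counting; the summary does not state the paper's exact procedure, so both choices are assumptions. The function names and the sample ratings are hypothetical and are not data from the study.

```python
from collections import Counter

# Assumption: "partial" is folded into "fail", mirroring the stricter merge
# described for the instructors' reference values.
MERGE = {"partial": "fail"}


def agreement(ratings):
    """Return (most common rating, share of evaluators who chose it)."""
    merged = [MERGE.get(r, r) for r in ratings]
    rating, count = Counter(merged).most_common(1)[0]
    return rating, count / len(merged)


def reliably_testable(ratings, threshold=0.80):
    """True if the most common rating reaches the agreement threshold."""
    _, share = agreement(ratings)
    return share >= threshold


# Hypothetical ratings from 17 evaluators for one criterion (not from the paper).
example = ["fail"] * 9 + ["partial"] * 3 + ["pass"] * 4 + ["unknown"]
rating, share = agreement(example)
print(rating, round(share, 2))            # fail 0.71
print(reliably_testable(example, 0.80))   # False
print(reliably_testable(example, 0.64))   # True
```

Under these assumptions, and if all 17 ratings are counted, a criterion needs at least 14 matching ratings to reach 80% (14/17 ≈ 82%, whereas 13/17 ≈ 76%) and at least 11 to reach 64% (11/17 ≈ 65%).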
Relevance
This study has direct implications for how organizations train accessibility evaluators and how much confidence they can place in evaluations performed by less experienced staff. The finding that only 32% of Level-A criteria were reliably testable by beginners challenges the assumption that WCAG 2.0's improved testability language is sufficient on its own — training, practice, and support materials are essential complements. The three-category analysis of failure reasons (comprehension, knowledge, effort) provides a practical framework for designing accessibility training curricula: courses need to emphasize WCAG terminology and concepts, build technical evaluation skills, and develop thorough evaluation habits. For organizations building accessibility programs, this research reinforces that automated tools alone are insufficient and that human evaluators need significant training before their manual assessments can be considered reliable. The consistently problematic success criteria identified across studies can help trainers focus attention on the areas where beginners are most likely to make errors.
Tags: WCAG evaluation · accessibility education · testability · inter-rater reliability · manual testing · accessibility training
Standards referenced: WCAG 2.0 · WCAG 1.0