Validity and Reliability of Web Accessibility Guidelines

Giorgio Brajnik · 2009 · Proceedings of the 11th International ACM SIGACCESS Conference on Computers and Accessibility (Assets '09) · doi:10.1145/1639642.1639666

Summary

This paper presents an experiment measuring the validity and reliability of WCAG 1.0 and WCAG 2.0 checkpoints when applied by human evaluators. The study recruited 35 young web developers with some accessibility knowledge from a university course and asked them to evaluate two web pages against 21 preselected checkpoints from each guidelines set (42 checkpoints total). Validity is defined as the extent to which all and only true accessibility problems are identified, while reliability is the extent to which different evaluators of the same page reach the same results. The author frames these as key quality criteria for any accessibility evaluation method, noting that despite WCAG 2.0's explicit claim of being based on "testable" criteria, no scientific evidence had been produced to verify this. The experiment used a within-subjects design with randomized ordering to control for learning effects. Evaluators rated each checkpoint as fail, pass, or not applicable, and also scored the applicability difficulty (how hard it was to determine if the checkpoint applied) and evaluability difficulty (how hard it was to determine the outcome) on 0-to-4 scales. An independent expert judge provided gold-standard ratings for comparison.

Key findings

The results are sobering for accessibility evaluation practice. No checkpoint from either guidelines set achieved reliability (max-agreement) definitively higher than 80%, the threshold used in defining "reliably human testable." Max-agreement ranged from 36% to 94%; the lowest possible value on a 3-point scale is 33%, which is what an even split of evaluators across the three ratings would produce (essentially random). WCAG 1.0 was actually superior to WCAG 2.0 in both reliability and validity, with differences ranging from 1% to 13% in reliability and 6% to 20% in validity. This is a substantial finding given that WCAG 2.0 was specifically designed to be more testable. Global accuracy rates were low at 57% to 61%: roughly four in ten ratings did not match the expert judge's gold standard, and for some checkpoints evaluators did only slightly better than chance. The worst-performing checkpoints for reliability included "Sensory Characteristics," "Device Independent Interfaces," and "On Input." Checkpoints expected to be unambiguous, like "Clear Language" and "Image Equivalence," sometimes had surprisingly low reliability, while some subjective-seeming checkpoints performed better than expected. Applicability and evaluability difficulties showed a high correlation with each other but only a weak negative correlation with max-agreement, meaning perceived difficulty did not reliably predict actual reliability problems.
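The two measures reported above can be illustrated with a minimal sketch. It assumes that max-agreement is the share of evaluators who chose the most common rating for a checkpoint (an assumption consistent with the 33% floor on a 3-point scale, not a definition quoted from the paper) and that accuracy is the share of evaluators whose rating matches the expert judge's. The function names and the sample ratings are hypothetical and are not data from the study.

```python
from collections import Counter


def max_agreement(ratings):
    """Share of evaluators who chose the most common rating (assumed definition)."""
    if not ratings:
        raise ValueError("need at least one rating")
    counts = Counter(ratings)
    return max(counts.values()) / len(ratings)


def accuracy(ratings, gold):
    """Share of evaluators whose rating matches the judge's gold-standard rating."""
    if not ratings:
        raise ValueError("need at least one rating")
    return sum(r == gold for r in ratings) / len(ratings)


# Hypothetical ratings from six evaluators for one checkpoint on one page.
sample = ["fail", "fail", "pass", "fail", "na", "pass"]

print(f"max-agreement: {max_agreement(sample):.2f}")     # 3 of 6 chose "fail"
print(f"accuracy:      {accuracy(sample, 'pass'):.2f}")  # 2 of 6 match the judge
```

In this made-up case half of the evaluators agree with each other, yet only a third agree with the judge, which is why reliability and validity have to be measured separately: a group can converge on the wrong answer.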

Relevance

This study has profound implications for how organizations approach WCAG conformance. If trained evaluators reach the correct outcome only 57% to 61% of the time, and no checkpoint definitively clears the 80% agreement threshold, then single-evaluator conformance audits, the industry norm, are inherently unreliable. The finding that WCAG 2.0 did not improve on WCAG 1.0 in testability challenges the narrative that accompanied its release. For practitioners, this research argues strongly for using multiple evaluators, combining automated and manual testing, and treating conformance claims with appropriate uncertainty. The study also highlights that the perceived difficulty of evaluating a checkpoint does not predict its actual reliability, which means evaluators may not be aware of when they are most likely to make errors. While the sample of junior evaluators limits generalizability to expert auditors, the author argues this population represents an important segment of the workforce actually conducting accessibility evaluations. This work remains highly cited in discussions about the limitations of guidelines-based accessibility evaluation.

Tags: accessibility evaluation · WCAG · conformance testing · inter-rater reliability · web accessibility · guidelines

Standards referenced: WCAG 1.0 · WCAG 2.0