Measuring and comparing the reliability of the structured walkthrough evaluation method with novices and experts

Christopher Bailey, Elaine Pearson, Voula Gkatzidou · 2014 · Proceedings of the 11th Web for All Conference (W4A) · doi:10.1145/2596695.2596696

Summary

This paper investigates the reliability of the Structured Walkthrough evaluation method, a systematic approach to manual accessibility evaluation embedded in the Accessibility Evaluation Assistant (AEA) tool. The AEA contains 48 accessibility heuristics organised into five categories (Design, User, Structural, Technical, and Global checks), each with step-by-step instructions, a rationale explaining why the check matters, and video demonstrations. The Structured Walkthrough method adapts expert evaluation processes for novices by breaking each heuristic into components: the accessibility principle, a summary, the affected user groups, the barrier description, testing instructions, and verification steps. The study compared evaluations from 28 final-year undergraduate computing students (novices) with those from 6 experienced accessibility practitioners at AbilityNet, each with over 10 years of experience conducting WCAG audits. Both groups evaluated the same two websites (Fitness First and Pure Gym) using the same 15 checks, drawn proportionally from all five AEA categories and covering areas such as image text alternatives, colour contrast, keyboard navigation, skip links, headings, form labels, and code validation.
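The component breakdown described above can be pictured as a simple record type. The following is a minimal sketch only: the class name, field names, and example text are hypothetical illustrations, not the AEA's actual data model or wording. Only the five category names and the list of components come from the summary.

```python
from dataclasses import dataclass

# The five AEA check categories named in the summary.
CATEGORIES = ("Design", "User", "Structural", "Technical", "Global")


@dataclass(frozen=True)
class Heuristic:
    """One check, split into the components listed in the summary.

    Field names are hypothetical; the AEA's real schema is not given.
    """
    category: str
    principle: str
    summary: str
    affected_user_groups: tuple
    barrier: str
    testing_instructions: tuple
    verification_steps: tuple

    def __post_init__(self):
        # Reject categories other than the five the AEA is said to use.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")


# Placeholder content, invented for illustration (not AEA wording).
example = Heuristic(
    category="Structural",
    principle="Pages should provide a way to bypass repeated content.",
    summary="Check for a skip navigation link.",
    affected_user_groups=("keyboard users", "screen reader users"),
    barrier="Repeated navigation must be tabbed through on every page.",
    testing_instructions=("Load the page.", "Press Tab once."),
    verification_steps=("Confirm a skip link receives focus.",),
)
```

Keeping the instructions and verification steps as ordered sequences reflects the step-by-step nature of the method; any real implementation would likely also need fields for the video demonstrations mentioned above.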

Key findings

Expert evaluations achieved 76% overall reliability compared to 65% for novices, an 11-percentage-point gap that is meaningful but modest given that the novices had no prior accessibility evaluation experience. For novices, 14 of the 15 checks exceeded 50% reliability, the threshold treated as the minimum acceptable level, and 8 checks (53%) reached 60% or higher. For experts, all 15 checks exceeded 50% reliability, with 7 (47%) reaching 80% or above. Novices outperformed experts on three checks: Keyboard Navigation, Skip Navigation Link, and Form Labels. Novice validity was lower, however, at 48% overall, dragged down by three checks where novices misunderstood the evaluation criteria: Images of Text (14% validity), Link Names (0%), and Skip Navigation Link (1%). Expert qualitative feedback praised the method as "simple to understand and well structured" and "much simpler than WCAG 2.0 and more directed," though experts noted it could not replace a full WCAG audit. The main expert criticisms were the subjectivity of the met/part met/not met decision-making process and insufficient guidance on resolving detected barriers.
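To make the reliability percentages concrete, here is a minimal sketch of one common way to score agreement on a check: the share of evaluators who chose the most frequent rating. This is an assumption made for illustration. The summary does not state the formula the authors used, and the function names and ratings below are invented, not study data.

```python
from collections import Counter


def modal_agreement(ratings):
    """Share of evaluators who chose the most common rating for one check.

    Assumed agreement measure; not necessarily the paper's formula.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    # most_common(1) returns [(rating, count)] for the top rating.
    _, top_count = Counter(ratings).most_common(1)[0]
    return top_count / len(ratings)


def overall_reliability(ratings_by_check):
    """Unweighted mean of per-check modal agreement."""
    scores = [modal_agreement(r) for r in ratings_by_check.values()]
    return sum(scores) / len(scores)


# Invented ratings on the met / part met / not met scale.
example = {
    "Skip Navigation Link": ["met", "met", "met", "not met"],
    "Form Labels": ["part met", "met", "part met", "part met", "not met"],
}

for check, ratings in example.items():
    print(f"{check}: {modal_agreement(ratings):.0%}")
print(f"Overall: {overall_reliability(example):.1%}")
```

Under this measure a check can score highly even when the majority rating is wrong, which is one way a check such as Skip Navigation Link could combine high novice reliability with very low validity.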

Relevance

This research directly addresses one of the most persistent challenges in accessibility practice: the evaluator effect, where different auditors find different problems on the same page. The finding that a structured method can bring novice reliability to within 11 percentage points of expert performance is significant for organisations that lack dedicated accessibility specialists. The Structured Walkthrough method offers a practical model for scaling accessibility evaluation: it gives non-specialists enough scaffolding to produce reasonably reliable results, which is useful for preliminary assessments or triage before expert review. The expert feedback that the tool works well for initial screening but cannot replace comprehensive WCAG audits provides a realistic framing for how such tools should be positioned in organisational accessibility workflows. The low novice validity on checks such as Images of Text and Link Names also suggests that some checks require deeper expertise that structured methods alone cannot fully compensate for.

Tags: accessibility testing · web accessibility evaluation · WCAG compliance · evaluator effect · accessibility education · heuristic evaluation

Standards referenced: WCAG 2.0