Feedback-based evaluation tool for web accessibility

Daisuke Asai, Masahiro Watanabe, Yoko Asano · 2007 · Proceedings of the 9th International ACM SIGACCESS Conference on Computers and Accessibility (Assets '07) · doi:10.1145/1296843.1296883

Summary

This short Assets '07 demonstration paper from NTT Cyber Solutions Laboratories proposes a feedback-driven way to improve automated web-accessibility evaluation tools. The authors start from a familiar problem: automated checkers such as Bobby catch only a fraction of real accessibility problems, because their evaluation rules (the concrete, programmable translations of abstract WCAG checkpoints) are inevitably incomplete. A single WCAG 1.0 checkpoint such as 'provide a text equivalent for non-text elements' expands into many concrete rule variants (IMG missing alt, APPLET missing alt, and so on), and no rule author can anticipate every HTML pattern or new technology that appears in real-world content.

The proposed solution is essentially a crowdsourced learning loop wrapped around an evaluation tool. After an automated scan, users see each reported failure with the relevant source highlighted and are invited to tag the result as 'informative', 'not informative', or 'incorrect', optionally adding a free-text comment. These assessments feed back into the tool's rule set, where they prompt new rules, refinements of existing rules, or bug patches. The authors implemented the tool as a public web service, seeded it with 79 rules derived from WCAG 1.0, and ran a seven-month field test (November 2006 to June 2007) open to anyone who wanted to check their own pages.
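The loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the rule ID, the `Failure` and `Assessment` records, and the `rules_to_review` helper are hypothetical names chosen for illustration, and the single rule shown (IMG without an alt attribute) stands in for the 79 rules, which this summary does not enumerate.

```python
from collections import Counter
from dataclasses import dataclass
from html.parser import HTMLParser

# Hypothetical rule ID (assumption; the paper's rule names are not given here).
RULE_IMG_ALT = "img-missing-alt"


@dataclass
class Failure:
    rule_id: str
    line: int      # source line to highlight for the user
    snippet: str   # raw start tag that triggered the rule


@dataclass
class Assessment:
    rule_id: str
    verdict: str   # "informative", "not informative", or "incorrect"
    comment: str = ""


class ImgAltChecker(HTMLParser):
    """One concrete rule derived from the abstract 'text equivalent' checkpoint."""

    def __init__(self):
        super().__init__()
        self.failures = []

    def handle_starttag(self, tag, attrs):
        # Flag <img> tags that carry no alt attribute at all.
        if tag == "img" and "alt" not in dict(attrs):
            line, _ = self.getpos()
            self.failures.append(
                Failure(RULE_IMG_ALT, line, self.get_starttag_text())
            )


def scan(html):
    """Run the automated check and return the reported failures."""
    checker = ImgAltChecker()
    checker.feed(html)
    checker.close()
    return checker.failures


def rules_to_review(assessments, threshold=1):
    """Return rule IDs whose count of 'incorrect' verdicts meets the threshold."""
    counts = Counter(a.rule_id for a in assessments if a.verdict == "incorrect")
    return sorted(rule for rule, n in counts.items() if n >= threshold)
```

In this sketch a maintainer would run `scan` on a submitted page, show each `Failure` with its line and snippet, collect `Assessment` records from users, and then inspect the rules returned by `rules_to_review`. The threshold is an illustrative knob, not something the paper specifies.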

Key findings

Over the seven-month field test, the tool was used 10,097 times by 1,793 distinct users, 77% of whom had experience creating web content. Despite no monetary or points-based incentive, 58 of those users (about 3%) submitted 95 assessments, of which 59 included free-text comments. The reported distribution was 33 'informative', 3 'not informative', and 61 'incorrect'. In other words, users were most motivated to leave feedback when the tool was wrong about their own content, which the authors note turned irritation into useful data. Every submitted assessment was judged relevant to accessibility, and many contained concrete suggestions for refining evaluation rules. The authors translated the feedback into 46 concrete changes: 3 entirely new evaluation rules, 37 improvements to existing rules, and 6 bug patches. A representative example in the paper is an evaluation rule that incorrectly flagged links opening new windows even when the link text contained the phrase 'new window'. This false positive was spotted and described by a user in the field.
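The 'new window' refinement can be sketched as a before-and-after rule in Python. This is an illustration under stated assumptions, not the paper's rule: it assumes a new-window link is detected by `target="_blank"`, that the announcement is matched as a case-insensitive substring of the link text, and the class name and `accept_announced` flag are hypothetical.

```python
from html.parser import HTMLParser


class NewWindowLinkChecker(HTMLParser):
    """Flag links that open a new window (assumed here: target="_blank").

    accept_announced=False mimics the original, over-eager rule;
    accept_announced=True mimics the refined rule, which passes links
    whose text already says 'new window'.
    """

    def __init__(self, accept_announced=True):
        super().__init__()
        self.accept_announced = accept_announced
        self.flagged = []      # link texts reported as failures
        self._in_link = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("target") == "_blank":
            self._in_link = True
            self._text = []

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            text = "".join(self._text).strip()
            announced = "new window" in text.lower()
            # Report the link unless the refined rule accepts the announcement.
            if not (self.accept_announced and announced):
                self.flagged.append(text)
            self._in_link = False


def check(html, accept_announced=True):
    """Return the texts of links the rule reports as failures."""
    checker = NewWindowLinkChecker(accept_announced)
    checker.feed(html)
    checker.close()
    return checker.flagged
```

With `accept_announced=False` the sketch reports every `target="_blank"` link, including ones that already warn the user. With the default it reports only the unannounced ones, which is the kind of narrowing the user's feedback asked for.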

Relevance

This paper is an early, small-scale but prescient demonstration of crowd-sourced improvement of accessibility tooling. It anticipates the user-feedback loops that help maintain the rule sets of modern automated testing tools (axe, WAVE, Lighthouse) and the human-labelled error corpora used to train machine-learning systems. The core observation remains relevant for anyone building or tuning automated checkers today: users who encounter false positives in an accessibility scan are highly motivated to push back, and in this study their corrections were consistently on-topic and often usable. The paper also carries a quiet warning that automated tools ship with imperfect rules, and that the quality of those rules depends on a continuous correction pipeline, not a one-off design effort. Its limitations are substantial. The sample of engaged assessors was tiny (58 users and 95 assessments, against 1,793 users and 10,097 runs), the evaluation does not quantify any resulting gain in precision, and the WCAG 1.0 ruleset is long obsolete, having been superseded by WCAG 2.0 in 2008. The mechanism it describes is nonetheless durable and worth revisiting.

Tags: web accessibility · automated testing · accessibility evaluation · crowdsourcing · WCAG · accessibility tools · evaluation rules

Standards referenced: WCAG 1.0