Active Learning for Web Accessibility Evaluation

Mengni Zhang, Can Wang, Zhi Yu, Chao Shen, Jiajun Bu · 2017 · Proceedings of the 14th International Web for All Conference (W4A) · doi:10.1145/3058555.3058559

Summary

This paper introduces "active-prediction," a semi-supervised machine learning method that addresses a fundamental bottleneck in web accessibility evaluation: the prohibitive cost of evaluating every page of a large website. Current practice relies on sampling methods (ad hoc, uniform random, random walk, stratified) to select a representative subset of pages, but these approaches are inherently limited: undersampling introduces bias, oversampling is expensive, and homepage-only evaluation has been shown to represent whole-site accessibility poorly. The authors reframe accessibility evaluation as a prediction problem. Rather than sampling and reporting only on the sample, they use a small number of strategically selected pages as training data to predict the accessibility results of all remaining pages.

The method works in four steps: (1) use active learning via the Manifold Adaptive Experimental Design (MAED) algorithm to select the most informative pages from the website, (2) evaluate those selected pages using automated tools and human assessment, (3) train SVM classification models for each accessibility checkpoint using the evaluated pages as training data, and (4) predict accessibility results for all remaining pages. The key innovation is using active learning rather than random selection to choose the training pages. MAED exploits the local invariance structure of the page data (similar pages are likely to have similar accessibility results) to select pages that maximize prediction accuracy. Each page is represented by 45 features drawn from its HTML markup (title, button, area, table, alt, etc.).
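The four steps can be sketched in a few lines of Python. This is an illustration of the workflow only, not the authors' implementation: greedy farthest-point selection stands in for MAED, a 1-nearest-neighbour rule stands in for the per-checkpoint SVM, and the tag list, the toy pages, and the verdicts are all hypothetical. It handles a single checkpoint; the method as described trains one model per checkpoint.

```python
import math
from html.parser import HTMLParser

# Hypothetical subset of tags; the paper's full 45-feature list is not reproduced here.
TAGS = ["title", "button", "area", "table", "img", "input"]


class TagCounter(HTMLParser):
    """Counts how often each tag in TAGS opens on a page."""

    def __init__(self):
        super().__init__()
        self.counts = dict.fromkeys(TAGS, 0)

    def handle_starttag(self, tag, attrs):
        if tag in self.counts:
            self.counts[tag] += 1


def features(html):
    """Turn one page into a vector of tag counts."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return [parser.counts[t] for t in TAGS]


def select_pages(vectors, budget):
    """Step 1 stand-in (NOT MAED): greedy farthest-point selection."""
    budget = min(budget, len(vectors))
    chosen = [0]
    while len(chosen) < budget:
        # Pick the page farthest from its nearest already-chosen page.
        best = max(
            (i for i in range(len(vectors)) if i not in chosen),
            key=lambda i: min(math.dist(vectors[i], vectors[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen


def predict(vectors, labelled):
    """Steps 3-4 stand-in (NOT an SVM): copy the verdict of the nearest labelled page."""
    results = {}
    for i, vec in enumerate(vectors):
        if i in labelled:
            results[i] = labelled[i]
        else:
            nearest = min(labelled, key=lambda j: math.dist(vec, vectors[j]))
            results[i] = labelled[nearest]
    return results


# Toy site with four pages (hypothetical markup).
pages = [
    "<title>a</title><table></table>",
    "<title>b</title><table></table><table></table>",
    "<title>c</title><img><img><img>",
    "<title>d</title><img><img><img><img>",
]
vectors = [features(p) for p in pages]
chosen = select_pages(vectors, budget=2)

# Step 2 stand-in: hypothetical tool and human verdicts for one checkpoint (True = pass).
verdicts = {0: True, 1: True, 2: False, 3: False}
labelled = {i: verdicts[i] for i in chosen}

predicted = predict(vectors, labelled)
print(chosen, predicted)
```

Only the pages in `chosen` are ever looked up in `verdicts`, which mirrors the point of the method: evaluation effort is spent on the selected pages, and every other page receives a predicted result.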

Key findings

Experiments on 30 Chinese government websites (totaling approximately 50,000 pages, ranging from 626 to 12,991 pages per site) showed that active-prediction achieved high accuracy with only 1% of pages used as training data. Most individual website accuracies exceeded 0.90, with a mean above 0.95, so the remaining 99% of pages received predicted results at that level of accuracy without being evaluated directly. In the accessibility evaluation comparison, active-prediction consistently produced the lowest error rates among the five methods compared; the other four were uniform random sampling, active learning sampling (without prediction), stratified sampling, and random-prediction. Notably, stratified sampling, often considered superior to random sampling, performed worse than uniform random sampling on these datasets. The distribution of accessibility violations was highly skewed (problematic pages made up only a small portion of each site), which caused stratified methods to overestimate the share of problematic pages. Active-prediction overcame this by predicting results for all pages rather than extrapolating from a sample. The dataset came from the Chinese Government Website Accessibility Evaluation Campaign, which uses China's national web accessibility standard, YD/T1761-2012.
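The difference between extrapolating from a sample and predicting every page can be made concrete with a small calculation. The sketch below uses invented toy numbers, not figures from the paper, and assumes the quantity of interest is a site's violation rate for one checkpoint. The error measure used in the paper is not given in this summary, so the absolute difference from the true rate is used here as an assumption.

```python
def sample_estimate(sample_results):
    """Violation rate extrapolated from the sampled pages alone (True = violation)."""
    return sum(sample_results) / len(sample_results)


def prediction_estimate(evaluated_results, predicted_results):
    """Violation rate over the whole site: evaluated pages plus predicted pages."""
    combined = list(evaluated_results) + list(predicted_results)
    return sum(combined) / len(combined)


# Toy site (hypothetical): 1,000 pages, 20 of which violate the checkpoint.
true_rate = 20 / 1000

# A 10-page sample that happens to include one violating page.
sample = [True] + [False] * 9

# The same 10 pages used as training data, plus hypothetical predictions for the
# other 990 pages, of which the classifier flags 18 as violating.
predictions = [True] * 18 + [False] * 972

sampling_error = abs(sample_estimate(sample) - true_rate)
prediction_error = abs(prediction_estimate(sample, predictions) - true_rate)
print(f"sampling error:   {sampling_error:.3f}")
print(f"prediction error: {prediction_error:.3f}")
```

With a skewed distribution, a single violating page in a small sample moves the extrapolated rate a long way, while the prediction-based estimate is computed over every page and is limited by classifier accuracy instead of by sample composition.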

Relevance

This research offers a practical solution to one of the most persistent challenges in accessibility practice: how to evaluate large websites cost-effectively without sacrificing accuracy. For organizations managing sites with thousands or tens of thousands of pages, evaluating every page is infeasible, yet sample-based approaches can miss critical issues on rarely visited but important pages (such as CAPTCHA-only forms that trap screen reader users). The active-prediction approach provides a middle ground: invest in carefully evaluating a small, strategically selected subset, then use machine learning to predict the rest. This is particularly relevant for government compliance monitoring, where large-scale evaluation across hundreds of sites is needed. The finding that stratified sampling can increase bias when violations are highly skewed is a valuable caution for practitioners who assume stratified approaches are always superior. Limitations include the binary (pass/fail) checkpoint model, which loses nuance; the use of SVM classifiers only; and the reliance on HTML feature extraction, which may not capture all accessibility-relevant page characteristics.

Tags: accessibility evaluation · machine learning · active learning · web accessibility · automated testing · sampling methods · accessibility metrics · SVM · semi-supervised learning · WCAG compliance · China · large-scale evaluation

Standards referenced: WCAG 1.0 · WCAG 2.0 · WCAG-EM · YD/T1761-2012