Getting One Voice: Tuning Up Experts' Assessment in Measuring Accessibility

Silvia Mirri, Paola Salomoni, Ludovico A. Muratori, Matteo Battistelli · 2012 · Proceedings of the International Cross-Disciplinary Conference on Web Accessibility (W4A) · doi:10.1145/2207016.2207023

Summary

This paper addresses a fundamental challenge in web accessibility evaluation: how to reconcile the subjective assessments of multiple human experts into a single, reliable accessibility measurement. Automated testing tools produce binary pass/fail results for the errors they can detect, but many accessibility barriers require human judgment, and different experts often assign different severity ratings to the same barrier. The authors present cBIF (combined Barriers Impact Factor), an extension of their earlier BIF metric, which provides a mathematical model for synthesizing multiple expert evaluations and automated tool results into a unified accessibility score. The work builds on the VaMoLa system (Accessibility Validator and Monitor), developed jointly by the University of Bologna and the Emilia-Romagna Region in Italy, which combines an automated validator based on AChecker with the AMA (Accessibility Monitoring Application) for large-scale periodic monitoring of public administration websites. The cBIF formula maps each detected error to one of seven barrier categories (screen reader/blindness, screen magnifier/low vision, color blindness, input device independence/movement impairments, deafness, cognitive disabilities, and photosensitive epilepsy) and computes a weighted combination of automatic and manual evaluation scores, normalized by the number of checks performed. The metric also includes configurable parameters that let evaluators set the relative importance of manual versus automatic assessments for each barrier type.
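This summary does not reproduce the cBIF formula itself, so the following Python sketch only illustrates the general shape described above for a single barrier category: a weighted sum of automatic and manual scores, divided by the number of checks. The function name combined_barrier_score, the per-check tuple layout, and the default weights are assumptions made for illustration and should not be read as the authors' definition.

```python
from typing import Sequence, Tuple


def combined_barrier_score(
    checks: Sequence[Tuple[int, float]],
    auto_weight: float = 1.0,    # hypothetical default
    manual_weight: float = 2.0,  # hypothetical default
) -> float:
    """Weighted, normalized score for ONE barrier category.

    Each check is a pair (auto_fail, manual_severity):
      auto_fail        -- 1 if the automated tool flagged an error, else 0
      manual_severity  -- consolidated expert rating in [0, 1]
    This layout is an illustrative assumption, not the paper's notation.
    """
    if not checks:
        return 0.0  # nothing was checked, so nothing contributes

    total = 0.0
    for auto_fail, manual_severity in checks:
        if auto_fail not in (0, 1):
            raise ValueError("auto_fail must be 0 or 1")
        if not 0.0 <= manual_severity <= 1.0:
            raise ValueError("manual_severity must lie in [0, 1]")
        total += auto_weight * auto_fail + manual_weight * manual_severity

    # Normalize by the number of checks performed.
    return total / len(checks)


if __name__ == "__main__":
    # Two checks: one flagged by the tool and rated 0.5 by experts,
    # one that passed both evaluations.
    checks = [(1, 0.5), (0, 0.0)]
    print(combined_barrier_score(checks))  # 1.0
```

Keeping the two weights as explicit parameters mirrors the configurable manual-versus-automatic importance mentioned in the summary; in this sketch they would be chosen separately for each barrier category.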

Key findings

The cBIF metric introduces a statistical model in which each expert independently rates barriers on a continuous [0,1] scale, without seeing the other experts' ratings; the mean and variance of these ratings are then computed to derive a consolidated manual evaluation score. The variance serves as a reliability indicator: low variance (high agreement among experts) strengthens confidence in the result, while high variance signals disagreement that requires further investigation. In a preliminary experiment, five experts evaluated 10 Italian public administration websites against WCAG 2.0 success criterion 1.1.1 (non-text content), with the manual evaluation weight parameter set to 2 and the automatic weight to 1, reflecting the assumption that human judgment about barrier severity is more meaningful than binary automated detection. The experiment revealed notable disagreement among experts on certain image evaluations, particularly around edge cases such as overly verbose alt text on decorative images versus missing alt text on functional images used as links. The paper treats this inter-rater variability as both a challenge and a valuable data point for understanding which barriers are genuinely ambiguous.
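As a minimal sketch of the aggregation step described above, the following Python function computes the mean and variance of a set of independent expert ratings for one barrier. The use of population variance, the name consolidate_ratings, and the review_threshold value are assumptions introduced here for illustration; the summary does not state which variance estimator or cut-off the authors use.

```python
from statistics import fmean, pvariance
from typing import NamedTuple, Sequence


class Consolidated(NamedTuple):
    mean: float         # consolidated manual evaluation score
    variance: float     # spread of the ratings, used as a reliability signal
    needs_review: bool  # True when the experts disagree noticeably


def consolidate_ratings(
    ratings: Sequence[float],
    review_threshold: float = 0.05,  # hypothetical cut-off, not from the paper
) -> Consolidated:
    """Combine independent expert ratings in [0, 1] for a single barrier."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 0.0 <= r <= 1.0 for r in ratings):
        raise ValueError("ratings must lie in [0, 1]")

    mean = fmean(ratings)
    # Population variance is an assumption; a sample variance would also
    # be a reasonable choice for a small panel of experts.
    variance = pvariance(ratings)
    return Consolidated(mean, variance, variance > review_threshold)


if __name__ == "__main__":
    # Five experts rating the same image; the values are made up.
    result = consolidate_ratings([0.2, 0.4, 0.6, 0.8, 1.0])
    print(result)
```

For ratings bounded in [0,1] the population variance cannot exceed 0.25, which gives a natural scale for picking a disagreement threshold.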

Relevance

This research tackles a problem that remains highly relevant to accessibility practice: the gap between what automated tools can detect and what actually constitutes a barrier for users. Most organizations rely heavily on automated scanning, which catches only an estimated 30-40% of WCAG issues, yet there is no widely adopted standard for how to aggregate and weight manual expert findings. The cBIF approach offers a structured framework for organizations conducting accessibility audits with multiple evaluators, providing a principled way to handle disagreement: the variance of the ratings is reported alongside their mean rather than being hidden behind a single averaged score. The barrier-to-disability mapping (seven categories tied to specific disabilities and assistive technologies) is a practical contribution that connects abstract WCAG success criteria to real user impact. The work's main limitation is scale: the preliminary experiment covered only one success criterion, with five evaluators and 10 websites, and the question of how many experts are needed for reliable results remains open.

Tags: accessibility metrics · accessibility evaluation · manual evaluation · automated testing · expert assessment · WCAG compliance

Standards referenced: WCAG 2.0