Accessibility in AI-Assisted Web Development

Peya Mowar · 2024 · Proceedings of the 21st International Web for All Conference (W4A) · doi:10.1145/3677846.3679054

Summary

This extended abstract from a CMU doctoral student examines whether AI code generation tools, specifically GitHub Copilot, help or hinder web accessibility when developers use them. The research is motivated by a persistent problem: despite decades of accessibility research, 96.3% of the top one million web pages still contain WCAG 2 failures according to WebAIM, a figure that has improved by only 1.5 percentage points over four years. Prior work attributes this to limited accessibility awareness among developers, the complexity of accessibility guidelines, and insufficient tooling for translating standards into practical code. Mowar's thesis asks whether code generation models, already integrated into developer workflows as IDE plugins, could serve as a bridge to better accessibility practices or whether they risk introducing new barriers.

The empirical study evaluates GitHub Copilot on six real-world open-source websites spanning business, education, and entertainment categories. For each site, recently resolved UI enhancement issues were selected, the code was reverted to its state before the fix, and Copilot was prompted with the original issue description and file context. Copilot's suggestions were then compared against the developer's actual code changes, with qualitative analysis determining whether the AI mitigated or introduced WCAG failures.

Key findings

The results reveal a "double-edged sword" dynamic. On the opportunity side, Copilot consistently generated placeholder alt attributes on img tags, nudging developers toward providing alternative text, though the placeholder content was usually empty or irrelevant. Copilot was also able to address specific accessibility requirements when explicitly prompted or when they overlapped with functional requirements, such as fixing button contrast issues (though not to the exact contrast ratios WCAG specifies). On the challenge side, Copilot's accessibility output depended heavily on the existing accessibility level of the codebase: it produced inaccessible elements for already-inaccessible sites and attempted to mimic accessible patterns on compliant sites, though not always successfully. Critically, Copilot never exceeded the accessibility standards set by the human developers. The study also found instances of Copilot "hallucinating" inaccessible elements, for example adding unlabelled YouTube video links for research papers, links that were never provided as assets. This contextual learning behaviour mirrors how human developers operate: both tend to reproduce the accessibility patterns (or lack thereof) they encounter in existing code.
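
Two of the findings above, placeholder alt attributes and contrast fixes that miss the WCAG ratio, are the kind of thing a small script can flag for human review. The following is a minimal sketch and not the study's method (its analysis was qualitative). The contrast formula follows the WCAG 2 definition of relative luminance; the class name AltTextAudit and the helper names are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser


def _linear(channel: int) -> float:
    """Convert one 8-bit sRGB channel to linear light (WCAG 2 definition)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a colour written as '#rrggbb'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio: 1.0 for identical colours, 21.0 for black on white."""
    l1, l2 = relative_luminance(foreground), relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


class AltTextAudit(HTMLParser):
    """Hypothetical helper: collect <img> tags with a missing or empty alt."""

    def __init__(self) -> None:
        super().__init__()
        self.flagged: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src", "?")
        if "alt" not in attributes:
            self.flagged.append(f"{src}: alt missing")
        elif not (attributes["alt"] or "").strip():
            # Empty alt is valid for purely decorative images,
            # so this is a prompt for review, not a definite failure.
            self.flagged.append(f"{src}: alt empty")


if __name__ == "__main__":
    # WCAG 2 level AA asks for at least 4.5:1 for normal-size text.
    ratio = contrast_ratio("#777777", "#ffffff")
    print(f"{ratio:.2f}", "passes AA" if ratio >= 4.5 else "fails AA")

    audit = AltTextAudit()
    audit.feed('<img src="logo.png"><img src="hero.png" alt="">')
    print(audit.flagged)
```

A check like this only catches mechanical problems. Whether an alt text is relevant to its image, the gap the study observed in Copilot's placeholders, still needs human judgment.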

Relevance

This research addresses a timely and important question as AI coding assistants become ubiquitous in professional development workflows. The finding that code generation models reproduce the accessibility quality of their context has significant implications: if most training data comes from the roughly 96% of web pages that fail WCAG, these models will systematically perpetuate inaccessibility at scale. For accessibility practitioners, the paper highlights three urgent needs: advocating for accessibility benchmarks in LLM code evaluation, which currently focuses almost exclusively on functional correctness; pushing for accessibility-aware training data and fine-tuning; and educating developers that AI assistants cannot be relied upon for accessibility compliance. The paper is an early-stage thesis proposal with a limited dataset, but it opens an important research direction. Planned future work includes task-based studies with professional developers to measure how Copilot usage affects WCAG compliance in practice.

Tags: AI code generation · web accessibility · developer practices · GitHub Copilot · automated testing · WCAG compliance

Standards referenced: WCAG 2.0