Does ChatGPT Generate Accessible Code? Investigating Accessibility Challenges in LLM-Generated Source Code
Wajdi Aljedaani, Abdulrahman Habib, Ahmed Aljohani, Marcelo Eler, Yunhe Feng · 2024 · Proceedings of the 21st International Web for All Conference (W4A) · doi:10.1145/3677846.3677854
Summary
This paper presents the first empirical evaluation of the accessibility of web code generated by ChatGPT (GPT-3.5), examining both how accessible the generated code is and how well the model can fix accessibility violations. The study involved 88 web developers who prompted ChatGPT with natural language descriptions to create websites with diverse characteristics and elements. The prompts deliberately did not request accessible code, so that the study could test whether accessibility is incorporated by default. The generated websites were evaluated using two established accessibility checker tools (AChecker and WAVE) against WCAG 2.2 guidelines. Developers then prompted ChatGPT to fix the identified violations. Additionally, the researchers selected 88 open-source web projects from GitHub (filtered for 50+ stars, active maintenance, and core web technologies) and had ChatGPT attempt to fix the accessibility violations found in those third-party codebases. AChecker detected 296 violations across the generated sites while WAVE found 572, categorized by WCAG principle, guideline, and success criterion. The study addresses four research questions covering accessibility conformance, violation types, self-repair capability, and ability to fix third-party code.
Key findings
The majority (84%) of ChatGPT-generated websites contained accessibility violations, with only 14 of 88 sites being violation-free. Issues were overwhelmingly concentrated under the Perceivable principle (83.6% of violations), followed by Operable (6.6%), Understandable (3.8%), and Robust (0.1%). The most common specific violations were italic text without resize options (277 instances), low color contrast (232), missing labels for form inputs (146 instances across Info and Relationships), and missing alt attributes on images (56). ChatGPT achieved a 70% success rate (806 of 1,153 violations) in fixing its own generated code and a 73% rate for third-party GitHub code. Performance varied dramatically by guideline: 100% fix rates for Non-text Content, Labels or Instructions, Page Title, Language of Page, and Parsing, but only 60% for Contrast, 40% for checkbox-related labeling issues, and 0% for file input labels and certain heading structure problems. The model showed inconsistency even within guidelines — fixing 53 of 59 "header following h1 is incorrect" issues but only 1 of 9 "header following h2 is incorrect" issues. Conformance levels stayed at Level A only; no ChatGPT-generated site reached AA compliance. The violation types mirrored those found in human-created websites per WebAIM's Million analysis, suggesting LLMs trained on existing web code perpetuate the same accessibility failures.
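To make two of the violation types above concrete (images without an alt attribute and form inputs without a label), the following is a minimal sketch using only Python's standard library. It is not how AChecker or WAVE work and it is not taken from the paper. The names A11yScan, scan, and SAMPLE are hypothetical, and the labeling rules are deliberately simplified: an input counts as labeled if it has aria-label or aria-labelledby, sits inside a label element, or is referenced by a label's for attribute.

```python
from html.parser import HTMLParser

# Simplifying assumption: these input types do not need a separate label.
UNLABELED_OK = {"hidden", "submit", "button", "reset", "image"}


class A11yScan(HTMLParser):
    """Hypothetical toy scanner; far less thorough than a real checker."""

    def __init__(self):
        super().__init__()
        self.images_missing_alt = 0
        self.label_targets = set()  # ids referenced by <label for="...">
        self.inputs = []            # (id or None, already_named) per input
        self.label_depth = 0        # > 0 while inside a <label> element

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "img":
            # alt="" is valid for decorative images; only absence is flagged.
            if "alt" not in a:
                self.images_missing_alt += 1
        elif tag == "label":
            self.label_depth += 1
            if a.get("for"):
                self.label_targets.add(a["for"])
        elif tag == "input":
            input_type = (a.get("type") or "text").lower()
            if input_type not in UNLABELED_OK:
                named = bool(
                    a.get("aria-label")
                    or a.get("aria-labelledby")
                    or self.label_depth
                )
                self.inputs.append((a.get("id"), named))

    def handle_endtag(self, tag):
        if tag == "label" and self.label_depth:
            self.label_depth -= 1


def scan(html):
    """Count images without alt and inputs without any label."""
    parser = A11yScan()
    parser.feed(html)
    parser.close()
    # Resolved after parsing so a <label for> may appear after its input.
    unlabeled = sum(
        1
        for input_id, named in parser.inputs
        if not named and input_id not in parser.label_targets
    )
    return {
        "images_missing_alt": parser.images_missing_alt,
        "unlabeled_inputs": unlabeled,
    }


SAMPLE = """
<img src="logo.png">
<img src="divider.png" alt="">
<label for="email">Email</label><input id="email" type="email">
<input type="text" name="q">
<input type="file">
<input type="submit" value="Send">
"""

if __name__ == "__main__":
    print(scan(SAMPLE))
```

In the sample, the first image, the search field, and the file input are flagged. The file input is included because file input labels are one of the cases the summary reports as never fixed.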
Relevance
This study delivers a critical warning for the growing practice of using LLMs for web development: ChatGPT does not generate accessible code by default, and developers should not assume that AI-generated code meets accessibility standards. The four key takeaways are immediately actionable: (1) without explicit accessibility prompting, ChatGPT produces inaccessible websites that mirror the same failures found across the web; (2) prompt engineering matters — developers must explicitly request accessible code and be knowledgeable about WCAG to guide the model; (3) accessibility should never be presumed when using LLMs; and (4) LLM training data and fine-tuning need to specifically incorporate accessibility examples. The 70% self-repair rate is promising but insufficient for compliance, leaving 30% of issues unresolved and requiring human expertise. For organizations adopting AI-assisted development, this research underscores the need for accessibility testing as a mandatory step in any LLM-augmented development workflow, not an optional afterthought. The study is limited to GPT-3.5's free version and web-only code, but the patterns likely extend to other LLMs trained on similar corpora.
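As one illustration of an automated check that could sit in such a workflow, the sketch below computes the WCAG contrast ratio between two colors and compares it with the 4.5:1 Level AA threshold for normal-size text. The relative-luminance and contrast-ratio formulas follow the WCAG definitions. The function names and the hex-string interface are assumptions made for this example, not part of the study or of any checker tool. Low contrast was among the most frequent violations reported above and among the less reliably fixed.

```python
def _channel(c8):
    """Linearize one 8-bit sRGB channel per the WCAG luminance definition."""
    c = c8 / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color):
    """Relative luminance of a color given as '#rrggbb'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _channel(r) + 0.7152 * _channel(g) + 0.0722 * _channel(b)


def contrast_ratio(foreground, background):
    """WCAG contrast ratio, from 1.0 (identical) to 21.0 (black on white)."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(foreground, background, large_text=False):
    """Level AA needs 4.5:1 for normal text and 3:1 for large text."""
    threshold = 3.0 if large_text else 4.5
    return contrast_ratio(foreground, background) >= threshold


if __name__ == "__main__":
    for fg in ("#000000", "#767676", "#777777"):
        ratio = contrast_ratio(fg, "#ffffff")
        print(fg, round(ratio, 2), passes_aa(fg, "#ffffff"))
```

A check like this covers only declared color pairs. It does not account for background images, opacity, or computed styles, which is part of why tool-based and manual testing remain necessary.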
Tags: web accessibility · large language models · ChatGPT · automated testing · WCAG · code generation · AI accessibility
Standards referenced: WCAG 2.2