Measuring the Semantic Accessibility Gap in LLM-Generated Web UIs

Tommaso Calo, Alexandra-Elena Gurita, Luigi De Russis · 2026 · Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA '26) · doi:10.1145/3772363.3799364

Summary

Calo, Gurita, and De Russis investigate a blind spot of mainstream automated accessibility tools: scanners like axe-core reliably catch syntactic violations such as missing alt attributes or unlabelled form fields, but they cannot tell whether the values that are present are actually meaningful. An image with alt='image' passes every automated check yet conveys nothing to a screen-reader user. The authors call this the semantic accessibility gap and ask how prevalent it is in interfaces generated by today's code-producing LLMs. They generate 300 web UIs using three commercial models (Claude Sonnet-4, Gemini-2.5-flash, GPT-4o) from a diverse set of real-website prompts and categorise semantic violations into six fault types: non-descriptive alt text, vague link purpose, generic button labels, heading/content mismatches, inaccurate or generic ARIA labels, and generic form labels. To measure these at scale, they propose an LLM-as-judge methodology in which an LLM evaluates the generated HTML against a structured prompt describing each fault type. Judge accuracy is validated through controlled fault injection (deliberately inserting 721 known semantic faults into UI variants and measuring detection recall) and triangulated through a preliminary human annotation study in which four HCI researchers evaluated 324 UI components across 9 UIs on a custom component-level annotation platform.
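The fault-injection validation described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Fault` record, the `inject_generic_alt` helper, and the `toy_judge` stand-in (a regex in place of a real LLM call) are all hypothetical names introduced here, and only one of the six fault types is injected. The point is the shape of the measurement: degrade a known element, ask the judge what it sees, and score recall as the fraction of injected faults that come back.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    fault_type: str  # one of the six fault types, e.g. "alt_text"
    element_id: str  # id of the element that was degraded


def inject_generic_alt(html: str, element_id: str) -> tuple[str, Fault]:
    """Overwrite the alt text of <img id="..."> with the generic value 'image'.

    Simplification (assumption): double-quoted attributes, id before alt.
    """
    pattern = re.compile(
        r'(<img\b[^>]*\bid="' + re.escape(element_id) + r'"[^>]*\balt=")[^"]*(")'
    )
    mutated, count = pattern.subn(r"\g<1>image\g<2>", html)
    if count != 1:
        raise ValueError(f"expected exactly one <img id={element_id!r}> with alt")
    return mutated, Fault("alt_text", element_id)


def recall(injected: set[Fault], reported: set[Fault]) -> float:
    """Fraction of injected faults that the judge reported."""
    if not injected:
        raise ValueError("no injected faults to score against")
    return len(injected & reported) / len(injected)


def toy_judge(html: str) -> set[Fault]:
    """Stand-in for the LLM judge: flags any img whose alt is exactly 'image'."""
    ids = re.findall(r'<img\b[^>]*\bid="([^"]+)"[^>]*\balt="image"', html)
    return {Fault("alt_text", i) for i in ids}


page = '<img id="hero" src="h.png" alt="Volunteers planting trees in a city park">'
mutated, fault = inject_generic_alt(page, "hero")
print(mutated)
print(recall({fault}, toy_judge(mutated)))
```

Because the injected faults are recorded as ground truth, recall can be computed without any human labelling; precision, by contrast, would still need someone to review what the judge flags beyond the injected set.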

Key findings

Across the 300 original interfaces, judges identified 541 semantic accessibility violations (roughly 1.8 per UI). The distribution was dominated by interactive elements that require contextual knowledge unavailable from the DOM alone: generic button labels (27%), vague link text (26%), poor alt text (20%), ARIA issues (14%), and generic form labels (12%). Violations per UI varied sharply by generator: Claude averaged 0.3/UI, Gemini 1.5/UI, GPT 1.8/UI. The LLM-as-judge approach achieved 80–92% recall against injected faults, with alt-text violations easiest to detect (~95%) and heading/content mismatches hardest (~51–68%), reflecting the greater cross-element reasoning required. No single judge dominated: Gemini offered the highest recall but the greatest intra-rater variability; GPT was most consistent but less sensitive; Claude balanced both. In the human study, inter-annotator agreement was fair (Cohen's kappa ≈ 0.24) and LLM-human agreement was comparable to human-human agreement, while LLMs aligned more strongly with a curated ground truth (kappa 0.69–0.94 vs 0.23 for humans), suggesting either that injected faults are artificially clean or that humans apply more context-sensitive judgements. The findings argue that LLM judges can extend accessibility evaluation into the semantic dimension and could serve as reward signals for constitutional AI or RLHF training pipelines that produce more accessibility-aware code generators.
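For readers unfamiliar with the agreement statistic quoted above, here is a minimal sketch of Cohen's kappa for two raters, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each rater's label frequencies. The labels below are invented for illustration and are not data from the paper; how the authors aggregated kappa across four annotators is not stated in this summary.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (p_o - p_e) / (1 - p_e)


# Invented example: ten components, each marked "fault" or "ok" by two raters.
a = ["fault", "fault", "fault", "ok", "ok", "ok", "ok", "ok", "ok", "ok"]
b = ["fault", "fault", "ok", "fault", "ok", "ok", "ok", "ok", "ok", "ok"]
print(round(cohens_kappa(a, b), 2))  # 0.52
```

In this invented example the raters agree on 8 of 10 items, yet kappa is only about 0.52, because most of that agreement is on the frequent "ok" label and would be expected by chance. The same effect helps explain how annotators can agree on many components and still land at a kappa near 0.24.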

Relevance

This paper matters for anyone whose codebase is increasingly authored or scaffolded by LLMs. Accessibility practice has long relied on two layers of testing: automated scans for syntactic issues, and manual review for meaning. LLM code generators collapse that division by producing syntactically compliant but semantically empty accessibility metadata at scale, which creates the illusion of compliance while delivering no real benefit to assistive-technology users. The paper's enumeration of six semantic fault types is directly usable as an audit checklist and could be incorporated into review tooling or lint rules. The LLM-as-judge validation methodology also matters for accessibility research more broadly: controlled fault injection with recall measurement is a cleaner evaluation strategy than ad-hoc spot checks, and could be applied to validate other AI-powered accessibility assessors (automatic alt-text generators, heading-structure recommenders, etc.). Limitations: the evaluation covers isolated HTML files, not stateful web applications with navigation, auth, or dynamic content; the human study is small (N=4 annotators, 324 components); and the fault-injection patterns may be easier for LLMs to detect than naturally occurring subtle violations. The study also does not involve assistive-technology users, so the lived-experience grounding called for by the authors remains future work.
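As a rough illustration of how part of the checklist might become a lint rule, the sketch below flags three of the six fault types (generic alt text, vague link text, generic button labels) using Python's standard html.parser. The word lists and the `SemanticLint` and `lint` names are assumptions made for this example, not taken from the paper. Keyword matching of this kind only catches the most obvious cases; context-dependent faults such as heading/content mismatches are what the paper's LLM judge is meant to handle.

```python
from html.parser import HTMLParser

# Hypothetical word lists; a real rule set would need tuning per project.
GENERIC_ALT = {"image", "photo", "picture", "img", "icon"}
GENERIC_LINK = {"click here", "here", "read more", "learn more", "more", "link"}
GENERIC_BUTTON = {"button", "click", "click here", "go"}


class SemanticLint(HTMLParser):
    """Collects (fault_type, offending_text) pairs for obviously generic labels."""

    def __init__(self) -> None:
        super().__init__()
        self.findings: list[tuple[str, str]] = []
        self._open: str | None = None  # "a" or "button" while collecting its text
        self._text: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # A present but generic alt passes a syntactic check yet says nothing.
            alt = (dict(attrs).get("alt") or "").strip().lower()
            if alt in GENERIC_ALT:
                self.findings.append(("alt_text", alt))
        elif tag in ("a", "button"):
            self._open, self._text = tag, []

    def handle_data(self, data):
        if self._open:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._open:
            label = " ".join("".join(self._text).split()).lower()
            if tag == "a" and label in GENERIC_LINK:
                self.findings.append(("link_purpose", label))
            elif tag == "button" and label in GENERIC_BUTTON:
                self.findings.append(("button_label", label))
            self._open = None


def lint(html: str) -> list[tuple[str, str]]:
    parser = SemanticLint()
    parser.feed(html)
    parser.close()
    return parser.findings


sample = """
<img src="team.jpg" alt="image">
<a href="/pricing">Read more</a>
<button>Compare plans</button>
"""
print(lint(sample))  # [('alt_text', 'image'), ('link_purpose', 'read more')]
```

An empty alt attribute is deliberately not flagged, since alt="" is the conventional way to mark a decorative image.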

Tags: web accessibility · large language models · LLM code generation · semantic accessibility · WCAG · LLM-as-judge · automated testing · alt text · ARIA · AI and accessibility

Standards referenced: WCAG 2.1