Benchmarking PDF Accessibility Evaluation: A Dataset and Framework for Assessing Automated and LLM-Based Approaches for Accessibility Testing

Anukriti Kumar, Tanushree Padath, Lucy Lu Wang · 2025 · ASSETS 2025: 27th International ACM SIGACCESS Conference on Computers and Accessibility · doi:10.1145/3663547.3746380

Summary

This paper addresses a critical gap in PDF accessibility evaluation by introducing the first expert-validated benchmark dataset and a standardized framework for measuring how well different tools and approaches evaluate PDF accessibility. Although PDF is the dominant format for scholarly communication, less than 3.2% of scholarly PDFs published between 2014 and 2023 meet key accessibility criteria, and over 75% fail to satisfy even a single requirement. The researchers constructed a benchmark of 125 scholarly PDFs systematically modified across seven accessibility criteria derived from WCAG 2.2 and PDF/UA standards: alternative text quality, logical reading order, semantic tagging, table structure, functional hyperlinks, color contrast, and font readability. Each document was labeled using a four-category framework (Passed, Failed, Not Present, Cannot Tell) aligned with W3C evaluation methodology. The corpus was built by selecting 35 representative papers from a larger collection of 20,000 scholarly PDFs, then manually modifying each to create variants representing each evaluation label. All labels were validated by an accessibility specialist with over five years of experience. Using this benchmark, the researchers evaluated five LLMs (GPT-4-Turbo, GPT-4o-Vision, Claude 3.5 Sonnet, Gemini 1.5 Pro, and Llama 3.2) in a zero-shot prompting configuration and qualitatively compared the results with five leading automated checkers: Adobe Acrobat Pro, PAC 2024, PAVE, axesPDF, and CommonLook PDF Validator.
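To make the labeling scheme concrete, the following is a minimal sketch of how benchmark items and per-label accuracy could be represented. It is not the authors' code: the `Item` record, the criterion identifiers, the `accuracy_by_label` helper, and the sample data are all hypothetical, introduced only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    """The four evaluation categories named in the summary."""
    PASSED = "Passed"
    FAILED = "Failed"
    NOT_PRESENT = "Not Present"
    CANNOT_TELL = "Cannot Tell"


# The seven criteria from the summary; these identifier spellings are assumed.
CRITERIA = (
    "alt_text_quality",
    "logical_reading_order",
    "semantic_tagging",
    "table_structure",
    "functional_hyperlinks",
    "color_contrast",
    "font_readability",
)


@dataclass(frozen=True)
class Item:
    doc_id: str      # hypothetical document identifier
    criterion: str   # one of CRITERIA
    gold: Label      # expert-validated label


def accuracy_by_label(items, predictions):
    """Return {gold label: fraction of items with that label predicted correctly}.

    `predictions` maps (doc_id, criterion) to a Label; a missing prediction
    counts as incorrect.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for item in items:
        if item.criterion not in CRITERIA:
            raise ValueError(f"unknown criterion: {item.criterion}")
        totals[item.gold] += 1
        if predictions.get((item.doc_id, item.criterion)) == item.gold:
            hits[item.gold] += 1
    return {label: hits[label] / totals[label] for label in totals}


# Made-up data, not taken from the benchmark.
items = [
    Item("doc-a", "color_contrast", Label.PASSED),
    Item("doc-b", "color_contrast", Label.FAILED),
    Item("doc-c", "alt_text_quality", Label.CANNOT_TELL),
    Item("doc-d", "alt_text_quality", Label.CANNOT_TELL),
]
predictions = {
    ("doc-a", "color_contrast"): Label.PASSED,
    ("doc-b", "color_contrast"): Label.FAILED,
    ("doc-c", "alt_text_quality"): Label.FAILED,
    ("doc-d", "alt_text_quality"): Label.CANNOT_TELL,
}

for label, acc in accuracy_by_label(items, predictions).items():
    print(label.value, acc)
```

Breaking accuracy down by gold label, rather than reporting a single overall number, is what would expose a model that handles Passed and Failed well but struggles with Not Present and Cannot Tell.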

Key findings

GPT-4-Turbo achieved the highest overall accuracy (0.85), followed by GPT-4o-Vision (0.81), Gemini 1.5 Pro (0.75), Claude 3.5 Sonnet (0.74), and Llama 3.2 (0.42). All models performed substantially better on Passed and Failed labels than on Not Present and Cannot Tell labels, revealing a fundamental limitation in current LLMs' ability to recognize when evaluation is not applicable or when information is insufficient. Alternative text quality was the hardest criterion across all models: Claude 3.5 Sonnet hallucinated alt text descriptions in 32% of Cannot Tell cases, fabricating text that did not exist and then evaluating it. Different models showed complementary strengths. Claude 3.5 Sonnet excelled at structural evaluation (perfect accuracy on logical reading order, 0.90 on semantic tagging), while GPT-4o-Vision dominated visual criteria (perfect accuracy on color contrast). The qualitative comparison with automated checkers revealed that rule-based tools excel at technical verification (detecting missing tags, non-embedded fonts, absent alt attributes) but cannot evaluate semantic quality, whereas LLMs can assess whether alt text actually conveys an image's meaning or whether reading order logically follows content flow. The researchers propose a three-tiered hybrid evaluation framework: Tier 1 uses automated checkers for structural and syntactic verification; Tier 2 uses LLMs for semantic and contextual assessment; Tier 3 reserves human experts for resolving conflicts and validating high-stakes documents.
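The three-tiered framework can be read as a routing rule. The sketch below is one possible interpretation, not the authors' implementation. The `evaluate` function, the `Result` record, and the stub callables are hypothetical, and the escalation rule (send to a human when the document is high-stakes, when the LLM answers Cannot Tell, or when the LLM reports Passed on something the checker flagged as Failed) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Verdict = str  # "Passed", "Failed", "Not Present", or "Cannot Tell"


@dataclass
class Result:
    verdict: Verdict
    decided_by: str  # "checker", "llm", or "human"


def evaluate(
    document: str,
    criterion: str,
    checker: Callable[[str, str], Optional[Verdict]],
    llm: Callable[[str, str], Verdict],
    human: Callable[[str, str], Verdict],
    high_stakes: bool = False,
) -> Result:
    # Tier 1: rule-based structural/syntactic check (None = no opinion).
    tier1 = checker(document, criterion)
    # Tier 2: LLM assessment of semantic and contextual quality.
    tier2 = llm(document, criterion)

    # Assumed escalation rule: a structural failure that the LLM calls a
    # pass is treated as a conflict (it may indicate a hallucinated pass).
    conflict = tier1 == "Failed" and tier2 == "Passed"

    # Tier 3: human review for conflicts, uncertainty, and high-stakes files.
    if high_stakes or conflict or tier2 == "Cannot Tell":
        return Result(human(document, criterion), "human")
    if tier1 == "Failed":
        return Result("Failed", "checker")
    return Result(tier2, "llm")


# Stub tiers for demonstration only; real tools would replace these.
def stub_checker(doc: str, criterion: str) -> Optional[Verdict]:
    return "Failed" if "no-tags" in doc else "Passed"


def stub_llm(doc: str, criterion: str) -> Verdict:
    return "Cannot Tell" if "scanned" in doc else "Passed"


def stub_human(doc: str, criterion: str) -> Verdict:
    return "Failed"


clean = evaluate("paper.pdf", "semantic_tagging", stub_checker, stub_llm, stub_human)
untagged = evaluate("paper-no-tags.pdf", "semantic_tagging", stub_checker, stub_llm, stub_human)
print(clean.decided_by, untagged.decided_by)
```

In this sketch the LLM always runs; a cost-sensitive variant could skip Tier 2 whenever Tier 1 already reports a structural failure, at the price of never detecting checker/LLM disagreements.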

Relevance

This research is highly relevant to accessibility practitioners who evaluate document accessibility at scale. The finding that automated checkers detect only 25-30% of accessibility issues, and that no single approach (automated, LLM-based, or manual) can comprehensively evaluate all criteria, validates what practitioners have long observed: accessibility evaluation requires multiple complementary methods. The benchmark dataset (publicly available on GitHub) provides the first standardized way to compare PDF accessibility tools, filling a major gap in the field. The four-category evaluation framework (Passed, Failed, Not Present, Cannot Tell) offers a more nuanced alternative to simplistic pass/fail assessments and should be adopted more broadly. The three-tiered hybrid approach is immediately actionable for organizations processing large volumes of documents: automated screening can triage documents, LLM evaluation can catch semantic issues at scale, and human review can focus on edge cases and high-stakes content. The hallucination findings are a critical caution for anyone deploying LLMs in accessibility workflows: models may generate convincing but fabricated evaluations, particularly when information is missing.

Tags: PDF accessibility · automated testing · large language models · WCAG · PDF/UA · benchmark dataset · document accessibility · alternative text · reading order · semantic tagging · accessibility evaluation

Standards referenced: WCAG 2.2 · PDF/UA (ISO 14289-1) · Section 508 · European Accessibility Act · Matterhorn Protocol