Accessible PDFs: Applying Artificial Intelligence for Automated Remediation of STEM PDFs

Felix M. Schmitt-Koopmann, Elaine M. Huang, Alireza Darvishy · 2022 · Proceedings of the 24th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2022) · doi:10.1145/3517428.3550407

Summary

This Ph.D. research paper presents a plan to use artificial intelligence to automate the remediation of PDF documents from STEM fields, addressing one of the most significant barriers to information access for people with visual impairments. The Portable Document Format remains the dominant format for scientific publishing, yet the vast majority of PDFs lack the structural tags that screen readers need to present content meaningfully. The paper identifies three core reasons why most PDFs remain inaccessible: authors are unaware of PDF accessibility requirements; remediation demands expensive specialized software and expert knowledge of tagging guidelines; and manual tagging is extremely time-consuming, even for accessibility experts working on simple documents. STEM documents compound these challenges because they combine complex multi-column layouts, tables, and figures with mathematical formulae, which have no standardized accessible representation in the PDF specification.

The researchers propose building on the existing open-source PAVE (PDF Accessibility Validation Engine) tool to create PAVE 2.0, integrating deep learning models for three document analysis tasks: page object detection, to identify logical content elements such as headers, paragraphs, tables, and formulae; mathematical formula recognition, to convert formula images into markup languages such as MathML; and reading order detection, to ensure screen readers present content in the correct logical sequence. The project also introduces FormulaNet, a new labeled dataset for training page object detection models that includes formula annotations, filling a gap in existing training data.
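To make the reading order task concrete, the following is a minimal sketch of a column-aware ordering of detected page objects. It is a simple geometric heuristic, not the deep learning approach the paper proposes, and every name in it (`Box`, `reading_order`, the 60% full-width threshold, the coordinates) is an assumption made for illustration. It assumes a two-column page with the y axis growing downward.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """A detected page object (hypothetical structure, not the PAVE data model)."""
    name: str    # identifier used only in this example
    label: str   # e.g. "header", "paragraph", "formula"
    x0: float
    y0: float    # top edge; y grows downward (assumption)
    x1: float
    y1: float


def reading_order(boxes: List[Box], page_width: float) -> List[Box]:
    """Order boxes for a two-column page.

    Boxes wider than 60% of the page (an arbitrary threshold) are treated as
    full-width and split the page into horizontal bands. Inside each band the
    left column is read top to bottom, then the right column.
    """
    def is_full_width(b: Box) -> bool:
        return (b.x1 - b.x0) > 0.6 * page_width

    full = sorted((b for b in boxes if is_full_width(b)), key=lambda b: b.y0)
    narrow = [b for b in boxes if not is_full_width(b)]

    ordered: List[Box] = []
    band_top = float("-inf")
    for divider in full + [None]:
        band_bottom = divider.y0 if divider else float("inf")
        band = [b for b in narrow if band_top <= b.y0 < band_bottom]
        centre = page_width / 2
        left = sorted((b for b in band if (b.x0 + b.x1) / 2 < centre),
                      key=lambda b: b.y0)
        right = sorted((b for b in band if (b.x0 + b.x1) / 2 >= centre),
                       key=lambda b: b.y0)
        ordered += left + right
        if divider:
            ordered.append(divider)
            band_top = divider.y0
    return ordered


# Made-up layout: a full-width title above two columns.
page = [
    Box("right-1", "paragraph", 310, 100, 550, 200),
    Box("left-2", "paragraph", 50, 320, 290, 500),
    Box("title", "header", 50, 40, 550, 80),
    Box("right-formula", "formula", 310, 220, 550, 260),
    Box("left-1", "paragraph", 50, 100, 290, 300),
]
print([b.name for b in reading_order(page, page_width=600)])
```

Sorting the same boxes by their top edge alone would interleave the two columns, which is the kind of column jumping that makes a tagged PDF hard to follow with a screen reader. A heuristic like this breaks down on irregular layouts, which is one motivation for learned models.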

Key findings

The paper finds that existing automated tagging tools, particularly Adobe Acrobat Pro's auto-tagger, fail significantly on complex STEM documents, producing incorrect tags that can cause the reading order to jump between columns, miss headers, or erroneously add alternative text to images. Rule-based OCR systems such as InftyReader achieved a BLEU score of only 67% on formula recognition across 60,000 scientific papers, which means substantial manual correction is still needed. The research proposes a novel approach to mathematical formula accessibility in PDFs: a web-based math viewer concept inspired by the tree-structure navigation of the JAWS screen reader. Rather than presenting formulas as lengthy linear alternative text (for example, the quadratic formula requires 23 words in MathSpeak notation), the math viewer would let users navigate formulas hierarchically, stepping into and out of sub-expressions. The researchers developed FormulaNet, a new page object detection dataset that includes formula labels, and are building their detection model on the Generalized Focal Loss V2 architecture. Four planned user studies will evaluate both the automated tagging quality and the math viewer concept with screen reader users.
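The idea of stepping into and out of sub-expressions can be illustrated with a small tree cursor over the quadratic formula. This is only a sketch of the general technique, not the paper's math viewer or the JAWS implementation. The class names, the way the formula is split into nodes, and the spoken labels are all assumptions chosen for the example, and the labels are not MathSpeak output.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One sub-expression with a short spoken label (illustrative wording)."""
    spoken: str
    children: List["Node"] = field(default_factory=list)


class FormulaCursor:
    """Tracks the user's position in the formula tree."""

    def __init__(self, root: Node) -> None:
        self.path = [root]          # nodes from the root to the current node
        self.index: List[int] = []  # sibling position at each level below root

    @property
    def current(self) -> Node:
        return self.path[-1]

    def step_in(self) -> str:
        """Move to the first child, if the current node has any."""
        if self.current.children:
            self.path.append(self.current.children[0])
            self.index.append(0)
        return self.current.spoken

    def step_out(self) -> str:
        """Move back to the parent, unless already at the root."""
        if len(self.path) > 1:
            self.path.pop()
            self.index.pop()
        return self.current.spoken

    def next_sibling(self) -> str:
        """Move to the next sibling, staying put at the last one."""
        if len(self.path) > 1:
            siblings = self.path[-2].children
            i = self.index[-1]
            if i + 1 < len(siblings):
                self.index[-1] = i + 1
                self.path[-1] = siblings[i + 1]
        return self.current.spoken


def quadratic_formula() -> Node:
    """x = (-b ± sqrt(b^2 - 4ac)) / (2a), split into navigable parts."""
    radicand = Node("b squared minus 4 a c",
                    [Node("b squared"), Node("minus"), Node("4 a c")])
    square_root = Node("square root", [radicand])
    numerator = Node("numerator",
                     [Node("negative b"), Node("plus or minus"), square_root])
    denominator = Node("denominator", [Node("2 a")])
    fraction = Node("fraction", [numerator, denominator])
    return Node("equation", [Node("x"), Node("equals"), fraction])


cursor = FormulaCursor(quadratic_formula())
for move in (cursor.step_in, cursor.next_sibling, cursor.next_sibling,
             cursor.step_in, cursor.next_sibling, cursor.step_in,
             cursor.step_out):
    print(move())
```

Each move announces only a short label, so the listener hears the overall shape of the formula first and chooses which part to explore, instead of holding a long linear description in memory.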

Relevance

This research directly addresses a critical equity gap in STEM education and careers for people with visual impairments. The inability to independently read scientific papers and textbooks creates a substantial barrier to participation in STEM fields. For accessibility practitioners, the paper highlights that PDF accessibility remains far behind web accessibility in both awareness and tooling maturity. The proposed automation pipeline could dramatically reduce the cost and expertise required to produce accessible PDFs, potentially shifting remediation from a specialist task to something any author could accomplish. The math viewer concept for navigating complex formulas represents a promising alternative to the cognitive overload of linear alternative text descriptions. Organizations producing STEM publications should note that current automated tools are inadequate for complex documents and that manual expert remediation remains necessary until AI-based solutions mature.

Tags: PDF accessibility · document remediation · artificial intelligence · STEM accessibility · mathematical formulae · screen readers · document analysis · deep learning · assistive technology

Standards referenced: PDF/UA (ISO 14289-1:2014) · ISO 32000-1:2008 (PDF 1.7) · WCAG 2.1 · Section 508 · European Accessibility Act · MathML