
Towards Generating Web-Accessible STEM Documents from PDF

Volker Sorge, Mark Lee, Sandy Wilkinson · 2020 · Proceedings of the 17th International Web for All Conference (W4A) · doi:10.1145/3371300.3383351

Summary

This paper presents a fully automated, web-based pipeline for converting PDF documents into accessible HTML and ePub formats, with a particular focus on STEM content containing mathematical formulas and tables. The authors address a fundamental problem in document accessibility: the vast majority of scientific and educational documents are distributed as PDFs, which are inherently inaccessible to screen reader users, especially when they contain mathematical notation. The system combines several existing tools into a coherent workflow. The pdf2htmlEX converter handles initial content extraction from PDF, preserving positional information. MaxTract, a formula recognition engine, identifies mathematical expressions and converts them into MathML, which assistive technologies can then speak. The pipeline uses DBSCAN clustering to analyze page layout, distinguishing between text blocks, formulas, tables, and decorative elements such as watermarks or page borders. A key innovation is the system's ability to handle formulas that PDF generators split across multiple text elements, a common problem in which a single equation may be stored as dozens of separate character fragments. The tool reassembles these fragments into complete mathematical expressions. For tables, the system employs two detection strategies: one for tables with visible border lines (using graphical element analysis) and another for borderless tables (using spatial clustering of text positions). The entire conversion process runs as a web service: users upload PDFs and receive accessible HTML or ePub output, with no manual intervention or accessibility expertise required.
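To make the layout-analysis step more concrete, here is a minimal sketch of DBSCAN applied to the centre points of extracted text fragments. It is an illustration only, not the authors' implementation: the function name `dbscan`, the `eps` and `min_pts` values, and the sample coordinates are all assumptions, and the paper's actual features and parameters may differ. Dense groups of fragments become clusters (candidate text blocks), while isolated fragments such as a stray watermark glyph are labelled as noise.

```python
from math import hypot


def dbscan(points, eps, min_pts):
    """Label each 2D point with a cluster id (0, 1, ...) or -1 for noise.

    A point is a "core" point if at least min_pts points (itself included)
    lie within distance eps of it. Clusters grow outward from core points.
    """
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point, keep expanding
                queue.extend(reachable)
    return labels


# Hypothetical fragment centres (in page units): two text blocks and one
# isolated decorative glyph far away from both.
fragment_centers = [
    (10, 10), (14, 10), (18, 10), (12, 14),   # first block
    (100, 200), (104, 200), (108, 200),       # second block
    (300, 50),                                # isolated element
]
labels = dbscan(fragment_centers, eps=6, min_pts=2)
print(labels)
```

One reason a density-based method suits this task is that the number of regions on a page does not have to be known in advance, and sparse outliers are separated from the main content without extra rules.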

Key findings

Testing on over 700 pages of undergraduate mathematics textbooks and more than 150 pages of research papers demonstrated that the system can handle real-world STEM documents with reasonable accuracy. Mathematical formula recognition achieved high fidelity when formulas were cleanly separated from surrounding text, though formulas embedded inline with text presented more challenges. The DBSCAN clustering approach proved effective at identifying page regions and distinguishing foreground content from background elements such as watermarks, headers, and decorative borders. Table detection worked well for bordered tables but was less reliable for borderless layouts, where spatial heuristics had to infer cell boundaries. The authors identified several remaining challenges: multi-column layouts can confuse the reading order algorithm, complex nested tables are difficult to reconstruct, and some PDF generators produce unusual character encodings that resist extraction. The web-based deployment model was validated as practical: users could convert documents without installing software or understanding accessibility requirements. Processing time was acceptable for documents up to approximately 50 pages, though larger documents required batch processing.
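The difficulty with borderless tables can be illustrated with a small sketch that infers rows and columns purely from text positions. This is a simplified gap-based heuristic chosen for brevity, not the clustering procedure described in the paper; the helper names `group_1d` and `rebuild_table`, the gap thresholds, and the sample fragments are all hypothetical.

```python
def group_1d(values, max_gap):
    """Map each coordinate to a group index, starting a new group
    whenever the gap to the previous sorted value exceeds max_gap."""
    groups = {}
    index = -1
    previous = None
    for value in sorted(set(values)):
        if previous is None or value - previous > max_gap:
            index += 1
        groups[value] = index
        previous = value
    return groups


def rebuild_table(fragments, x_gap, y_gap):
    """Arrange (x, y, text) fragments into a row-major grid of strings."""
    if not fragments:
        return []
    cols = group_1d([x for x, _, _ in fragments], x_gap)
    rows = group_1d([y for _, y, _ in fragments], y_gap)
    n_rows = max(rows.values()) + 1
    n_cols = max(cols.values()) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for x, y, text in fragments:
        r, c = rows[y], cols[x]
        # Fragments that land in the same cell are joined with a space.
        grid[r][c] = (grid[r][c] + " " + text).strip()
    return grid


# Hypothetical left edges of text fragments in a 3x2 borderless table,
# with slight horizontal jitter as often seen in extracted PDF text.
fragments = [
    (50, 100, "n"), (120, 100, "n^2"),
    (51, 115, "2"), (121, 115, "4"),
    (50, 130, "3"), (119, 130, "9"),
]
table = rebuild_table(fragments, x_gap=10, y_gap=5)
print(table)
```

A heuristic like this depends entirely on consistent alignment and spacing. Cells that span columns, wrap onto a second line, or sit closer together than the chosen threshold would be merged or misassigned, which is consistent with the lower reliability reported for borderless layouts.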

Relevance

This research addresses one of the most persistent barriers in digital accessibility: making STEM content available to people who use screen readers or other assistive technologies. Mathematical notation in PDFs is essentially invisible to assistive technology, locking blind and low-vision students out of textbooks, research papers, and educational materials. An automated conversion tool removes the burden from individual authors or institutions, who often lack the expertise or resources to create accessible STEM documents manually. For accessibility practitioners, this work demonstrates that full automation of PDF remediation is feasible for structured content types, though not yet perfect. Organizations producing large volumes of STEM materials, such as universities, publishers, and government agencies, could integrate such a pipeline into their document workflows. The research also highlights the importance of MathML support in browsers and assistive technologies as a prerequisite for accessible mathematics on the web.

Tags: document accessibility · PDF conversion · STEM accessibility · mathematical formulas · automated remediation · HTML conversion · ePub

Standards referenced: WCAG 2.1 · MathML · ARIA