Improving accessibility to mathematical formulas: the Wikipedia math accessor

Leo Ferres, Jose Fuentes Sepúlveda · 2011 · Proceedings of the International Cross-Disciplinary Conference on Web Accessibility (W4A) · doi:10.1145/1969289.1969322

Summary

This paper presents MathAcc, an assistive technology system that generates natural language descriptions in Spanish for the more than 355,000 mathematical formulas found across 26,174 Wikipedia articles. Wikipedia renders formulas as rasterised PNG images of LaTeX expressions, with the raw LaTeX code stored in the alt attribute. This makes formulas essentially inaccessible to screen reader users, since LaTeX read aloud is hard to follow and many readers do not know it. MathAcc works through a three-stage pipeline. First, it detects and curates LaTeX expressions in Wikipedia HTML, handling inconsistencies between LaTeX and HTML encoding. Second, it translates the curated LaTeX into content MathML using the SnuggleTeX library, which disambiguates semantic meaning (e.g., whether f^{-1} means the inverse function or f to the power of -1). Third, a template-based natural language generation (NLG) system produces Spanish-language descriptions using a stack-based algorithm that traverses the MathML tree. The linguistic templates were derived from a crowdsourced study in which 38 participants provided natural language descriptions of 21 mathematical concepts.
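The third stage can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name describe, the TEMPLATES table, and the Spanish phrasings are assumptions made for illustration, and only a handful of binary operators are covered. It shows one way an explicit stack can walk a content MathML tree and fill templates from the leaves upward.

```python
import xml.etree.ElementTree as ET

# Hypothetical Spanish templates keyed by content MathML operator name.
# The paper derives its templates from a crowdsourced study; these are
# illustrative stand-ins only.
TEMPLATES = {
    "plus": "{0} más {1}",
    "minus": "{0} menos {1}",
    "times": "{0} por {1}",
    "divide": "{0} dividido por {1}",
    "power": "{0} elevado a {1}",
    "eq": "{0} es igual a {1}",
}


def local(tag):
    """Strip an XML namespace prefix such as '{http://...}' from a tag."""
    return tag.rsplit("}", 1)[-1]


def describe(mathml):
    """Return a Spanish description of a small content MathML expression."""
    root = ET.fromstring(mathml)
    work = [(root, False)]  # nodes still to process, with a 'visited' flag
    results = []            # stack of descriptions for finished subtrees
    while work:
        node, visited = work.pop()
        name = local(node.tag)
        if name in ("ci", "cn"):
            # Leaf: an identifier or a number is read as-is.
            results.append((node.text or "").strip())
        elif name == "math":
            # Wrapper element: just schedule its children.
            for child in reversed(list(node)):
                work.append((child, False))
        elif name == "apply" and not visited:
            # First visit: revisit this node after its operands are done.
            work.append((node, True))
            for child in reversed(list(node)[1:]):
                work.append((child, False))
        elif name == "apply":
            # Second visit: operands are on the results stack; combine them.
            op = local(node[0].tag)
            n_args = len(node) - 1
            if op not in TEMPLATES or n_args != 2:
                raise ValueError(f"no template for <{op}> with {n_args} operands")
            start = len(results) - n_args
            args = results[start:]
            del results[start:]
            results.append(TEMPLATES[op].format(*args))
        else:
            raise ValueError(f"unsupported element <{name}>")
    return " ".join(results)


example = (
    "<math><apply><eq/><ci>y</ci>"
    "<apply><plus/><apply><power/><ci>x</ci><cn>2</cn></apply><cn>1</cn></apply>"
    "</apply></math>"
)
print(describe(example))  # y es igual a x elevado a 2 más 1
```

Raising an error for unknown operators, instead of guessing a phrasing, mirrors the idea that a formula either receives a description or is reported as not covered.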

Key findings

The NLG system generated linguistic descriptions for approximately 66% of the Wikipedia formulas tested, a rate that held steady across random samples of 100, 500, and 1,000 formulas. Of the 66 formulas given descriptions in the 100-formula sample, only 6 (about 9%) contained errors in the generated text. The 34% failure rate was primarily due to limitations in SnuggleTeX's LaTeX-to-MathML translation, not the NLG system itself. The crowdsourcing study of mathematical sub-language revealed strong consensus on how people naturally verbalise most mathematical operators: syntax varied markedly but semantics stayed consistent, with the most common template typically accounting for 40-65% of responses. Some operators proved problematic: the tensor product symbol was highly ambiguous, and the "otimes" symbol was largely unknown to participants. The system adds pauses between semantically self-contained units to help text-to-speech grouping. The work specifically targets Spanish speakers, addressing the scarcity of domain-specific assistive technologies for non-English languages.
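The figures reported for the 100-formula sample can be checked with a few lines of arithmetic. The sketch below uses only the counts given above. The variable names are illustrative, and the final figure (formulas described without error, as a share of the whole sample) is derived here rather than quoted from the paper.

```python
sample_size = 100   # formulas in the smallest random sample
described = 66      # formulas that received a description
with_errors = 6     # described formulas whose text contained errors

coverage = described / sample_size
error_rate = with_errors / described  # share of described formulas with errors
correct_overall = (described - with_errors) / sample_size  # derived, not quoted

print(f"coverage: {coverage:.0%}")                # coverage: 66%
print(f"errors among described: {error_rate:.1%}")  # errors among described: 9.1%
print(f"described correctly: {correct_overall:.0%}")  # described correctly: 60%
```

The error rate among described formulas is 6/66, a little over 9%.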

Relevance

Mathematical accessibility remains one of the most challenging areas of digital inclusion, particularly for STEM education. This work highlights a fundamental problem: mathematical notation is inherently visual and two-dimensional, while screen readers are sequential and linear. Simply reading LaTeX aloud is inadequate, because users need natural language descriptions that convey meaning, not markup syntax. Deriving verbalisation templates from how humans naturally describe formulas, rather than from technical notation rules, produces more intuitive output. The Spanish-language focus addresses a critical gap: most math accessibility tools target English, leaving the 329 million native Spanish speakers underserved. The work also demonstrates the scale of the problem (more than 355,000 formulas on Wikipedia alone) and the value of automated approaches. While MathML and tools like MathJax have since improved browser-level math accessibility, the challenge of providing meaningful spoken descriptions of complex formulas persists.

Tags: mathematical accessibility · blindness · natural language generation · MathML · Wikipedia · STEM accessibility · multilingual accessibility · screen readers

Standards referenced: MathML