Adapting web content for low-literacy readers by using lexical elaboration and named entities labeling

Willian M. Watanabe, Arnaldo Candido, Marcelo A. Amâncio, Matheus de Oliveira, Thiago A. S. Pardo, Renata P. M. Fortes, Sandra M. Aluísio · 2010 · Proceedings of the 2010 International Cross Disciplinary Conference on Web Accessibility (W4A) · doi:10.1145/1805986.1805998

Summary

This paper presents Educational FACILITA, a browser-based tool that uses Natural Language Processing (NLP) to automatically adapt web content for low-literacy readers. Developed at the University of São Paulo as part of the PorSimples text simplification project for Brazilian Portuguese, the tool responds to a significant literacy gap: 28% of the Brazilian population is functionally illiterate, and only 25% reaches advanced literacy. Educational FACILITA employs two NLP techniques. Lexical elaboration identifies complex words in web text and offers simpler synonyms. Named entity recognition and classification identifies proper nouns (people, places, organizations, events) and links them to Wikipedia definitions. The tool runs as a Firefox Jetpack extension: it extracts the main textual content from a web page using a readability module, sends it to server-side NLP services, and presents the reading assistance within the original page without changing its design or functionality. Complex words are identified by checking each word against three dictionaries of simple words (common words for youngsters, frequent words from children's news, and concrete words), and synonyms are drawn from two Portuguese thesauruses (TeP 2.0, with 45,000 words in 20,000 synonym clusters, and PAPEL, with 100,000 words). Part-of-speech tagging is used to handle word ambiguity, and synonym suggestions are ranked by web frequency as a proxy for simplicity.
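The lexical elaboration step can be illustrated with a minimal Python sketch. It is not the tool's implementation: the word list, thesaurus entries, and frequency counts are toy English stand-ins invented for illustration (the real resources are Portuguese), the function names are hypothetical, and the sketch assumes that a word absent from the simple-word lists counts as complex.

```python
# Toy data: every entry below is an illustrative assumption, not one of the paper's resources.
# A single set stands in for the three simple-word dictionaries.
SIMPLE_WORDS = {"the", "was", "in", "house", "big", "help", "use"}

# Thesaurus keyed by (word, part of speech) so an ambiguous word gets synonyms for the right sense.
THESAURUS = {
    ("utilize", "VERB"): ["employ", "use", "apply"],
    ("residence", "NOUN"): ["dwelling", "house", "home"],
}

# Stand-in for web frequency counts; a higher count is treated as "simpler".
WEB_FREQUENCY = {"use": 900, "apply": 300, "employ": 120,
                 "house": 800, "home": 950, "dwelling": 40}


def is_complex(word):
    """Assumption: a word is complex if it is missing from the simple-word list."""
    return word.lower() not in SIMPLE_WORDS


def simpler_synonyms(word, pos):
    """Return the synonyms for (word, pos), most frequent first."""
    candidates = THESAURUS.get((word.lower(), pos), [])
    return sorted(candidates, key=lambda w: WEB_FREQUENCY.get(w, 0), reverse=True)


def elaborate(tagged_tokens):
    """Map each complex word that has synonyms to its ranked suggestions."""
    suggestions = {}
    for word, pos in tagged_tokens:
        if is_complex(word):
            ranked = simpler_synonyms(word, pos)
            if ranked:
                suggestions[word] = ranked
    return suggestions


tagged = [("The", "DET"), ("residence", "NOUN"), ("was", "VERB"), ("big", "ADJ")]
print(elaborate(tagged))  # {'residence': ['home', 'house', 'dwelling']}
```

The input is assumed to be already tokenized and tagged. Keying the thesaurus on the part of speech mirrors the summary's point that tagging is what lets an ambiguous word receive synonyms for the intended sense.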

Key findings

The tool's architecture demonstrates a practical way to combine client-side browser integration with server-side NLP processing. The lexical elaboration module tokenizes the text, tags parts of speech, checks each word against the simple-words dictionaries, and, when a complex word is found, retrieves simpler synonyms ordered by web frequency. The named entity recognition module is built on the Rembrandt system and adapted for low-literacy users with simplified category names (for example, "historical event" instead of "ephemeris"); it achieves about 57% general accuracy and enriches entities with Wikipedia definitions. A statistical test found that 73.5% of Wikipedia articles begin with a definition sentence, which supports using the opening sentence as the definition. The interaction model was designed for users with minimal computer experience: users activate the tool, complex words and named entities are highlighted in context on the original page, and clicking one reveals synonyms or a definition in a tooltip-style popup. Unlike text simplification, which rewrites content, the approach preserves the original text and adds explanatory layers, following educational research suggesting that elaboration is more favorable than simplification for vocabulary acquisition. The system was also designed to be language-portable: replacing the NLP modules would adapt it to other languages.
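The named entity enrichment described above can be sketched in Python under stated assumptions. Only the "ephemeris" to "historical event" renaming comes from this summary. The function names, the sample article text, and the naive first-sentence heuristic are illustrative and are not taken from Rembrandt or Educational FACILITA.

```python
import re

# Only this one renaming is mentioned in the summary; any further entry would be an assumption.
SIMPLIFIED_LABELS = {"ephemeris": "historical event"}


def simplify_label(category):
    """Return a reader-friendly category name, falling back to the original one."""
    return SIMPLIFIED_LABELS.get(category, category)


def first_sentence(article_text):
    """Naive heuristic: cut at the first '.', '!' or '?' followed by whitespace or end of text.

    Abbreviations such as "St." would cut the sentence too early.
    """
    match = re.search(r"[.!?](?=\s|$)", article_text)
    return article_text if match is None else article_text[: match.end()]


def annotate_entity(name, category, article_text):
    """Build the label and definition shown when a reader clicks an entity."""
    return {
        "entity": name,
        "label": simplify_label(category),
        "definition": first_sentence(article_text.strip()),
    }


# Made-up article text for illustration, not a quotation from Wikipedia.
sample = ("The Proclamation of the Republic was an 1889 event in Brazil. "
          "It ended the monarchy.")
print(annotate_entity("Proclamation of the Republic", "ephemeris", sample))
```

Taking the opening sentence as the definition relies on the reported finding that most Wikipedia articles begin with one. A production system would need proper sentence segmentation rather than this regular expression.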

Relevance

This research addresses a critical but often overlooked dimension of web accessibility: the readability and comprehension barriers faced by people with low literacy skills. While WCAG includes success criteria on reading level (3.1.5) and unusual words (3.1.3), practical tools for automatically making web content more comprehensible have been scarce. The lexical elaboration approach, which presents simpler synonyms for complex words on demand, is particularly noteworthy because it preserves the original content while scaffolding understanding, so it supports learning rather than merely bypassing difficulty. This aligns with approaches to cognitive accessibility that emphasize providing multiple representations of information. The work is especially relevant for the Global South, where functional illiteracy rates remain high and digital inclusion efforts must contend with language barriers that standard accessibility tools (primarily developed for English) do not address. For practitioners, the research highlights that accessibility extends beyond sensory and motor accommodations to include cognitive and literacy support, a principle that has gained recognition through the W3C's Cognitive and Learning Disabilities Accessibility Task Force. Modern language models are far more capable at the NLP tasks demonstrated here, which suggests significant potential for AI-powered reading assistance tools.

Tags: low literacy · natural language processing · text simplification · lexical elaboration · named entity recognition · content adaptation · reading accessibility · cognitive accessibility · digital inclusion

Standards referenced: WCAG 1.0 · WCAG 2.0