Making Legacy Digital Content Accessible at Source

Sankalan Pal Chowdhary, Dipendra Manocha, M. Balakrishnan, Akashdeep Bansal, Himanshu Garg · 2019 · Proceedings of the 16th International Web for All Conference (W4A) · doi:10.1145/3315002.3332444

Summary

This demonstration paper presents a toolset for converting legacy-encoded Devanagari text to Unicode within Adobe InDesign, addressing a significant accessibility barrier for Indian-language content. Although Unicode was first published in 1991, vast amounts of digitally created content in Indian languages remain locked in proprietary legacy font encodings (such as Walkman-Chanakya) that were used before Unicode support was widely available in publishing software. When this content was originally created, the target was print, so the encoding was irrelevant: the text displayed correctly through the font. However, when the same content is exported to digital formats (PDF, EPUB), the result is text that screen readers cannot interpret, users cannot search, and search engines cannot index. Conversion is non-trivial for complex scripts like Devanagari. Legacy encodings differ from Unicode not only in code points but also in character ordering (e.g., the vowel sign for "i" appears before the consonant both visually and in the legacy encoding, but is stored after the consonant in Unicode), and legacy content often contains encoding inconsistencies that display correctly but break conversion. The authors developed InDesign scripts that perform the conversion in two stages: first mapping legacy character shapes to their Unicode equivalents (using placeholder characters for those that require repositioning), then applying regex-based repositioning rules to correct the character order. The mapping and repositioning rules are stored in external tab-separated files, making the approach adaptable to different legacy encodings and scripts without code changes.
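The two-stage conversion described above can be illustrated with a minimal Python sketch. It is not the authors' InDesign script. The legacy codes (`f`, `d`, `r`, `k`, `~`), the private-use placeholder U+E000, the rule contents, and all function names are assumptions invented for illustration; they do not reproduce Walkman-Chanakya or any other real font encoding. The rules are kept as tab-separated text to mirror the external rule files the summary mentions.

```python
import re

# Hypothetical mapping rules (legacy code <TAB> Unicode text).
# U+E000 is a private-use placeholder for the pre-base vowel sign "i".
MAPPING_TSV = (
    "f\t\uE000\n"   # vowel sign i, typed BEFORE the consonant in the toy encoding
    "d\t\u0915\n"   # KA
    "r\t\u0924\n"   # TA
    "k\t\u093E\n"   # vowel sign AA
    "~\t\u094D\n"   # virama (halant)
)

# Hypothetical repositioning rules (regex <TAB> replacement).
# Move the placeholder after the consonant cluster and turn it into U+093F.
REORDER_TSV = (
    "\uE000((?:[\u0915-\u0939]\u094D)*[\u0915-\u0939])\t\\1\u093F\n"
)


def load_rules(tsv_text):
    """Parse two-column tab-separated rules, skipping blank and '#' lines."""
    rules = []
    for line in tsv_text.splitlines():
        if line and not line.startswith("#"):
            source, target = line.split("\t")
            rules.append((source, target))
    return rules


def map_shapes(text, mapping):
    """Stage 1: replace legacy codes in a single pass, longest code first."""
    table = dict(mapping)
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: table[match.group(0)], text)


def reposition(text, rules):
    """Stage 2: apply each regex rule in order to fix character order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


def convert(legacy_text):
    """Run both stages with the toy rule sets defined above."""
    mapped = map_shapes(legacy_text, load_rules(MAPPING_TSV))
    return reposition(mapped, load_rules(REORDER_TSV))


if __name__ == "__main__":
    print(convert("fdrk"))
```

In this toy encoding, `convert("fd")` yields KA followed by the vowel sign i (U+0915 U+093F), the order Unicode expects. Supporting a different legacy encoding would, under these assumptions, mean swapping the two rule tables while leaving the functions unchanged.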

Key findings

The tools have been used to convert legacy Devanagari content from five distinct legacy encodings across more than 100 K-12 textbooks for Indian schools, in collaboration with CIET (Central Institute of Educational Technology) at NCERT (National Council of Educational Research and Training). Performing the conversion within InDesign, rather than through external tools, provides several advantages: it avoids the manual export/import cycle that introduces errors and disrupts layout; InDesign's scripting API allows direct access to document content and objects; and the conversion retains most source formatting, though some text reflow still requires manual post-processing. PageMaker documents can be imported into InDesign for conversion, extending the tools' reach to older content. The tools are freely available on GitHub. The authors note that access to InDesign document objects opens further possibilities: automated semantic markup (headings, list items) inferred from formatting attributes could make exported EPUBs more accessible, beyond fixing the character encoding alone.
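The idea of inferring semantic markup from formatting attributes, which the authors raise only as a possibility, might look roughly like the following Python sketch. Everything here is an assumption made for illustration: the `Paragraph` record, the size thresholds, the 80-character limit for bold headings, and the list-marker pattern are hypothetical and do not come from the paper or from InDesign's scripting API.

```python
import re
from dataclasses import dataclass


@dataclass
class Paragraph:
    """Hypothetical stand-in for a paragraph read from a layout document."""
    text: str
    font_size: float  # in points
    bold: bool = False


# Assumed list markers: a bullet, hyphen, or asterisk, or a number like "1." / "1)".
LIST_MARKER = re.compile(r"^(?:[\u2022\-\*]|\d+[.)])\s+")


def infer_tag(para, body_size=11.0):
    """Guess a semantic tag from formatting alone (thresholds are assumptions)."""
    if para.font_size >= body_size * 1.5:
        return "h1"
    if para.font_size > body_size or (para.bold and len(para.text) < 80):
        return "h2"
    if LIST_MARKER.match(para.text):
        return "li"
    return "p"


if __name__ == "__main__":
    sample = [
        Paragraph("Chapter 1", 18.0),
        Paragraph("Overview", 11.0, bold=True),
        Paragraph("\u2022 First point", 11.0),
        Paragraph("Plain body text.", 11.0),
    ]
    for para in sample:
        print(infer_tag(para), para.text)
```

A heuristic like this would need tuning per publication, since heading sizes and list conventions vary between documents; its output would be a starting point for manual review rather than final markup.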

Relevance

This work addresses a foundational accessibility problem that is easy to overlook: if text is not encoded in Unicode, it is effectively invisible to assistive technology. Screen readers cannot read it, search cannot find it, and text-to-speech systems cannot process it. For the millions of Hindi-speaking students with visual impairments in India, textbooks locked in legacy encodings are as inaccessible as scanned images of text. The scale of the problem is enormous: decades of Indian-language publishing produced content in dozens of proprietary font encodings, and converting this legacy corpus is essential for digital accessibility. For accessibility practitioners, the paper highlights that accessible publishing is not only about adding alt text, ARIA attributes, or semantic structure; the text itself must first be in a universally recognized encoding. Converting at the source (within the publishing application) rather than after export is strategically sound because it preserves formatting and enables additional accessibility enhancements such as semantic markup. The modular design with external rule files means the approach can be extended to other Indic scripts (Bengali, Tamil, Gujarati, etc.) that face the same legacy encoding problem.

Tags: Unicode · legacy encoding · accessible publishing · Devanagari · India · screen reader · digital inclusion · educational content · InDesign · EPUB

Standards referenced: Unicode