Transforming Japanese Archives into Accessible Digital Books

Tatsuya Ishihara, Toshinari Itoko, Daisuke Sato, Asaf Tzadok, Hironobu Takagi · 2012 · JCDL '12: Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries · doi:10.1145/2232817.2232836

Summary

This JCDL 2012 paper from IBM Research Tokyo and Haifa tackles a problem that remained invisible in most English-focused digitisation work: producing fully accessible digital books from Japanese archives, where the writing system uses more than 10,000 characters (kanji plus kana) and older books contain archaic glyphs that either sit outside Unicode or require handwritten character recognition to identify. The authors describe EBIS (E-Book Improvement System), a pipeline that converts scanned physical books into DAISY and EPUB deliverables in four steps that combine automation with human work: OCR, text correction, structurisation (adding headings, tables of contents, page numbers, running titles, and navigation metadata), and final DAISY generation. Two architectural choices underpin the work. First, the system preserves the original page images throughout: every character has a rectangle on the page and an external-metadata link back to it, encoded in an extended ALTO XML format that adds Japanese-specific writing-mode attributes. This allows gradual refinement, so operators can revisit a book months later and correct errors without losing earlier work. Second, the work is split into micro-tasks and dispatched to a volunteer crowd via web interfaces: CONCERT for character-level OCR correction and CMAS (Collaborative Metadata Authoring System) for page-level correction and structure authoring. Twenty-one volunteers trialled the system with the Japanese National Diet Library across 23 books published between 1870 and 2003 (3,021 pages), and the authors measured the per-step workload as a function of publication year and character familiarity.
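
The preserved-page-image idea can be illustrated with a small data-structure sketch. This is not the paper's extended ALTO schema or any EBIS code: the class `CharBox`, its fields, and the helper `page_text` are hypothetical names chosen for illustration. The sketch only shows the principle that each character keeps its rectangle on the page image and that corrections are appended rather than overwritten, so earlier work survives later passes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CharBox:
    """One recognised character tied to its rectangle on the page image.

    Hypothetical structure for illustration; not the extended ALTO format.
    """
    page: int      # page index within the scanned book
    x: int         # left edge of the rectangle, in image pixels
    y: int         # top edge of the rectangle, in image pixels
    width: int
    height: int
    ocr_text: str  # raw OCR output, kept as the baseline
    corrections: List[str] = field(default_factory=list)  # human fixes, oldest first

    def current_text(self) -> str:
        # The most recent correction wins; otherwise fall back to the OCR output.
        return self.corrections[-1] if self.corrections else self.ocr_text

    def correct(self, new_text: str) -> None:
        # Append instead of overwriting so the correction history is preserved.
        self.corrections.append(new_text)


def page_text(chars: List[CharBox], page: int) -> str:
    """Join the current text of every character on one page, in list order."""
    return "".join(c.current_text() for c in chars if c.page == page)


# Two characters on page 0; the second was misrecognised by OCR.
chars = [
    CharBox(page=0, x=10, y=10, width=20, height=20, ocr_text="日"),
    CharBox(page=0, x=10, y=30, width=20, height=20, ocr_text="木"),
]
chars[1].correct("本")  # a volunteer fixes the character later
```

Because the rectangle and the raw OCR output stay attached to each character, a later operator can still see what the engine originally produced and where it sits on the page.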

Key findings

Digitisation workload scaled sharply with book age: older Japanese books required significantly more effort at every step. Statistically significant negative correlations were found between the year of publication and the log-workload for layout correction (r = -0.67, p = 0.0009), CONCERT OCR correction (r = -0.67, p = 0.001), page session correction (r = -0.80, p < 0.001), and structurisation (r = -0.63, p = 0.002). On average, CONCERT OCR correction consumed the most operator time at 4.89 hours per 10,000 characters (SD = 3.46, max 12.1 hours), followed by page session correction (3.31 hours) and structurisation (2.46 hours); layout correction was relatively cheap at 0.94 hours. Character familiarity, measured against Japanese Wikipedia frequencies, was weakly but significantly correlated with per-character correction time (r = -0.31, p = 2.2e-16): less familiar characters took longer to correct, which is consistent with archaic characters driving the heaviest workload. Table-of-contents pages took significantly more structurisation time than any other page type (F(4,882) = 73.5, p < 0.01), with blank pages the fastest. The authors concluded that workload estimation for entire archives is not yet reliable without additional parameters (paper quality, font style, grammar, layout uniqueness, operator skill) and that crowdsourcing structurisation, not just OCR correction, is feasible but needs simpler interfaces.
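
As a rough worked example of what these averages imply, the sketch below scales the four mean per-step figures to a given character count. It assumes that all four figures share the "per 10,000 characters" unit, which the summary states explicitly only for CONCERT OCR correction. The names `MEAN_HOURS_PER_10K` and `estimate_hours` are hypothetical. Since the authors themselves caution that archive-level estimation is not yet reliable, this is a planning baseline only: it ignores publication year, which the correlations above show to matter a great deal.

```python
import math

# Mean operator hours per 10,000 characters, taken from the figures above.
# Assumption: every step is reported in the same per-10,000-character unit.
MEAN_HOURS_PER_10K = {
    "layout_correction": 0.94,
    "concert_ocr_correction": 4.89,
    "page_session_correction": 3.31,
    "structurisation": 2.46,
}


def estimate_hours(char_count: int) -> dict:
    """Scale the mean per-step hours linearly to a book of char_count characters."""
    scale = char_count / 10_000
    per_step = {step: hours * scale for step, hours in MEAN_HOURS_PER_10K.items()}
    # Total is computed from the four steps before being stored alongside them.
    per_step["total"] = sum(per_step.values())
    return per_step


baseline = estimate_hours(10_000)
print(f"{baseline['total']:.2f} hours per 10,000 characters")
```

Under that assumption the four steps sum to about 11.6 operator hours per 10,000 characters, with CONCERT OCR correction accounting for roughly two fifths of the total.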

Relevance

This paper matters for anyone producing accessible ebooks outside the English-only mainstream. It makes concrete what 'Asian-language accessibility' actually costs in operator hours, which is essential when scoping accessible-content projects for publishers, libraries, or blind-readers' associations serving non-Latin-script audiences. The preserved-page-image architecture is directly applicable to any digitisation workflow that expects to iterate on quality over time, and the ALTO-extension work informed later international efforts to support Japanese, Chinese, Korean, and other complex scripts. Practitioners should note the finding that table-of-contents structurisation is the single hardest navigation-metadata task, which is relevant when prioritising automation effort in accessible-document toolchains. The EBIS system became the foundation for the Japanese Braille Library crowdsourcing work reported later by the same group (Kobayashi et al. 2015). Limitations include the small pilot (23 books, mostly the first 60 pages of each), the absence of end-user (blind reader) quality evaluation, and the pre-standardised state of EPUB Japanese-language features at the time.

Tags: accessible ebooks · digitization · japanese · OCR · DAISY · EPUB · crowdsourcing · print disability · structural metadata · ALTO · kanji · national libraries

Standards referenced: DAISY · EPUB · ALTO · ANSI/NISO Z39.86-2005 · CSS Writing Modes Module Level 3