Adapting Word Prediction to Subject Matter without Topic-Labeled Data

Keith Trnka · 2008 · Proceedings of the 10th International ACM SIGACCESS Conference on Computers and Accessibility (Assets '08) · doi:10.1145/1414471.1414556

Summary

This paper proposes a method for adapting word prediction systems to the current topic of discourse without requiring human-labeled topic categories in the training data. Word prediction is a key feature of Augmentative and Alternative Communication (AAC) devices: by predicting the words a user intends, it lets them select words with fewer keystrokes and so reduces the effort of producing text. Standard prediction systems use n-gram language models trained on text corpora, but these models fit the predominant topics of their training data and perform poorly when the user writes about other subjects. For example, a model trained on school textbooks may predict well for academic topics but poorly for conversations at home. Previous approaches to topic adaptation required manually labeling training texts by topic, which is labor-intensive and limits the text collections that can be used. The author's approach instead treats each document in the training corpus as its own topic and computes a similarity score between the text typed so far and each training document. The similarity scores are normalized and used as weights in a trigram language model, so that predictions from the documents most similar to the current context count more heavily.
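The weighting scheme described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: for brevity it mixes unigram models rather than trigram models, and the similarity measure (cosine similarity over raw word counts), the function names `cosine` and `adapted_unigram`, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of words (assumed similarity measure)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def adapted_unigram(context_words, documents):
    """Mix per-document unigram models, weighted by similarity to the context.

    Each training document acts as its own "topic". Similarity scores are
    normalized to sum to 1 and used as mixture weights.
    """
    context = Counter(context_words)
    # Skip empty documents so every per-document model is well defined.
    doc_counts = [Counter(doc) for doc in documents if doc]
    if not doc_counts:
        return {}

    sims = [cosine(context, counts) for counts in doc_counts]
    total = sum(sims)
    if total == 0:
        # No overlap with any document: fall back to uniform weights.
        weights = [1.0 / len(doc_counts)] * len(doc_counts)
    else:
        weights = [s / total for s in sims]

    probs = {}
    for weight, counts in zip(weights, doc_counts):
        doc_len = sum(counts.values())
        for word, count in counts.items():
            probs[word] = probs.get(word, 0.0) + weight * count / doc_len
    return probs


# Toy training "corpus": one school-like and one home-like document.
documents = [
    ["math", "exam", "homework", "exam"],
    ["dinner", "kitchen", "dinner", "tv"],
]
# Text typed so far by the user.
probs = adapted_unigram(["exam", "tomorrow"], documents)
for word, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{word}: {p:.2f}")
```

Because the typed context overlaps only with the school-like document, that document receives all of the mixture weight here and its words rank highest. A practical system would also need smoothing and a fallback to the unadapted model, which this sketch leaves out.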

Key findings

In-domain evaluation (training and testing on the same corpus) showed that document-based topic modeling improved keystroke savings on all corpora compared to a baseline trigram model, though the improvement was statistically significant only on Callhome. On the Switchboard corpus, the document-topic approach achieved 61.42% keystroke savings compared to 60.35% for the baseline trigram, which is comparable to the 61.48% achieved with human-annotated topics in prior work. The more realistic mixed-domain evaluation (training on texts pooled from all corpora) showed statistically significant improvements on all corpora. The best mixed-domain result was on Switchboard (61.17% keystroke savings, +1.37% improvement), which suggests the model can focus on the most relevant training documents while still drawing on all available data. Testing also included a small collection of AAC user emails, where the method improved keystroke savings from 48.92% to 49.35%. The approach requires only a collection of documents, with no topic labels, so it can be applied to any unlabeled text corpus.
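Keystroke savings, the metric reported above, is commonly defined in the word prediction literature as the percentage of keystrokes avoided relative to typing every character. The sketch below assumes that definition (this summary does not state the exact formula used in the paper). The function name and the character counts are hypothetical, while the percentages in the last lines are the figures quoted above.

```python
def keystroke_savings(total_chars: int, keystrokes_used: int) -> float:
    """Percentage of keystrokes saved relative to typing every character."""
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    return 100.0 * (total_chars - keystrokes_used) / total_chars


# Hypothetical counts: a 1000-character text entered with 386 keystrokes.
example = keystroke_savings(1000, 386)

# Absolute gains, in percentage points, from the figures quoted above.
switchboard_gain = 61.42 - 60.35  # in-domain Switchboard, topic model vs. baseline
aac_email_gain = 49.35 - 48.92    # AAC user emails

print(f"{example:.2f}% keystroke savings")
print(f"Switchboard: +{switchboard_gain:.2f} points")
print(f"AAC emails: +{aac_email_gain:.2f} points")
```

Reading the gains as percentage-point differences keeps them comparable across corpora with different baselines.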

Relevance

This work addresses a practical challenge in AAC technology: word prediction systems that offer contextually inappropriate suggestions slow communication down instead of speeding it up. For AAC users who already face significant communication rate barriers, every percentage point of keystroke savings translates to meaningful gains in conversation speed and reduced fatigue. The document-as-topic approach is particularly valuable because it removes the need for manual topic labeling, allowing AAC systems to adapt using whatever text collections are available, including the user's own prior writing. This anticipates the personalization capabilities that modern predictive text systems now offer. For accessibility practitioners and AAC developers, the findings indicate that topic-aware language modeling is achievable without expensive human annotation, which lowers the barrier to building better-adapted prediction systems for diverse communication contexts.

Tags: word prediction · AAC · language model · topic modeling · natural language processing · keystroke savings · communication rate · text entry