Corpus Studies in Word Prediction

Keith Trnka, Kathleen F. McCoy · 2007 · Proceedings of the 9th International ACM SIGACCESS Conference on Computers and Accessibility (Assets '07) · doi:10.1145/1296843.1296877

Summary

This paper from the University of Delaware investigates how the choice and combination of training corpora affect the performance of statistical word prediction systems for Augmentative and Alternative Communication (AAC) devices. The fundamental challenge is that AAC users communicate far below the rate of speech, and word prediction can help by reducing the number of keystrokes needed. However, no large corpus of AAC-specific text exists for training language models, so researchers must approximate it using other text sources. The authors evaluate word prediction using seven corpora spanning spoken and written domains: AAC Email (emails from AAC users), Callhome (telephone conversations between family and friends), Charlotte (narrative conversations), SBCSAE (Santa Barbara face-to-face conversations), Micase (academic spoken English), Switchboard (prompted telephone conversations, ~2.9 million words), and Slate (online magazine, ~3.9 million words). Each corpus was preprocessed to approximate AAC text: it was reformatted to a standard style, speech repairs were removed (backchannels, repetitions, abandoned words), and punctuation was normalised. The study uses keystroke savings, the percentage of keystrokes a user avoids by selecting predictions instead of typing every character, as the primary evaluation metric, calculated on held-out test data.
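
As an illustration of the metric, the sketch below simulates keystroke savings with a toy frequency-ranked unigram predictor. It is not the authors' system: the predictor, the function names, the prediction-list size, and the cost model (one keystroke per typed letter, one keystroke to select a prediction, and one keystroke for the space after a fully typed word) are all assumptions made for this example.

```python
from collections import Counter


def predict(counts, prefix, list_size=5):
    """Return up to list_size known words starting with prefix, most frequent first."""
    candidates = [w for w in counts if w.startswith(prefix)]
    candidates.sort(key=lambda w: (-counts[w], w))  # ties broken alphabetically
    return candidates[:list_size]


def keystrokes_for_word(counts, word, list_size=5):
    """Keystrokes needed to enter one word under the assumed cost model."""
    for typed in range(len(word)):
        if word in predict(counts, word[:typed], list_size):
            return typed + 1  # letters typed so far, plus one selection keystroke
    return len(word) + 1  # never predicted: every letter plus the trailing space


def keystroke_savings(counts, test_tokens, list_size=5):
    """Percentage of keystrokes saved relative to typing every character."""
    baseline = sum(len(w) + 1 for w in test_tokens)  # letters plus a space per word
    if baseline == 0:
        raise ValueError("test_tokens must not be empty")
    used = sum(keystrokes_for_word(counts, w, list_size) for w in test_tokens)
    return 100.0 * (baseline - used) / baseline


# Toy training and held-out text, invented for this example.
train_counts = Counter("the cat sat on the mat".split())
savings = keystroke_savings(train_counts, ["the", "cat"], list_size=1)
print(f"keystroke savings: {savings:.1f}%")
```

With a one-item prediction list, "the" is offered before any letter is typed and "cat" after one letter, so 3 keystrokes replace 8. A real evaluation would use an n-gram or topic-adapted model trained on one corpus and tested on held-out text from another.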

Key findings

The most significant finding is that combining in-domain training data with out-of-domain data often produces better word prediction than either alone. A small amount of in-domain data combined with a much larger amount of dissimilar out-of-domain text can be more beneficial than a larger quantity of similar text alone. However, there is a limit: when the in-domain and out-of-domain data are very different, out-of-domain training provides little or no benefit. Callhome (casual family conversation) saw no benefit from out-of-domain Slate (formal written text), even with 300 times as much training data. Topic modeling proved portable across domains: a topic model trained on Switchboard improved keystroke savings significantly on all corpora except AAC Email, even when the testing domain differed substantially from Switchboard. Vocabulary analysis revealed that named entities (proper nouns) and out-of-vocabulary (OOV) words are major factors affecting cross-domain performance. Slate, a written corpus in which 12% of words are named entities, performed poorly under out-of-domain training, although named entity caching in a deployed system would avoid many of these prediction misses. The corpus diversity analysis (OOV cross-validation) showed that Switchboard's vocabulary was the most evenly distributed across topics, while specialised corpora like AAC Email and Micase had higher self-OOV rates, reflecting their focused vocabularies.
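
The role of out-of-vocabulary words can be made concrete with a small sketch. The code below computes the OOV rate of a test text against an in-domain vocabulary, an out-of-domain vocabulary, and the two pooled together. The token lists are invented toy data, and pooling by simple concatenation is an assumption for illustration; this summary does not specify how the authors combined corpora.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training tokens."""
    if not test_tokens:
        raise ValueError("test_tokens must not be empty")
    vocabulary = set(train_tokens)
    misses = sum(1 for word in test_tokens if word not in vocabulary)
    return misses / len(test_tokens)


# Hypothetical toy corpora: a tiny in-domain sample and a larger out-of-domain one.
in_domain = "i want my talker please".split()
out_of_domain = "the senator said the bill would pass".split()
test_text = "i want the talker".split()

print("in-domain only:    ", oov_rate(in_domain, test_text))
print("out-of-domain only:", oov_rate(out_of_domain, test_text))
print("pooled:            ", oov_rate(in_domain + out_of_domain, test_text))
```

In this toy case each source alone leaves some test words unseen, while the pooled vocabulary covers all of them. A word the model has never seen cannot be predicted, so a lower OOV rate raises the ceiling on achievable keystroke savings.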

Relevance

This paper addresses a practical problem that remains relevant for AAC device development: how to build effective language models when the target population's text is scarce and difficult to collect. The finding that mixed-domain training often outperforms single-domain training has direct implications for developers of AAC prediction systems — rather than waiting for large AAC-specific corpora, they can leverage general-purpose text combined with whatever AAC data is available. The portability of topic modeling across very different text domains is particularly encouraging, suggesting that advanced NLP techniques can improve AAC prediction even when trained on dissimilar data. For practitioners, the paper highlights that keystroke savings (the standard metric) may not fully capture real-world benefit because factors like cognitive load of scanning prediction lists, speed of selection, and communication partner patience also affect whether word prediction actually improves communication rate. The vocabulary and named entity analysis provides practical guidance on when cross-domain training will and will not help.

Tags: word prediction · AAC · language model · natural language processing · corpus linguistics · keystroke savings · topic modeling · n-gram · augmentative communication · communication rate