Speech Dasher: A Demonstration of Text Input using Speech and Approximate Pointing

Keith Vertanen, David J.C. MacKay · 2014 · Proceedings of the 16th International ACM SIGACCESS Conference on Computers & Accessibility (ASSETS 2014) · doi:10.1145/2661334.2661420

Summary

This paper demonstrates Speech Dasher, a multimodal text entry system that combines speech recognition with the Dasher zooming interface to enable fast, corrected text input using only speech and approximate pointing, such as gaze. The core problem is that speech dictation is fast, but correcting recognition errors is slow and frustrating. This is especially true when corrections are also made by voice, since recognizers tend to repeat the same mistakes. Keyboard and mouse correction works but requires precise motor control that some users lack. Speech Dasher addresses this by having users first speak their intended text, then navigate the Dasher interface (controlled by an eye tracker or another approximate pointing device) to confirm correct words and fix errors. Dasher presents a zooming world of nested boxes in which each box represents a letter and is sized in proportion to that letter's probability under a language model. In Speech Dasher, the probability model is built from the speech recognizer's word lattice, a graph of word hypotheses annotated with acoustic and language model scores. Primary predictions (the most likely words) appear as large, easy-to-select boxes, while secondary predictions and the full alphabet remain available through an "escape box" mechanism, so any word can be written whether or not the recognizer predicted it. When the user writes a word outside the lattice, the system searches for paths that tolerate substitution or deletion errors so that its predictions can reconnect with the lattice.
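The box-sizing idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: a flat dictionary of word hypotheses stands in for the recognizer's lattice, and the function name, the end-of-word symbol, and the fixed `escape_mass` of 0.1 are all assumptions made for illustration. Given the letters written so far, it computes what fraction of the parent box each next letter would occupy, while always reserving room for an escape box.

```python
from collections import defaultdict

END = "_"         # hypothetical end-of-word symbol
ESCAPE = "<esc>"  # hypothetical label for the escape box


def next_letter_boxes(hypotheses, prefix, escape_mass=0.1):
    """Return {symbol: fraction of the parent box} for symbols following `prefix`.

    `hypotheses` maps candidate words to probabilities (a simplified stand-in
    for a word lattice). `escape_mass` is an assumed constant, not a value
    taken from the paper.
    """
    mass = defaultdict(float)
    for word, prob in hypotheses.items():
        if word.startswith(prefix):
            # The next symbol is either the following letter or end-of-word.
            symbol = word[len(prefix)] if len(word) > len(prefix) else END
            mass[symbol] += prob

    total = sum(mass.values())
    if total == 0:
        # No hypothesis matches: only the escape box is offered.
        return {ESCAPE: 1.0}

    # Renormalise over matching hypotheses, leaving room for the escape box.
    boxes = {s: (1.0 - escape_mass) * m / total for s, m in mass.items()}
    boxes[ESCAPE] = escape_mass
    return boxes


if __name__ == "__main__":
    hyps = {"the": 0.6, "they": 0.2, "a": 0.2}
    for prefix in ("", "the", "x"):
        print(repr(prefix), next_letter_boxes(hyps, prefix))
```

In this toy example, likely letters get large boxes and are easy to hit with imprecise pointing, while the escape box stays reachable even when no hypothesis matches what has been written.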

Key findings

A longitudinal formative study with three users (American, British, and German accents) showed that after 6 to 8 training sessions, users achieved an average entry rate of 40 corrected words per minute (wpm) with Speech Dasher, compared to 20 wpm with standard Dasher, making Speech Dasher about twice as fast. This was achieved despite an average initial word error rate (WER) of 22% from the speech recognizer. Recognition accuracy varied significantly by accent: 7.8% WER for the American user (54 wpm), 12.4% for the British user (42 wpm), and 46.7% for the German user (23 wpm). Even on sentences containing at least one recognition error, users still achieved 30 wpm. Critically, users corrected virtually all errors: the WER of the final text was only 1.8% with Speech Dasher, comparable to 1.3% with standard Dasher. The system requires only approximate pointing, making it compatible with eye trackers and other low-precision input devices.
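The headline averages can be checked against the per-user figures with a short Python sketch. It assumes the reported averages are unweighted means across the three users, which this summary does not state explicitly. It also includes a conventional word-level edit-distance definition of WER for illustration; the paper's exact scoring procedure is not described here, and the function and variable names are hypothetical.

```python
# Per-user figures as reported in the summary.
users = {
    "American": {"wer": 7.8, "wpm": 54},
    "British": {"wer": 12.4, "wpm": 42},
    "German": {"wer": 46.7, "wpm": 23},
}

# Assumption: the reported averages are unweighted means over users.
mean_wer = sum(u["wer"] for u in users.values()) / len(users)
mean_wpm = sum(u["wpm"] for u in users.values()) / len(users)
speedup = mean_wpm / 20  # relative to the 20 wpm reported for standard Dasher


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by reference length, as a percentage.

    Assumes a non-empty reference string.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or a match at no cost
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    print(f"mean WER: {mean_wer:.1f}%")
    print(f"mean entry rate: {mean_wpm:.1f} wpm")
    print(f"speedup over standard Dasher: {speedup:.2f}x")
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under that assumption the per-user numbers give a mean WER of about 22.3% and a mean entry rate of about 39.7 wpm, in line with the rounded figures of 22% and 40 wpm.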

Relevance

Speech Dasher is highly relevant for people who can speak but cannot use conventional keyboards and mice, such as individuals with severe motor impairments who retain speech and eye movement. The 40 wpm rate with near-complete error correction makes it one of the faster assistive text entry methods demonstrated with eye tracking. Combining two imperfect input channels (error-prone speech recognition and imprecise gaze pointing) into a reliable text entry system is an elegant design pattern for assistive technology. The work also highlights how accent and language background significantly affect speech recognition accuracy, and consequently text entry speed, an important consideration when deploying speech-based assistive tools internationally. Limitations include the small study size (three able-bodied users) and the several hours of training required.

Tags: speech recognition · eye tracking · gaze input · text entry · error correction · assistive technology · language model · multimodal interaction