The Intelligent Voice-Interactive Interface

Christopher Schmandt, Eric A. Hulteen · 1982 · Proceedings of the 1982 Conference on Human Factors in Computing Systems (CHI '82) · doi:10.1145/800049.801812

Summary

Schmandt and Hulteen describe the 'Put That There' system built at MIT's Architecture Machine Group (the precursor to the Media Lab), one of the earliest working implementations of a conversational, multimodal human–computer interface. Seated in a chair ten feet from a thirteen-foot rear-projected display in the 'media room', a user wearing a head-mounted microphone and a wrist-mounted Polhemus six-degree-of-freedom position sensor builds and edits a graphical database (in the demo, a map of Caribbean shipping) by speaking natural commands and pointing at the screen. The speech recogniser was a Nippon Electric Company DP-100 configured for connected speech with a 120-word vocabulary; the gesture tracker sampled the user's wrist position and orientation 40 times a second. Output came back through a speech synthesiser, pre-recorded digitised audio, and graphical updates on the display. The paper's central argument is methodological: because speech recognition hardware will never be 100% accurate, designers should stop chasing raw accuracy and instead design for *effective accuracy*, meaning the usefulness of the system in context. To that end, 'Put That There' combines four complementary techniques: redundant input channels (voice and gesture, with eye-tracking planned), cascading syntactic and semantic analysis that loops back to the user for clarification, context-sensitive interpretation grounded in the current database state, and rapid verbal and graphical feedback so that errors are detected and corrected quickly. The paper concludes with a vision of generalised multimodal interaction in which any command function can be activated through any input channel.

Key findings

The contribution is a working systems demonstration and the design principles that follow from it, rather than a statistical study. The authors show that multimodal redundancy mitigates recognition error: when the speech recogniser hears 'move that' and gesture tracking has simultaneously captured a pointing vector, the system can resolve 'that' from the gesture even if the speech string is partially mis-recognised. Syntactic analysis uses command templates with re-entrant matching, so that ambiguous input triggers targeted follow-up questions ('which one?' when multiple instances of a ship type exist, 'what object?' when no pointing vector is available). Semantic analysis cross-references the database to rule out meaningless actions, for example refusing to 'move' a landmass while accepting movement of a ship. Crucially, the authors observe that users tolerate imperfect speech recognition far better when the system's follow-up questions reveal *which* part of the command it did and did not understand, rather than issuing a generic 'please repeat that.' Graphical feedback, in the form of an outline around the object the system thinks is selected, gives users early confirmation or a chance to abort. The two-channel implementation (voice and wrist pointing) demonstrates the concept, with eye-tracking identified as a planned third channel to enable even more natural 'move that [while looking at the bowl]' interactions.
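
The interplay of redundancy, targeted clarification, and semantic checking described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Item` class, the `DATABASE` contents, the `interpret` function, and the exact reply strings are all hypothetical, and a single 'move' template stands in for the full command grammar. A mis-recognised word is represented as `None`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Item:
    """Hypothetical database entry; not taken from the paper."""
    name: str
    kind: str
    movable: bool


# Illustrative database state (assumed contents).
DATABASE = [
    Item("freighter", "ship", True),
    Item("cruise ship", "ship", True),
    Item("island", "land", False),
]


def interpret(words: List[Optional[str]], pointed_at: Optional[Item]) -> str:
    """Match a 'move <object>' template, asking a targeted question on failure.

    words      -- recognised tokens; None marks a word the recogniser missed
    pointed_at -- the item under the pointing vector, or None if there is none
    """
    verb = words[0] if words else None
    if verb != "move":
        return "what command?"

    ref = words[1] if len(words) > 1 else None
    if ref is None or ref == "that":
        # Deictic or missing object word: fall back on the gesture channel.
        if pointed_at is None:
            return "what object?"
        target = pointed_at
    else:
        # Named object: look it up in the current database state.
        matches = [item for item in DATABASE if ref in (item.name, item.kind)]
        if not matches:
            return "what object?"
        if len(matches) == 1:
            target = matches[0]
        elif pointed_at in matches:
            target = pointed_at  # gesture disambiguates between candidates
        else:
            return "which one?"

    # Semantic check: rule out actions that make no sense for this object.
    if not target.movable:
        return f"cannot move the {target.name}"
    return f"moving the {target.name}"
```

Under these assumptions, `interpret(["move", None], DATABASE[1])` still succeeds because the gesture fills the slot the recogniser missed, while `interpret(["move", "ship"], None)` returns 'which one?' rather than a generic request to repeat the whole command.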

Relevance

This paper is a foundational reference for modern multimodal and voice-based accessible interfaces, from screen-reader and touch interaction on iOS and Android, to voice-controlled smart-home systems, to AAC devices that combine eye gaze with speech. The 'effective accuracy' framing is particularly important for accessibility practice: it reminds designers that perfect recognition is not required for a usable speech interface, provided the system supports error detection, targeted clarification, and multiple redundant input channels. The principle that any command should be available via any input channel anticipates, by decades, the input-modality independence later formalised in WCAG and the multiple-modalities thinking in ability-based design. Limitations: the paper reports no user study beyond the system demonstration, the 120-word vocabulary is tiny by modern standards, and the accessibility framing is implicit. Nonetheless, for practitioners working on voice interfaces, AAC, or multimodal AT, this is essential historical grounding.

Tags: speech recognition · voice interface · multimodal interaction · gesture recognition · historical · natural language interface · redundant input · accessible input