The Use of Gestures in Multimodal Input
Simeon Keates, Peter Robinson · 1998 · Proceedings of the Third International ACM Conference on Assistive Technologies (Assets '98) · doi:10.1145/274497.274505
Summary
This paper from the University of Cambridge describes the development and evaluation of a prototype multimodal input system for users with motion impairments, for whom standard keyboard and mouse arrangements are often unusable. The system combined two gestural input channels, head gestures tracked by a Polhemus sensor and hand gestures made with an analogue joystick, to explore whether multimodal input could improve computer interaction for this population. The researchers implemented three control strategies: Single Mode (using one channel at a time), Both Modes Same (producing the same gesture simultaneously on both channels to improve recognition accuracy), and Both Modes Different (producing a different gesture on each channel to increase vocabulary size). The vocabulary comprised six gestures, four directional (UP, DOWN, LEFT, RIGHT) and two oscillatory (YES, NO), recognized by the Jester software, which used a hybrid algorithm combining dynamic time warping with heuristic rules. User trials ran for three months at the Papworth Trust with seven participants whose conditions included athetoid cerebral palsy, tetraplegia, muscular dystrophy, spastic quadriplegia, and Friedreich's ataxia. Each attempt was scored +1 for a correct recognition, 0 for a non-recognition, and -1 for a misrecognition, with totals scaled to a maximum of 100.
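The scoring scheme described above can be illustrated with a short sketch. The summary gives only the per-attempt points and the maximum of 100, so the scaling used here (100 times the mean points per attempt) is an assumption, and the outcome labels and the function name `reliability_score` are hypothetical.

```python
# Per-attempt points as stated in the summary.
POINTS = {"correct": 1, "none": 0, "wrong": -1}


def reliability_score(outcomes):
    """Return a score in [-100, 100]; all-correct attempts give 100.

    Assumption: the total is scaled as 100 * (sum of points) / (attempts).
    The paper's exact scaling is not given in this summary.
    """
    if not outcomes:
        raise ValueError("need at least one attempt")
    total = sum(POINTS[o] for o in outcomes)
    return 100.0 * total / len(outcomes)


# Example session: 7 correct, 2 not recognized, 1 misrecognized.
session = ["correct"] * 7 + ["none"] * 2 + ["wrong"]
print(reliability_score(session))
```

Under this assumed scaling a misrecognition costs twice as much as a non-recognition relative to a correct attempt, so a recognizer that rejects uncertain input scores higher than one that guesses wrongly.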
Key findings
The trials produced a counterintuitive result: multimodal input performed worse than single-mode input across all measures. Single Mode head gestures achieved the highest reliability scores, and their improvement over the trials followed the Power Law of Practice learning curve. Both Modes Same degraded performance because producing two simultaneous physical motions created interference and increased cognitive load. Both Modes Different proved so demanding that users could not produce the two gestures simultaneously and performed them sequentially instead. Peak information transfer rates were 0.72 bits/second for head input and 0.77 bits/second for hand input in Single Mode, compared with 0.65 bits/second for Both Modes Same and 0.56 bits/second for Both Modes Different. A three-gesture vocabulary consistently outperformed the six-gesture set, indicating that the cognitive cost of a larger vocabulary outweighed the throughput gain. Under cognitive overload, users abandoned the visual screen prompts entirely and relied on auditory prompts from the operator, because spoken instructions were easier to process. The researchers found that heuristic rules for combining multimodal data could partially compensate for errors but could not overcome the fundamental problem of excessive user load.
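For readers unfamiliar with the Power Law of Practice mentioned above, its classical form relates the time for the n-th trial to the first as T_n = T_1 * n^(-alpha). The sketch below fits T_1 and alpha by least squares on log-transformed data. The data are synthetic, the function names are hypothetical, and the summary does not say how the paper itself fitted its learning curves.

```python
import math


def power_law_time(t1, n, alpha):
    """Classical Power Law of Practice: time on trial n (n starts at 1)."""
    return t1 * n ** (-alpha)


def fit_power_law(times):
    """Estimate (t1, alpha) from per-trial times; times[0] is trial 1.

    Fits log T = log T1 - alpha * log n by ordinary least squares.
    """
    if len(times) < 2:
        raise ValueError("need at least two trials")
    xs = [math.log(n) for n in range(1, len(times) + 1)]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    t1 = math.exp(mean_y - slope * mean_x)
    return t1, -slope


# Synthetic, noise-free data (not from the paper): T1 = 10 s, alpha = 0.4.
synthetic = [power_law_time(10.0, n, 0.4) for n in range(1, 9)]
t1_hat, alpha_hat = fit_power_law(synthetic)
print(round(t1_hat, 3), round(alpha_hat, 3))
```

On a log-log plot such data fall on a straight line with slope -alpha, which is the usual visual check that practice effects follow the law.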
Relevance
This paper delivers a cautionary message for assistive technology designers: more input channels do not automatically mean better interaction. The finding that physical and cognitive loads can quickly become excessive challenges the assumption that multimodal input inherently benefits users with disabilities. For practitioners designing accessible input systems today, the key takeaway is that regular user trials with the target population are essential, because assumptions about what should work often fail in practice. The study also shows that cognitive load, not just physical accessibility, must be treated as a design consideration. The observation that users preferred auditory over visual prompts under high cognitive load has implications for how assistive interfaces present instructions and feedback. Although the specific technology is dated, the insights about the tension between input flexibility and cognitive demand remain relevant to modern multimodal and gesture-based interfaces.
Tags: gesture recognition · multimodal input · motor impairment · user studies · cognitive load · head tracking · assistive technology · human-computer interaction