The Information-Theoretic Analysis of Unimodal Interfaces and their Multimodal Counterparts

Melanie Baljko · 2005 · Proceedings of the 7th International ACM SIGACCESS Conference on Computers and Accessibility (Assets '05) · doi:10.1145/1090785.1090793

Summary

This paper applies Shannon's information theory to formally analyse and quantify the hypothesised benefits of multimodal interfaces over unimodal ones, with particular focus on Augmentative and Alternative Communication (AAC) devices such as Voice Output Communication Aids (VOCAs). AAC devices combine text composition facilities (using icons, orthography, or other selection methods) with text-to-speech output to support people with little or no functional speech due to conditions such as cerebral palsy, ALS, or paralysis. The author argues that previous comparisons of unimodal and multimodal systems, notably by Keates and Robinson (1998), calculated information rate with a flawed formula that conflated it with a different metric. By applying the correct information-theoretic formulation, in which information rate is the reduction in uncertainty (entropy) at the receiver per unit time, Baljko demonstrates that the original criticism of multimodal interfaces was partly unfounded. The paper develops several computational applications (DeriveInfoRate, DerivePhiValue, DerivePhiSurface) to model classes of command recognition systems and to explore the relationships between vocabulary size, mean system recall, production latency, and information rate for both unimodal and multimodal systems.
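The definition of information rate quoted in the summary (reduction in receiver uncertainty per unit time) can be illustrated with a minimal Python sketch. This is not the paper's DeriveInfoRate application. It assumes a deliberately simple recogniser model that the summary does not specify: all M commands are equally likely, and a misrecognised command is equally likely to be confused with any of the other M - 1 commands. The function names and the example latency of 2 seconds are hypothetical.

```python
import math


def bits_per_command(vocab_size: int, recall: float) -> float:
    """Bits conveyed per command under the ASSUMED model:
    uniform command usage and uniformly spread recognition errors."""
    if vocab_size < 2:
        raise ValueError("vocab_size must be at least 2")
    if not 0.0 <= recall <= 1.0:
        raise ValueError("recall must lie in [0, 1]")
    error = 1.0 - recall
    # Uncertainty that remains at the receiver after a command has been
    # produced and recognised. Zero-probability outcomes contribute nothing.
    remaining = 0.0
    if recall > 0.0:
        remaining -= recall * math.log2(recall)
    if error > 0.0:
        remaining -= error * math.log2(error / (vocab_size - 1))
    # Prior uncertainty is log2(M); the difference is the reduction in entropy.
    return math.log2(vocab_size) - remaining


def information_rate(vocab_size: int, recall: float, latency_s: float) -> float:
    """Information rate in bits per second: entropy reduction per unit time."""
    if latency_s <= 0.0:
        raise ValueError("latency_s must be positive")
    return bits_per_command(vocab_size, recall) / latency_s


if __name__ == "__main__":
    # Hypothetical 6-command system with a 2-second production latency.
    for recall in (0.80, 0.90, 1.00):
        print(f"recall={recall:.2f}  rate={information_rate(6, recall, 2.0):.3f} bits/s")
```

Under these assumptions a perfect recogniser conveys log2(M) bits per command, and one that performs at chance (recall of 1/M) conveys none, which is the behaviour an entropy-based measure should show.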

Key findings

The analysis reveals that Keates and Robinson's formula misestimates the actual information rate of systems by anywhere from -21.1% to +28.0%, underestimating some systems and overestimating others, which fundamentally undermines their conclusion that multimodal systems offer little advantage. Using the correct information-theoretic formulation, the paper establishes a "10-25" heuristic: for systems with a small vocabulary of around 6 commands, a multimodal system needs less than a 10% improvement in system recall (recognition accuracy) to match the information rate of its unimodal counterpart, provided the multimodal input actions take no more than 25% longer to produce. For a larger vocabulary (M = 12 commands), less than a 10% improvement in recall (specifically 9.7%) is needed if production latencies are no more than 20% longer. This is significant because semantically redundant multimodal input, in which two input modes both signal the same command, exploits redundancy to improve recognition accuracy. The gesture recognition community has shown that such redundancy can be exploited in practice, although Keates and Robinson's own multimodal system failed to do so, showing a 9.2% decrease in recall rather than an improvement.
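The trade-off behind the heuristic (how much recall must improve to pay for slower input) can be explored with a short Python sketch. It is not the paper's DerivePhiValue or DerivePhiSurface application, and it should not be expected to reproduce the reported figures. It assumes equally likely commands and uniformly spread recognition errors, and the function names, the baseline recall of 0.80, and the bisection approach are all hypothetical choices made for illustration.

```python
import math
from typing import Optional


def bits_per_command(vocab_size: int, recall: float) -> float:
    """Bits per command, ASSUMING uniform usage and uniformly spread errors."""
    error = 1.0 - recall
    remaining = 0.0
    if recall > 0.0:
        remaining -= recall * math.log2(recall)
    if error > 0.0:
        remaining -= error * math.log2(error / (vocab_size - 1))
    return math.log2(vocab_size) - remaining


def break_even_recall(vocab_size: int, baseline_recall: float,
                      latency_ratio: float) -> Optional[float]:
    """Recall a slower system needs to match the baseline information rate.

    latency_ratio is multimodal latency divided by unimodal latency
    (1.25 means input actions take 25% longer). Returns None when even
    perfect recall cannot compensate for the extra latency.
    """
    if vocab_size < 2:
        raise ValueError("vocab_size must be at least 2")
    if not 1.0 / vocab_size <= baseline_recall <= 1.0:
        raise ValueError("baseline_recall must lie in [1/M, 1]")
    if latency_ratio < 1.0:
        raise ValueError("latency_ratio must be at least 1")
    # Equal rates means bits(new) / latency_ratio == bits(baseline).
    target = bits_per_command(vocab_size, baseline_recall) * latency_ratio
    if bits_per_command(vocab_size, 1.0) < target:
        return None
    # Bits per command rise with recall above chance, so bisection applies.
    low, high = baseline_recall, 1.0
    for _ in range(60):
        mid = (low + high) / 2.0
        if bits_per_command(vocab_size, mid) < target:
            low = mid
        else:
            high = mid
    return high


if __name__ == "__main__":
    baseline = 0.80  # hypothetical unimodal recall
    needed = break_even_recall(6, baseline, 1.25)
    if needed is not None:
        gain = 100.0 * (needed - baseline) / baseline
        print(f"needed recall={needed:.3f} (relative gain {gain:.1f}%)")
```

The required improvement depends on the baseline recall and on how errors are distributed, which is one reason to model whole classes of recognition systems rather than a single case.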

Relevance

This paper provides a rigorous mathematical foundation for evaluating whether adding input modalities to AAC and other assistive technology interfaces is actually beneficial, a question with practical implications for device design and procurement. For AAC practitioners and developers, the key insight is that multimodal input does not need dramatic improvements in recognition accuracy to justify the longer production time of more complex input actions; even modest recall improvements can offset slower input speeds. The framework generalises beyond AAC to any command recognition system where users might benefit from multiple input channels, including the voice control, gesture recognition, eye tracking, and switch access interfaces commonly used by people with physical disabilities. The work also cautions against drawing conclusions from flawed metrics, a reminder that remains relevant as the assistive technology field continues to evaluate increasingly complex multimodal interfaces.

Tags: multimodal interfaces · AAC · information theory · voice output communication aids · speech generating devices · interface evaluation · communication disorders