Multimodal User Input Patterns in a Non-Visual Context

Xiaoyu Chen, Marilyn Tremaine · 2005 · Proceedings of the 7th International ACM SIGACCESS Conference on Computers and Accessibility (Assets '05) · doi:10.1145/1090785.1090832

Summary

This exploratory study from the New Jersey Institute of Technology investigates how users choose between speech and hand (touchpad) input when performing tasks in a non-visual interface. The research used AudioBrowser, a system that organizes information items into hierarchies and allows access through both touchpad and speech input. The touchpad's sensing area is divided into three tracks with embossed boundaries, used for information browsing and command display, while speech input uses a 54-command vocabulary. Fourteen sighted volunteers participated individually in a lab setting and received only auditory output from the system. Each subject was taught both input modalities, with training order counterbalanced: half learned speech first and half learned the touchpad first. Subjects then performed representative tasks with AudioBrowser while the sessions were videotaped. A total of 1,462 input operations from the task-performance sessions were analysed, of which 635 (43.4%) involved speech and 1,007 (68.9%) involved the touchpad. The two figures sum to more than the total because operations in which users combined modalities were counted under both.
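The reported counts can be checked with a little arithmetic. The sketch below is only an illustration: it recomputes the two percentages from the figures in the summary and estimates how many operations were counted under both modalities. That estimate rests on an assumption the summary does not state, namely that every one of the 1,462 operations involved at least one of the two modalities, so the overlap is simply the excess of the two counts over the total. The function and constant names are hypothetical.

```python
# Counts as reported in the summary.
TOTAL_OPERATIONS = 1462
SPEECH_OPERATIONS = 635
TOUCHPAD_OPERATIONS = 1007


def share(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


def combined_operations(total: int, speech: int, touchpad: int) -> int:
    """Estimate the number of operations counted under both modalities.

    Assumption (not stated in the source): every operation involved at
    least one of the two modalities, so inclusion-exclusion applies.
    """
    return speech + touchpad - total


speech_share = share(SPEECH_OPERATIONS, TOTAL_OPERATIONS)
touchpad_share = share(TOUCHPAD_OPERATIONS, TOTAL_OPERATIONS)
both = combined_operations(TOTAL_OPERATIONS, SPEECH_OPERATIONS, TOUCHPAD_OPERATIONS)

print(f"speech share: {speech_share}%")
print(f"touchpad share: {touchpad_share}%")
print(f"implied combined-modality operations: {both}")
```

Under that assumption the two shares match the reported 43.4% and 68.9%, and the implied overlap is 180 operations. This is a derived estimate, not a figure given by the authors.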

Key findings

The study revealed three findings about non-visual multimodal input behaviour. First, users chose an input modality according to the type of operation: navigation operations (moving through hierarchies, reading items) were performed mainly with the touchpad (79% of 994 navigation operations), while non-navigation instructions (commands such as spelling or increasing the volume) were issued mainly by speech (63%). A paired t-test confirmed that this pattern was statistically significant (t(13)=4.352, p<0.001). Second, multimodal error correction was surprisingly rare. When an operation failed, users overwhelmingly preferred to repeat the failed method or to try another approach within the same modality rather than switch modalities. Only 11% of the 150 failures were corrected by switching to the other modality. Third, there was a training order effect: subjects who received speech training first tended to use more speech and less touchpad overall, although the modality learned first was not necessarily the primary modality used later. The touchpad-first group's tendency toward more touchpad use approached but did not reach statistical significance (t(5)=1.368, p=0.071).

Relevance

This study provides empirical evidence for designing multimodal non-visual interfaces, a topic that remains highly relevant as voice assistants and touchscreen devices become ubiquitous. The finding that users naturally partition input modalities by task type (touchpad for navigation, speech for commands) offers a practical design principle: non-visual systems should match each modality's strengths to the appropriate operation types rather than offer fully equivalent functionality in both. The rarity of cross-modal error correction suggests that designers should not rely on modality switching as a primary error recovery strategy; instead, each modality needs robust error handling of its own. The training order effect has implications for onboarding and tutorial design in assistive technology. Although the study used sighted participants (a limitation the authors acknowledge), it establishes baseline patterns of multimodal behaviour that inform the design of accessible interfaces for people with visual impairments.

Tags: multimodal interaction · non-visual interaction · speech input · touchpad · visual impairment · dialogue design · input modalities · human-computer interaction