GestureVoice: Enabling Multimodal Text Editing for Blind Users Using Gestures and Voice

Prerna Khanna, Sai Pravallika Reddy, IV Ramakrishnan, Xiaojun Bi, Aruna Balasubramanian · 2025 · ASSETS 2025: 27th International ACM SIGACCESS Conference on Computers and Accessibility · doi:10.1145/3663547.3746388

Summary

This paper introduces GestureVoice, a multimodal text editing system that enables blind smartphone users to edit text using mid-air hand gestures detected by a smartwatch combined with voice commands, eliminating the need for touchscreen interaction. The research addresses a well-documented pain point: while blind users can dictate text reasonably well using speech-to-text, correcting errors in that text using traditional screen reader methods like VoiceOver is extremely cumbersome, requiring dozens of swipes and multiple steps for even a single correction. A formative study with three blind participants confirmed that standard VoiceOver text editing required an average of 54 swipes and involved navigating complex rotor menus, selecting granularity levels, and performing precise touch gestures—all without visual feedback. GestureVoice reimagines this workflow through three integrated components. First, mid-air hand gestures recognized via the Apple Watch accelerometer and gyroscope allow users to select navigation granularity (character, word, or sentence level) through intuitive tilt and flick motions classified by a Random Forest algorithm. Second, an adaptive crown cursor uses the Apple Watch digital crown for precise navigation within the selected granularity, with speed that dynamically adjusts based on text length. Third, voice commands handle the actual corrections—users simply say "delete," "insert [text]," "replace [text]," or "change [text]" to make edits. The system was evaluated with eight blind participants across 160 trials involving three common text error types: substitutions, insertions, and omissions.
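
The split between navigation and voice-driven correction can be illustrated with a small Python sketch. This is not the authors' implementation, which the summary does not show. The function names, the text segmentation rules, the choice to insert before the unit under the cursor, and the treatment of "change" as equivalent to "replace" are all assumptions made for illustration.

```python
import re

VERBS = {"delete", "insert", "replace", "change"}  # command words named in the summary


def segment(text, granularity):
    """Split text into navigable units (hypothetical segmentation rules)."""
    if granularity == "character":
        return list(text)
    if granularity == "word":
        return text.split()
    if granularity == "sentence":
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    raise ValueError(f"unknown granularity: {granularity}")


def parse_command(utterance):
    """Split a spoken command into (verb, payload)."""
    verb, _, payload = utterance.strip().partition(" ")
    verb = verb.lower()
    if verb not in VERBS:
        raise ValueError(f"unsupported command: {verb}")
    return verb, payload.strip()


def apply_edit(text, granularity, cursor, utterance):
    """Apply one voice command to the unit at the cursor position."""
    units = segment(text, granularity)
    verb, payload = parse_command(utterance)
    if verb == "delete":
        del units[cursor]
    elif verb == "insert":
        # Assumption: new text goes before the unit under the cursor.
        units.insert(cursor, payload)
    else:
        # Assumption: "change" behaves like "replace" in this sketch.
        units[cursor] = payload
    separator = "" if granularity == "character" else " "
    return separator.join(units)


fixed = apply_edit("I went too the store", "word", 2, "replace to")
print(fixed)
```

In this sketch the cursor index stands in for the position the user reaches with a gesture-selected granularity and the digital crown; the voice command then only has to name the action and the new text.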

Key findings

GestureVoice achieved a 53.80% reduction in total text editing time compared to default screen reader editing, cutting average correction time from roughly 17 minutes to about 6.5 minutes for a document with 10 errors. Selection time improved by 39.69% and correction time by 40.66%. The system achieved a 100% task completion rate versus 98% for default VoiceOver editing. Performance improvements were consistent across error types: substitution corrections took 66% less time (42 vs. 127 seconds), insertion corrections 66% less (32 vs. 95 seconds), and omission corrections 49% less (42 vs. 83 seconds). Gesture detection accuracy averaged 97.5% across participants, with a gesture latency of just 11 milliseconds on the smartphone. Word-level navigation was overwhelmingly preferred, used in 98.75% of trials. Delete commands were most frequent (41.3% of corrections), followed by insert (25.8%) and replace (24.7%). The LLM-based auto-correction feature saw minimal use (2.5% of trials), suggesting users need more time to adopt AI-assisted features. All participants preferred GestureVoice over default screen reader editing, rating it more favorably on ease of use, physical demand, mental demand, error navigation, and error correction on a 7-point scale. Users reported that the system was less physically and mentally demanding, and the mean preference score for choosing GestureVoice over the default rotor was 6.75 out of 7.
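
The per-error-type percentages can be recomputed from the reported times as a relative reduction, (baseline - new) / baseline. The Python sketch below does this for the three error types; the function and variable names are hypothetical, and because the times in this summary are rounded to whole seconds, the recomputed values may differ slightly from figures derived from the paper's raw data.

```python
def percent_reduction(baseline_s, new_s):
    """Relative time saved, as a percentage of the baseline time."""
    return (baseline_s - new_s) / baseline_s * 100


# (default VoiceOver seconds, GestureVoice seconds), as reported in the summary
reported_times = {
    "substitution": (127, 42),
    "insertion": (95, 32),
    "omission": (83, 42),
}

reductions = {
    error_type: percent_reduction(baseline, new)
    for error_type, (baseline, new) in reported_times.items()
}

for error_type, value in reductions.items():
    print(f"{error_type}: {value:.1f}% less time")
```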

Relevance

Text editing remains one of the most frustrating daily tasks for blind smartphone users, and GestureVoice demonstrates that combining gesture-based navigation with voice commands can dramatically improve this experience. The 53.80% time reduction is substantial enough to change how blind users approach everyday writing tasks like composing messages, emails, and social media posts. The research validates an important design principle: separating navigation (gestures) from editing actions (voice) reduces cognitive load by giving each modality a clear, distinct purpose. For practitioners, this work highlights the untapped potential of commodity smartwatches as accessibility input devices—no special hardware is needed. The finding that word-level navigation dominated usage (98.75%) offers practical guidance for designing text navigation interfaces for blind users. Limitations include the small sample size (8 participants), controlled laboratory conditions, and potential challenges with voice input in noisy or public environments. Future work should explore customizable gestures, alternative wearables like rings, and real-world mobile usage scenarios.

Tags: text editing · blind users · multimodal interaction · gesture recognition · voice commands · smartwatch · wearable technology · screen readers · mobile accessibility · eyes-free interaction