Automatically Generating and Improving Voice Command Interface from Operation Sequences on Smartphones

Lihang Pan, Chun Yu, JiaHui Li, Tian Huang, Xiaojun Bi, Yuanchun Shi · 2022 · Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems · doi:10.1145/3491102.3517459

Summary

This paper presents AutoVCI, a system that automatically generates voice command interfaces (VCIs) for smartphone tasks from recorded touch operation sequences, enabling hands-free and eyes-free interaction without requiring programming expertise, corpora, or hand-written rules. A VCI designer simply demonstrates how to complete a task via touch (e.g., making a video call in Messenger involves launching the app, clicking People, selecting a contact, and clicking Video Call), and AutoVCI generates a voice interface that maps natural language commands to those GUI operations. The system uses pre-trained BERT models to calculate semantic vectors from GUI element text, matches user commands to tasks via cosine similarity, identifies parameters (list parameters from GUI lists, text parameters from input fields) during runtime execution, and launches complementary Q&A dialogues when ambiguity arises. A key innovation is semantic accumulation: the system learns from each interaction, storing command templates and updating semantic vectors so that subsequent commands for the same task require fewer clarification questions. The system was implemented on Android using the Accessibility Service API and tested across 11 applications with 45 tasks.
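The matching and accumulation steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors stand in for BERT-derived semantic vectors, and the function names, the 0.8 threshold, and the 0.5 update rate are all hypothetical choices made for illustration. The sketch picks the task whose vector is most similar to the command vector, falls back to a clarification dialogue when the similarity is too low, and then moves the task vector toward the confirmed command so that a repeat of the command matches directly.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_task(command_vec, task_vecs, threshold=0.8):
    """Return (task, score) for the most similar task.

    task is None when the best score is below the (assumed) threshold,
    which is where a Q&A clarification dialogue would be launched.
    """
    best_task, best_score = None, -1.0
    for task, vec in task_vecs.items():
        score = cosine(command_vec, vec)
        if score > best_score:
            best_task, best_score = task, score
    if best_score < threshold:
        return None, best_score
    return best_task, best_score


def accumulate(task_vec, command_vec, rate=0.5):
    """Shift a task vector toward a confirmed command vector (assumed update rule)."""
    return [(1 - rate) * t + rate * c for t, c in zip(task_vec, command_vec)]


# Toy stand-ins for semantic vectors computed from GUI element text.
task_vecs = {
    "video_call": [1.0, 0.0, 0.0],
    "send_message": [0.0, 1.0, 0.0],
}
command_vec = [0.6, 0.5, 0.2]

# First attempt: similarity is too low, so the user would be asked to clarify.
first_task, first_score = match_task(command_vec, task_vecs)

# Suppose the user confirms "video_call"; fold the command into that task's vector.
task_vecs["video_call"] = accumulate(task_vecs["video_call"], command_vec)

# The same command now clears the threshold without extra dialogue.
second_task, second_score = match_task(command_vec, task_vecs)
```

In this toy run the first lookup returns no task, and the second returns "video_call" once the confirmed command has been folded into the task vector, which mirrors the reduction in clarification rounds the paper reports. The paper also stores command templates, which this sketch omits.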

Key findings

In Phase 1 with 16 participants, AutoVCI achieved a 98.4% success rate across 640 commands. The system demonstrated continuous self-improvement: the success rate rose from 95% for the first user to 100% from the eleventh user onward, and the average number of extra dialogue rounds decreased from 2.1 to 0.7. By the last participant, over 40% of commands executed directly without any extra dialogue, and 80% required at most one round. Generating the VCI for 45 tasks across 11 apps took only 31 minutes in total: about 13 seconds per task to record the operation sequences (roughly 10 minutes for all 45 tasks) plus 21 minutes to inspect and correct. Parametric operation identification accuracy was 98.5%. In Phase 2 with 67 online participants, semantic accumulation continued to improve performance: about 75% of commands could be matched by templates in the last measurement window, with only 0.4 extra dialogue rounds on average. Subjective feedback on a 7-point scale averaged 6.06 for question understandability and 6.0 for willingness to use. Notably, only 6 of the 45 tasks were supported by Siri or Google Assistant, demonstrating AutoVCI's ability to extend voice control to the vast majority of smartphone tasks that commercial assistants cannot handle.

Relevance

This research has significant accessibility implications despite not being explicitly framed as accessibility work. The ability to create voice command interfaces for any smartphone task without programming directly addresses the needs of users with motor disabilities, visual impairments, or situational impairments who need hands-free and eyes-free interaction. The use case scenario in the paper explicitly describes a non-programmer creating a VCI for elderly parents who have difficulty using smartphone touchscreens, a common accessibility use case. The programming-by-demonstration approach dramatically lowers the barrier to creating custom voice interfaces, potentially enabling caregivers, assistive technology specialists, or users themselves to voice-enable any app task. The system's use of Android's Accessibility Service API to read GUI layouts and simulate touch events builds on the same infrastructure used by screen readers, suggesting natural integration opportunities. For the accessibility community, AutoVCI represents a shift from relying on app developers to implement accessibility toward empowering end users to create their own accessible interaction pathways.

Tags: voice command interface · mobile accessibility · speech recognition · natural language understanding · programming by demonstration · smartphone automation · machine learning · GUI accessibility