Pushpak: Voice Command-based eBook Navigator

Shradha Holani, Akashdeep Bansal, M Balakrishnan · 2019 · Proceedings of the 16th International Web for All Conference (W4A) · doi:10.1145/3315002.3332445

Summary

This demonstration paper presents Pushpak, a voice command-based eBook navigator designed to reduce the steep learning curve associated with screen reader software. The authors identify a key accessibility barrier: effective use of screen readers like NVDA and JAWS requires memorizing numerous keyboard shortcuts and touch gestures, which is particularly challenging for beginners and even more difficult on touchscreen devices where gesture sets are limited and lack semantic intuition. For STEM digital books containing mathematical equations, tables, diagrams, and flowcharts, the number of required navigation commands increases substantially, compounding the cognitive load. Pushpak addresses this by allowing users to speak natural language commands rather than remembering specific keystrokes or gestures. The system architecture has three modules: speech-to-text conversion using Google Automatic Speech Recognition, an intent understanding module using Microsoft LUIS (Language Understanding Intelligent Service) that interprets the user's meaning regardless of exact phrasing (e.g., "skip this paragraph" and "go to the next paragraph" are understood as the same intent), and a command execution module that carries out the action. Two execution approaches were tested: direct NVDA function calls via an add-on, and keystroke simulation using Python's keyboard module. The system was developed as part of the RAVI (Reading Assistant for Visually Impaired) project at IIT Delhi's AssisTech Lab, funded by India's Ministry of Human Resource Development.
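The keystroke-simulation path described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The intent names, the keyword-based `toy_intent` function (a stand-in for the LUIS call, which is not reproduced here), and the specific keystrokes in `INTENT_TO_KEYS` are all hypothetical. Speech-to-text is omitted, so the sketch starts from an already transcribed utterance. The only third-party call is `keyboard.send` from the `keyboard` package, imported lazily so that the mapping logic can be exercised without it.

```python
import re
from typing import Callable, Optional

# Hypothetical mapping table: intent name -> keystroke to simulate.
# The keys are illustrative; a real table would follow the target
# screen reader's documented shortcuts.
INTENT_TO_KEYS = {
    "next_paragraph": "ctrl+down",
    "next_heading": "h",
    "previous_heading": "shift+h",
    "next_table": "t",
}


def toy_intent(utterance: str) -> Optional[str]:
    """Keyword-based stand-in for a cloud intent service (assumption)."""
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    target = next((t for t in ("paragraph", "heading", "table") if t in words), None)
    if target is None:
        return None
    if words & {"previous", "back"}:
        return f"previous_{target}"
    if words & {"next", "skip"}:
        return f"next_{target}"
    return None


def send_with_keyboard_module(keys: str) -> None:
    """Simulate the keystroke with the third-party `keyboard` package."""
    import keyboard  # imported lazily; may need elevated privileges on some OSes

    keyboard.send(keys)


def execute(utterance: str,
            send_keys: Callable[[str], None] = send_with_keyboard_module) -> Optional[str]:
    """Map a transcribed utterance to a keystroke and send it.

    Returns the keystroke sent, or None if no intent or mapping matched.
    """
    intent = toy_intent(utterance)
    keys = INTENT_TO_KEYS.get(intent) if intent else None
    if keys is None:
        return None
    send_keys(keys)
    return keys
```

Passing a different `send_keys` callable, for example `print`, shows which keystroke each phrasing resolves to without touching the keyboard. Because only the mapping table refers to screen-reader shortcuts, swapping screen readers would mean swapping the table, which is the independence property the summary attributes to this approach.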

Key findings

A time analysis averaged over 5 voice commands showed a total processing time of approximately 6-7 seconds per command (4.76 seconds for speech-to-text, 1.7 seconds for intent understanding, and 0.1-0.15 seconds for command execution), compared to 0.15 seconds for a direct keystroke press. The keystroke simulation approach was slightly faster than the NVDA function call approach (6.07 vs 6.86 seconds total) and has the advantage of being screen-reader-independent, requiring only a mapping table of keystrokes. The function call approach requires integration into the screen reader itself, making it impractical for proprietary screen readers. The authors identified several challenges: dependence on an internet connection for both speech recognition and intent understanding, latency that makes the system significantly slower than keyboard shortcuts, and the need for on-device models that match the accuracy of the cloud-based services. Proposed future enhancements include context-dependent commands (e.g., "skip this too" referring to a previous command), handling multiple commands in a single utterance, and extending beyond Windows eBook navigation to Android and other applications.
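The size of the latency gap follows directly from the reported totals. The short calculation below is a sketch that uses only the figures quoted above (6.07 and 6.86 seconds per voice command, 0.15 seconds per direct keystroke); the variable and function names are illustrative.

```python
# Reported averages from the summary, in seconds.
KEYSTROKE_S = 0.15
TOTALS_S = {
    "keystroke simulation": 6.07,
    "NVDA function call": 6.86,
}


def slowdown(total_s: float, baseline_s: float = KEYSTROKE_S) -> float:
    """How many times slower a voice command is than a direct keystroke."""
    return total_s / baseline_s


for name, total in TOTALS_S.items():
    print(f"{name}: {total:.2f} s, about {slowdown(total):.0f}x a direct keystroke")
```

On these numbers a voice command takes roughly 40 to 46 times as long as pressing the shortcut directly.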

Relevance

This proof-of-concept addresses a genuine usability barrier in assistive technology: the cognitive overhead of memorizing screen reader commands. While experienced screen reader users develop fluency with keyboard shortcuts, beginners, particularly those navigating complex STEM content, face a significant learning curve that can discourage adoption. Using natural language intent understanding rather than requiring exact command phrases is more forgiving and intuitive, and it matches how mainstream voice assistants such as Siri and Google Assistant already work. For accessibility practitioners, this work highlights the potential of applying natural language processing to make assistive technology more approachable. The trade-off between speed (keyboard shortcuts are roughly 40 to 46 times faster, given reported totals of 6.07 to 6.86 seconds against 0.15 seconds for a keystroke) and discoverability (voice commands require no memorization) is a classic accessibility tension. The system could be most valuable as a learning scaffold that helps new screen reader users navigate while they gradually learn keyboard shortcuts for efficiency. Voice navigation could also serve users with motor impairments and hands-busy scenarios, which illustrates how accessible design can benefit user groups beyond the primary target audience.

Tags: voice control · screen reader · NVDA · eBook · natural language understanding · visual impairment · STEM accessibility · assistive technology · speech recognition · touchscreen accessibility