VizLens: A Robust and Interactive Screen Reader for Interfaces in the Real World

Anhong Guo, Xiang "Anthony" Chen, Haoran Qi, Samuel White, Suman Ghosh, Chieko Asakawa, Jeffrey P. Bigham · 2016 · Proceedings of the 29th Annual Symposium on User Interface Software and Technology (UIST 2016) · doi:10.1145/2984511.2984518

Summary

This paper introduces VizLens, a system that functions as a screen reader for physical interfaces in the real world, enabling blind people to independently use inaccessible appliances such as microwaves, thermostats, kiosks, and checkout terminals. VizLens integrates crowdsourcing with real-time computer vision in a two-phase approach. In the first phase, a blind user photographs an unfamiliar interface and sends it to crowd workers on Mechanical Turk, who work in parallel to segment the interface region and label each button or control element. In the second phase, when the user wants to operate the interface, the VizLens app uses SURF feature detection and a RANSAC-based perspective transformation to match the live camera feed against the crowd-labeled reference image in real time (8 fps with 200 ms latency). The system detects the user's fingertip via skin color thresholding in HSV color space and speaks the label of whichever button is beneath the finger. VizLens offers two interaction modes: feedback mode, which speaks what is under the finger as the user explores, and guidance mode, which gives directional instructions for reaching a selected target button. A formative study with 6 blind participants identified key design requirements, and the system was implemented as an iOS app with a GPU-accelerated backend on AWS.
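The second phase can be illustrated with a minimal sketch. It assumes the 3x3 perspective transform (homography) from the camera frame to the reference image has already been estimated and that the fingertip position in the camera frame is known; feature matching and skin detection are not shown. The names `Button`, `to_reference`, `feedback`, and `guidance` are hypothetical and chosen for illustration, not taken from the paper, and the "largest offset wins" rule for directions is an assumption rather than the authors' method.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class Button:
    """A crowd-labeled rectangle in reference-image pixel coordinates (hypothetical type)."""
    label: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

    def contains(self, px, py):
        return self.x <= px <= self.x + self.w and self.y <= py <= self.y + self.h

    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)


def to_reference(H, point):
    """Map a camera-frame point into reference-image coordinates with homography H."""
    x, y = point
    v = H @ np.array([x, y, 1.0])
    return (v[0] / v[2], v[1] / v[2])  # divide out the homogeneous coordinate


def feedback(buttons, H, fingertip):
    """Feedback mode: return the label under the fingertip, or None if there is none."""
    px, py = to_reference(H, fingertip)
    for b in buttons:
        if b.contains(px, py):
            return b.label
    return None


def guidance(buttons, H, fingertip, target_label):
    """Guidance mode: return a direction toward the target, or its label once reached.

    Assumption: the direction is chosen along the axis with the larger offset.
    Image coordinates are used, so y grows downward.
    """
    px, py = to_reference(H, fingertip)
    target = next(b for b in buttons if b.label == target_label)
    if target.contains(px, py):
        return target.label
    cx, cy = target.center()
    dx, dy = cx - px, cy - py
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "down" if dy > 0 else "up"


# Example: the camera frame is twice the scale of the reference image.
H = np.diag([0.5, 0.5, 1.0])
buttons = [Button("Start", 0, 0, 50, 50), Button("Stop", 60, 0, 50, 50)]
```

With these example values, a fingertip at camera position (40, 40) maps to (20, 20) in the reference image, so `feedback` returns "Start", while `guidance` toward "Stop" from the same position returns "right".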

Key findings

In a user study with 10 blind participants using an inaccessible microwave, VizLens achieved 96% button identification accuracy across 250 controlled trials, with all errors concentrated in the top region where the user's hand occluded too many feature points. For locating tasks, guidance mode achieved a significantly higher completion rate (98%) than feedback mode (82%), while completion times were similar (~55 seconds). For simulated cooking tasks, both modes performed well with no significant differences (90% vs 100% completion, ~102 vs ~120 seconds). The crowdsourcing workflow was fast (8 minutes average), highly accurate (99.7% of buttons correctly labeled), and cheap ($1.15 per interface). VizLens worked robustly across diverse skin colors and lighting conditions, and was successfully tested on 10 different interface types including microwaves, printers, copiers, thermostats, toasters, vending machines, remote controls, and laser cutters. VizLens v2 extended the system with state detection for dynamic interfaces (e.g., a 6-screen coffeemaker), OCR integration for reading LCD displays, a touchscreen spatial layout for mental model building, and initial support for head-mounted cameras (Google Glass) to free the user's hands.

Relevance

VizLens represents a significant advance in making physical interfaces accessible to blind people, addressing a problem that affects daily independence, from using a microwave to operating an office copier. The system's architecture demonstrates a powerful design pattern: use human intelligence (crowdsourcing) for the hard one-time task of understanding an interface, then use computer vision for the repeated real-time task of providing interactive feedback. This "crowd once, use many times" model is both cost-effective and scalable, as reference images labeled for one user can benefit all subsequent users of the same device. For accessibility practitioners, VizLens highlights that physical interface accessibility remains a critical gap even as digital accessibility improves: fewer than 10% of blind Americans read Braille, so tactile labels are insufficient as a universal solution. The system's extensions to dynamic interfaces (state detection, OCR for LCD displays) address the increasingly complex interfaces found on modern appliances. The finding that 56.7% of photos taken by blind participants failed quality checks underscores the ongoing challenge of blind photography and the potential value of wearable cameras for accessibility applications.
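The "crowd once, use many times" pattern can be sketched as a simple cache in front of the labeling step. This is an illustrative sketch only: `ReferenceStore`, `crowd_label`, and the string `interface_id` key are hypothetical, and the summary does not describe how the real system recognizes that two users are facing the same device. The $1.15 figure is the per-interface labeling cost reported under Key findings.

```python
from typing import Callable, Dict


class ReferenceStore:
    """Caches crowd-produced labels so each interface is labeled only once (hypothetical)."""

    def __init__(self, crowd_label: Callable[[str], dict]):
        self._crowd_label = crowd_label  # assumed slow, paid, one-time labeling step
        self._cache: Dict[str, dict] = {}
        self.crowd_calls = 0  # how many times the crowd was actually asked

    def get(self, interface_id: str) -> dict:
        """Return labels for an interface, asking the crowd only on the first request."""
        if interface_id not in self._cache:
            self.crowd_calls += 1
            self._cache[interface_id] = self._crowd_label(interface_id)
        return self._cache[interface_id]


def amortized_cost(cost_per_interface: float, uses: int) -> float:
    """Labeling cost per use when one labeled reference serves `uses` sessions."""
    return cost_per_interface / uses


# Example: a stand-in labeling function and three lookups of the same device.
store = ReferenceStore(lambda interface_id: {"buttons": ["Start", "Stop"]})
for _ in range(3):
    labels = store.get("break-room-microwave")
```

After the three lookups above, `store.crowd_calls` is 1, and spreading the $1.15 labeling cost over ten uses gives `amortized_cost(1.15, 10)`, or about $0.115 per use.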

Tags: visual accessibility · computer vision · crowdsourcing · blind users · physical interfaces · screen reader · assistive technology