RegionSpeak: Quick Comprehensive Spatial Descriptions of Complex Images for Blind Users

Yu Zhong, Walter S. Lasecki, Erin Brady, Jeffrey P. Bigham · 2015 · Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI 2015) · doi:10.1145/2702123.2702437

Summary

This paper introduces RegionSpeak, a mobile application that helps blind users get comprehensive spatial descriptions of complex visual scenes by combining image stitching with parallelized crowdsourced labeling. The authors identify a gap in existing visual question-answering systems like VizWiz, which use a single-image, single-response model that struggles when users need information about spatial layouts or large sets of objects. RegionSpeak addresses this through three innovations: it allows users to capture wider visual areas by automatically stitching multiple photos into a panorama, it distributes labeling tasks to multiple crowd workers in parallel so each worker describes a single region of the image, and it provides an accessible touchscreen interface for blind users to spatially explore the labeled regions. The system was evaluated through a user study with 10 blind participants comparing stitching to single-photo interfaces, an analysis of crowd worker description quality using both iterative and parallelized workflows, and a live trial simulating real-world usage. The research builds on the VizWiz platform and Chorus:View system, positioning RegionSpeak as a middle ground between cheap single-question services and expensive continuous-interaction systems.
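The parallelized labeling step can be sketched in a few lines. This is only an illustration of the fan-out pattern described above, not the authors' implementation: `collect_labels`, `make_stub_worker`, the dictionary fields, and the placeholder label text are all hypothetical, and the stub functions stand in for human crowd workers who would each mark and describe one region of the stitched image.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def collect_labels(image_id, workers):
    """Send the same image to every worker at once; gather one label per worker.

    Each worker is a callable taking an image id and returning one labeled
    region. Results are collected in completion order, not submission order.
    """
    labels = []
    with ThreadPoolExecutor(max_workers=max(1, len(workers))) as pool:
        futures = [pool.submit(worker, image_id) for worker in workers]
        for future in as_completed(futures):
            labels.append(future.result())
    return labels


def make_stub_worker(box, text):
    """Build a stand-in for a crowd worker (hypothetical; real workers are people)."""
    def worker(image_id):
        # box is (left, top, width, height) in image pixels
        return {"image": image_id, "box": box, "text": text}
    return worker


# Placeholder labels, invented purely for illustration.
workers = [
    make_stub_worker((0, 0, 50, 50), "placeholder label A"),
    make_stub_worker((60, 0, 40, 80), "placeholder label B"),
    make_stub_worker((0, 90, 120, 30), "placeholder label C"),
]
labels = collect_labels("panorama-1", workers)
```

Because every worker handles a single region, no one worker has to describe the whole scene, which is the property the summary credits for spreading the workload.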

Key findings

The stitching interface significantly outperformed the single-photo approach, reducing task completion time by 35.4% (average 121.1 seconds vs. 187.4 seconds) and requiring fewer question-answer iterations (1.48 vs. 2.24 on average). All 10 blind participants preferred the stitching interface and found it easy to learn. The stitching algorithm achieved an 83.3% success rate given the blind participants' camera skills. For crowd worker descriptions, the parallelized approach yielded an average of 5.0 distinct items per session with 4.8 descriptive details and 4.6 spatial cues, compared with the iterative approach, which produced more text but distributed the workload less evenly. Workers in the parallel condition selected accurate bounding boxes 72% of the time, with only 10% rated as incorrect. In a live trial, 10 crowd workers produced descriptions covering 11 distinct objects with 12 spatial cues in about 1 minute and 5 seconds on average, comparable to VizWiz response times but with far richer spatial information.
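As a quick check, the reported time reduction follows directly from the two averages. The short snippet below recomputes it; the function and variable names are ours, and only the two times come from the summary.

```python
def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


single_photo_s = 187.4  # average completion time, single-photo interface
stitching_s = 121.1     # average completion time, stitching interface

reduction = relative_reduction(single_photo_s, stitching_s)
print(f"{reduction:.1%}")  # prints 35.4%
```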

Relevance

RegionSpeak demonstrates a practical approach to a persistent challenge in visual accessibility: helping blind users understand the spatial layout of complex environments. While the system relies on human crowd workers rather than AI, its design principles (parallelized description tasks, region-based labeling, and accessible spatial exploration interfaces) remain highly relevant to modern AI-powered image description tools. The touchscreen exploration interface, where users slide a finger across the screen to hear descriptions of spatially anchored regions, offers a compelling interaction model for any system that generates spatial descriptions. For accessibility practitioners, this research highlights that comprehensive image descriptions need to go beyond listing objects to include spatial relationships and contextual details. The work also shows that breaking complex visual scenes into smaller regions produces more detailed and accurate descriptions, a principle applicable to both human and automated description workflows.
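The touch exploration model described above amounts to a hit test: given a finger position, find the labeled region under it and speak its description. The sketch below is one minimal way to do that, under stated assumptions. The `Region` structure and `describe_at` are hypothetical, and preferring the smallest region when boxes overlap is our choice for illustration, not something the paper is known to specify.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Region:
    """A labeled bounding box in screen coordinates (hypothetical structure)."""
    left: float
    top: float
    width: float
    height: float
    description: str

    def contains(self, x: float, y: float) -> bool:
        """True if the point lies inside the box, edges included."""
        return (self.left <= x <= self.left + self.width
                and self.top <= y <= self.top + self.height)

    @property
    def area(self) -> float:
        return self.width * self.height


def describe_at(x: float, y: float, regions: Sequence[Region]) -> Optional[str]:
    """Return the description to speak for a touch at (x, y), or None.

    Assumption: when several regions overlap under the finger, the smallest
    one is treated as the most specific and wins.
    """
    hits = [r for r in regions if r.contains(x, y)]
    if not hits:
        return None
    return min(hits, key=lambda r: r.area).description


# Illustrative regions only; the labels are invented for this example.
regions = [
    Region(0, 0, 100, 100, "shelf"),
    Region(10, 10, 20, 20, "mug on the shelf"),
]
```

In a real application the returned string would be handed to the platform's screen reader or speech API each time the finger crosses into a new region; that wiring is omitted here.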

Tags: visual accessibility · crowdsourcing · image description · spatial awareness · blind users · touchscreen accessibility · assistive technology