Answering Visual Questions with Conversational Crowd Assistants

Walter S. Lasecki, Phyo Thiha, Yu Zhong, Erin Brady, Jeffrey P. Bigham · 2013 · Proceedings of the 15th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2013) · doi:10.1145/2513383.2517033

Summary

This paper introduces Chorus:View, a system that enables blind users to get visual questions answered through continuous conversational interaction with multiple crowd workers viewing a live video stream from the user's mobile device. The system addresses key limitations of existing visual question-answering tools like VizWiz, which rely on single static images and one-off interactions with individual workers. Analysis of one month of VizWiz data revealed that 18% of questions belonged to sequences of related queries (averaging 3.7 questions over 10 minutes), a pattern poorly served by single-image approaches. Common tasks requiring sequential interaction include reading food packaging (finding the product, reading the type, then reading cooking instructions), locating dropped objects, and navigating using visual cues. Chorus:View streams video and audio from the user's iPhone to crowd workers recruited via Mechanical Turk, who collaborate through a group chat interface. Workers can both propose responses and vote on them, with an incentive mechanism that rewards agreement. The system uses OpenTok for video streaming, lets workers capture still screenshots to examine details, and delivers responses via VoiceOver text-to-speech. The design evolved through iterative user-centered development, including preliminary tests with 6 blind users and Wizard-of-Oz studies with student workers.
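
The propose-and-vote step described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the system's actual mechanism: the class name `AgreementChat`, the `min_votes` threshold, and the point values are all hypothetical, chosen to show how rewarding agreement might work in principle.

```python
from dataclasses import dataclass, field


@dataclass
class Proposal:
    author: str
    text: str
    voters: set = field(default_factory=set)


class AgreementChat:
    """Toy propose-and-vote filter. Threshold and point values are hypothetical."""

    def __init__(self, workers, min_votes=2, author_points=3, voter_points=1):
        self.min_votes = min_votes          # assumed: votes needed to forward a response
        self.author_points = author_points  # assumed reward for the accepted author
        self.voter_points = voter_points    # assumed reward for agreeing voters
        self.points = {w: 0 for w in workers}
        self.forwarded = []                 # responses sent on to the user

    def propose(self, worker, text):
        # The author counts as the first supporter of their own proposal.
        proposal = Proposal(worker, text, {worker})
        self._maybe_forward(proposal)
        return proposal

    def vote(self, worker, proposal):
        if proposal.text in self.forwarded:
            return  # already accepted; late votes earn nothing
        proposal.voters.add(worker)
        self._maybe_forward(proposal)

    def _maybe_forward(self, proposal):
        if len(proposal.voters) >= self.min_votes:
            self.forwarded.append(proposal.text)
            self.points[proposal.author] += self.author_points
            for voter in proposal.voters - {proposal.author}:
                self.points[voter] += self.voter_points


chat = AgreementChat(["a", "b", "c"])
answer = chat.propose("a", "It is a can of tomato soup.")
chat.vote("b", answer)  # second supporter reaches the threshold
chat.vote("c", answer)  # arrives after acceptance, so it is ignored
print(chat.forwarded, chat.points)
```

In this sketch a response reaches the user only once a second worker agrees with it, and both the author and the agreeing voter are credited, which is one simple way an agreement-based incentive could be structured.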

Key findings

Chorus:View achieved a 95% task completion rate compared to only 40% for VizWiz across product detail, sequential information finding, and navigation tasks, a statistically significant difference (F(1,36) = 22.22, p < 0.001). Using multiple crowd workers reduced mean time to first response from 45.1 seconds (single worker) to 17.3 seconds, a significant 61.6% drop (p < 0.05). Average time to final response decreased from 222.5 seconds to 111.7 seconds (49.8% improvement, p < 0.05). In the user study with 10 blind users (ages 21-43, 6 of them regular VizWiz users), Chorus:View completed product detail tasks in 295 seconds versus 440.7 seconds for VizWiz, sequential information finding in 351.2 versus 406.8 seconds, and navigation tasks in 182.3 seconds (VizWiz could not be tested for navigation because it was not designed for it). Users gave Chorus:View a median rating of 6.0 on a 7-point Likert scale for both task types, compared to 3.0 and 2.0 for VizWiz. The video stream, despite having lower resolution than VizWiz photos, produced better accuracy because workers could guide camera positioning through real-time feedback and maintain context across questions.
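
The reported percentage improvements follow directly from the response times quoted above. A short check, using only those figures (the helper name `percent_reduction` is just an illustrative choice):

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, expressed as a percentage."""
    return (before - after) / before * 100


# Mean time to first response: single worker vs. multiple workers (seconds)
first_response = percent_reduction(45.1, 17.3)

# Mean time to final response (seconds)
final_response = percent_reduction(222.5, 111.7)

print(f"first response: {first_response:.1f}%")  # first response: 61.6%
print(f"final response: {final_response:.1f}%")  # final response: 49.8%
```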

Relevance

Chorus:View represents a significant evolution in crowd-powered visual assistance for blind users, moving from discrete question-answer interactions to continuous conversational support. This conversational model more closely mirrors how sighted assistance works in practice: a sighted companion does not just answer one question, but provides ongoing guidance. For accessibility practitioners, the key insights are: (1) real-world visual tasks often require sequential, contextual interactions that single-image tools cannot support; (2) video streams with real-time feedback can outperform higher-resolution static images because workers can guide framing; and (3) multiple workers produce faster and more accurate responses than single workers by collectively proposing and vetting answers. The system prefigures modern AI visual assistance tools like Be My Eyes' GPT-4 integration, and the finding that conversational interaction with camera guidance substantially improves task completion has direct implications for designing AI-powered visual assistance. The collaborative worker dynamics (using "we" language, discussing among themselves, being thanked by users) suggest that crowd-powered assistance creates more engaging and reliable help than isolated one-off interactions.

Tags: blind and low vision · crowdsourcing · human computation · assistive technology · visual assistance · VizWiz · conversational interface · mobile accessibility