Using Real-time Feedback to Improve Visual Question Answering

Yu Zhong, Phyo Thiha, Grant He, Walter Lasecki, Jeffrey Bigham · 2012 · CHI EA '12: CHI '12 Extended Abstracts on Human Factors in Computing Systems · doi:10.1145/2212776.2223834

Summary

This work-in-progress paper introduces Legion:View, a system that extends the VizWiz model of crowd-powered visual question answering by adding a real-time feedback loop between blind users and crowd workers. The original VizWiz allowed blind users to take a still photograph, record an audio question, and receive answers from crowd workers, but the interaction was one-shot. If the photo was poorly framed, blurry, or showed the wrong side of an object, the user had to take a new photo, record a new question, and wait for a new set of workers to be recruited and respond. As the paper illustrates, this iterative process could take over 10 minutes for seemingly simple tasks like reading cooking instructions on a box. Legion:View addresses this by streaming live video from the user's iPhone camera to crowd workers recruited from Mechanical Turk, enabling continuous, real-time interaction. Workers view the video stream, listen to the user's audio question, and can provide two types of feedback: camera adjustment directions (move left, right, up, down, zoom in/out), which an input mediator merges using epoch-based voting, and text answers to questions, which are forwarded directly to the user. Users can also record follow-up questions at any point without restarting the session, allowing natural conversational interaction with the same group of workers, who already have context about the task and environment.

Key findings

An informal evaluation using five students as crowd workers tested three tasks: identifying a canned good (requiring the user to rotate the can to show the label), finding and reading a sign in a room with obstacles (requiring navigation guidance), and locating a specific TV dinner in a freezer and reading its cooking instructions. Workers successfully guided the user through all three tasks using the real-time feedback mechanism. In the single-object identification task, workers directed the user to rotate the can until the label was visible and then answered the question. In the navigation task, workers guided the user from facing a blank wall to finding and reading a sign on the far wall. In the freezer task, workers helped the user locate the correct item among many and read its preparation instructions. The input mediator aggregated camera direction feedback by taking votes over short time intervals (epochs) and forwarding the majority-agreed direction, treating a worker's inactivity as a vote to keep the current orientation rather than as a sign of laziness. The paper notes that answers to open-ended questions were forwarded directly rather than aggregated, since achieving real-time convergence on free-text responses is impractical.
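
The epoch-based voting described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `mediate_epoch`, the `HOLD` label, the worker identifiers, and the strict-majority threshold are all assumptions made for the example, since the summary states only that the majority-agreed direction is forwarded and that inactivity counts as agreement with the current orientation.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical label meaning "keep the camera where it is".
HOLD = "hold"


def mediate_epoch(votes: Dict[str, str], active_workers: List[str]) -> str:
    """Pick the direction to forward to the user for one epoch.

    votes maps a worker id to the direction that worker submitted during
    the epoch (e.g. "left", "zoom in"). Workers listed in active_workers
    who submitted nothing are counted as voting HOLD, mirroring the idea
    that inactivity signals agreement with the current orientation.
    """
    if not active_workers:
        return HOLD

    tally = Counter(votes.get(worker, HOLD) for worker in active_workers)
    direction, count = tally.most_common(1)[0]

    # Assumption: require a strict majority of active workers; if no
    # direction reaches it, leave the camera orientation unchanged.
    if count * 2 > len(active_workers):
        return direction
    return HOLD


if __name__ == "__main__":
    workers = ["w1", "w2", "w3", "w4", "w5"]
    # Three of five workers say "left", so "left" is forwarded.
    print(mediate_epoch({"w1": "left", "w2": "left", "w3": "left"}, workers))
    # Only one worker votes; the four silent workers count as HOLD.
    print(mediate_epoch({"w1": "left"}, workers))
```

Counting silent workers as HOLD votes means a single eager worker cannot move the camera alone, which is one plausible reading of why inactivity is treated as agreement. Free-text answers would bypass this function entirely, since the paper forwards them directly rather than aggregating them.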

Relevance

Legion:View represents a significant conceptual advance in crowd-powered assistive technology by shifting from asynchronous, one-shot interactions to synchronous, continuous assistance — essentially creating a real-time remote sighted guide service powered by crowd workers. This concept directly prefigures commercial services like Be My Eyes (launched 2015) and Aira (launched 2017), which connect blind users with sighted volunteers or agents via live video. For accessibility practitioners, the paper highlights a fundamental limitation of static photo-based visual assistance: many real-world tasks require back-and-forth interaction, spatial guidance, and the ability to refine the visual information being captured. The three test scenarios — object identification requiring reorientation, spatial navigation, and searching through multiple items — represent common daily challenges that static image analysis alone cannot solve effectively. The system architecture, combining a mobile app with a server-based worker coordination layer and input mediator for merging multiple workers' guidance into coherent directions, established a pattern for real-time crowd-powered accessibility services. Although the evaluation was preliminary (five student workers, not blind participants), the work demonstrated the feasibility of the approach and motivated further research into real-time human-powered visual assistance.

Tags: visual question answering · crowdsourcing · blind users · real-time systems · assistive technology · human computation · VizWiz · mobile accessibility