Analyzing Visual Questions from Visually Impaired Users

Erin L. Brady · 2011 · The Proceedings of the 13th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS) · doi:10.1145/2049536.2049622

Summary

This doctoral consortium paper analyzes the types of visual questions that visually impaired users ask through VizWiz, a mobile phone application that provides near-real-time answers to visual questions. VizWiz lets users take a photo with their phone, speak a question about its contents, and receive answers from a combination of sources: crowd workers on Amazon Mechanical Turk, IQ Engine object recognition software, and the user's social network via Twitter or email. The study aimed to understand what visual information blind users most want to know about their surroundings and how well different answer sources serve those needs. The researcher randomly sampled 100 questions from the more than 5,000 submitted to VizWiz, transcribed the audio questions, and coded them into categories in two passes. The 100 sampled questions received 195 answers in total, and each answer was evaluated for whether it correctly answered the user's question.

Key findings

The analysis identified six major categories of visual questions: Identification ("What is this object?"), Description ("What does this look like?"), Spatial ("Where is something located?"), Reading ("What does this text say?"), Answering (questions requiring reasoning about visual content), and Other. The study found that existing automated tools like IQ Engine and Google Goggles could handle straightforward identification questions but failed at more complex queries requiring human reasoning or contextual understanding. Questions that needed description, spatial information, or nuanced interpretation were best served by human crowd workers. The hybrid approach of combining automated object recognition with human crowdsourcing proved beneficial: users could select the answer source best suited to their question type, or use both when unsure. The analysis showed that visually impaired users' information needs are diverse and often go beyond simple object identification to include contextual and relational understanding of their environment.

Relevance

This early study of VizWiz laid groundwork that has become increasingly relevant as visual AI assistants have proliferated. The taxonomy of visual question types identified here (identification, description, spatial, reading, and answering, meaning questions that require reasoning about visual content) maps closely to the capabilities now offered by modern AI vision systems and remains a useful framework for evaluating how well these tools serve blind users. The finding that many real-world questions require human-level reasoning, not just object recognition, anticipated a key challenge that AI developers continue to address. For accessibility practitioners, this research underscores that understanding user motivations and question patterns is essential for designing effective visual assistance tools. The work also highlights the value of combining automated and human-powered approaches, a design pattern still used in services like Be My Eyes and Aira.

Tags: blindness and low vision · crowdsourcing · computer vision · mobile accessibility · visual question answering · object recognition · assistive technology