VizWiz: Nearly Real-Time Answers to Visual Questions

Jeffrey P. Bigham, Chandrika Jayant, Hanjie Ji, Greg Little, Andrew Miller, Robert C. Miller, Aubrey Tatarowicz, Brandyn White, Samuel White, Tom Yeh · 2010 · Proceedings of the 2010 International Cross Disciplinary Conference on Web Accessibility (W4A) · doi:10.1145/1805986.1806020

Summary

This paper introduces VizWiz, a pioneering mobile application that enables blind and low-vision people to get nearly real-time answers to visual questions by connecting their smartphone cameras to remote paid workers on Amazon Mechanical Turk. Users take a photo with their phone, speak a question about what they see, and receive a spoken answer in approximately 30 seconds, at a cost of about 4-5 cents per question. The system exploits two converging trends: smartphones with cameras and internet connectivity (iPhone and Android), and scalable online labour marketplaces. VizWiz addresses the fundamental limitation that automated solutions (OCR, barcode readers, colour detectors) can only handle narrow, specialised visual tasks and cost hundreds to over a thousand dollars each, while human workers can answer virtually any visual question. The system uses QuikTurkit, a custom abstraction layer built on top of TurKit that achieves near-real-time responses by pre-posting jobs before answers are needed and intelligently posting multiple jobs simultaneously. This overcomes the fact that neither Mechanical Turk nor TurKit was designed for real-time interaction.
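The effect of pre-posting and of posting several jobs at once can be illustrated with a toy latency model. This is only a sketch and not QuikTurkit's actual implementation: the function name, its parameters, and every number in the usage note are hypothetical, and the model simply assumes that the first worker to arrive answers the question.

```python
def time_to_answer(recruit_times, answer_time, head_start=0.0):
    """Seconds the user waits after submitting a question (toy model).

    recruit_times: seconds each simultaneously posted job needs to attract
                   a worker (hypothetical values, not measured data).
    answer_time:   seconds a worker needs to answer once recruited.
    head_start:    seconds of recruiting already done before the question
                   is submitted, i.e. the effect of pre-posting jobs.
    """
    # Posting several jobs at once means the fastest recruitment wins.
    first_worker = min(recruit_times)
    # Pre-posting hides part (or all) of the recruitment delay.
    remaining_wait = max(0.0, first_worker - head_start)
    return remaining_wait + answer_time
```

With illustrative recruitment times of 60, 45 and 90 seconds and a 20-second answer, the model gives a 65-second wait without pre-posting, 35 seconds with a 30-second head start, and 20 seconds once the head start covers the whole recruitment delay.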

Key findings

The paper presents VizWiz through a compelling scenario of "Julie," a blind person who uses the system throughout her day: reading a handwritten notice on her mailbox, identifying who sent a letter, having a photograph described to her ("a bride and groom smiling and holding hands on the beach"), and determining which of two cans contains beans. Each question costs pennies, meaning a user could ask tens of thousands of general visual questions for the price of a single special-purpose automated device. The system was implemented for both iPhone and Android with accessible interfaces using double-tap interactions. VizWiz differs from prior human-powered accessibility systems like the ESP Game (image labelling for web accessibility) and IBM's Social Accessibility project (connecting blind web users to volunteers) in two fundamental ways: it targets real-time assistance rather than asynchronous help, and it operates in the physical world rather than on the web. Compared to existing services like ChaCha (5-15 minute response time, no images), VizWiz provides answers in about 30 seconds and handles user-submitted photographs.

Relevance

VizWiz is one of the most influential accessibility research projects of the past two decades, anticipating by years the commercial services (Be My Eyes, Aira) and AI approaches (GPT-4V, Google Lookout) that now serve the same need. The paper demonstrates a fundamental insight: human intelligence, accessed through crowdsourcing platforms, can bridge the gap between what automated systems can do and what disabled users actually need, and this bridge can be fast and cheap enough for practical daily use. For accessibility practitioners, VizWiz illustrates the "collective impact" principle: individual visual questions (what's on this can? does this outfit match?) are minor nuisances, but collectively they erode independence. The system also pioneered the concept of visual question answering (VQA) as an accessibility service, which has since become a major AI research direction. The cost comparison remains instructive: at 4-5 cents per question, a blind person could ask 20,000 questions for the cost of one specialised $1,000 assistive device, fundamentally reframing accessibility as a service rather than a product.
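The arithmetic behind the cost comparison can be checked with a short sketch. It assumes a device price of $1,000, which is the figure that 20,000 questions at 5 cents each implies. The function name is hypothetical and used only for illustration.

```python
def questions_for_budget(budget_dollars, cents_per_question):
    """Number of whole questions a budget covers at a fixed per-question price."""
    # Work in integer cents to avoid floating-point rounding.
    budget_cents = round(budget_dollars * 100)
    return budget_cents // cents_per_question


print(questions_for_budget(1000, 5))  # 20000
print(questions_for_budget(1000, 4))  # 25000
```

At the upper end of the 4-5 cent range the budget covers 20,000 questions, and at the lower end 25,000, so the "20,000 questions" figure is the conservative one.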

Tags: blind and low vision · crowdsourcing · assistive technology · mobile accessibility · human computation · visual question answering · independence · smartphone