"Hands On" Visual Recognition for Visually Impaired Users

Joan Sosa-García, Francesca Odone · 2017 · ACM Transactions on Accessible Computing · doi:10.1145/3060056

Summary

This paper presents a collaborative visual recognition system designed to help blind or visually impaired (BVI) users identify specific product instances, that is, to distinguish between brands, models, or types of objects that feel similar when handled. BVI individuals can often identify an object's category through touch and manipulation (for example, recognizing that they are holding a cereal box), but they cannot determine which specific product it is without visual information. The system uses a three-module pipeline: voice input (the user speaks the category, e.g., "pasta" or "cookies"), visual recognition (real-time image analysis against a gallery of known instances), and audio output (text-to-speech announces the identified product). All computation runs locally on the device, providing real-time feedback without requiring internet connectivity.

The visual recognition module uses Bag-of-Features (BoF) descriptors built on SURF features. The authors chose this over deep learning approaches because it scales easily to new objects: users can add a new product by capturing 10 images, without retraining a neural network. The researchers also developed and released the GLASSENSE-Vision dataset, which contains seven use cases (banknotes, cereals, cans, medicines, water bottles, deodorant sticks, tomato sauces) with gallery and query images captured from the close-up, handheld perspective typical of BVI users manipulating objects.
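The Bag-of-Features idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes local descriptors have already been extracted and a visual vocabulary already exists, it uses toy 2-D points in place of real SURF descriptors, and it matches by histogram intersection with a nearest-neighbour rule, which is an assumption made here for illustration. The names `Gallery`, `add_instance`, and `identify` are hypothetical.

```python
import math
from collections import defaultdict


def nearest_word(descriptor, vocabulary):
    """Index of the visual word (centroid) closest to one local descriptor."""
    return min(range(len(vocabulary)),
               key=lambda i: math.dist(descriptor, vocabulary[i]))


def bof_histogram(descriptors, vocabulary):
    """L1-normalized histogram of visual-word counts for one image."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        hist[nearest_word(d, vocabulary)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def intersection(h1, h2):
    """Histogram intersection; lies in [0, 1] for L1-normalized inputs."""
    return sum(min(a, b) for a, b in zip(h1, h2))


class Gallery:
    """Hypothetical per-category gallery of BoF histograms."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary
        self.entries = defaultdict(list)  # category -> [(label, histogram)]

    def add_instance(self, category, label, images):
        # Enrolling a product only appends histograms; nothing is retrained.
        for descriptors in images:
            hist = bof_histogram(descriptors, self.vocabulary)
            self.entries[category].append((label, hist))

    def identify(self, category, descriptors):
        # The spoken category restricts the search to one part of the gallery.
        candidates = self.entries.get(category, [])
        if not candidates:
            return None
        query = bof_histogram(descriptors, self.vocabulary)
        label, _ = max(candidates, key=lambda e: intersection(query, e[1]))
        return label


# Toy data: three visual words, one enrolled image per product.
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
gallery = Gallery(vocabulary)
gallery.add_instance("pasta", "brand_a", [[(0.1, 0.0), (0.0, 0.1), (0.9, 0.0)]])
gallery.add_instance("pasta", "brand_b", [[(0.0, 0.9), (0.1, 1.0), (0.0, 0.0)]])
result = gallery.identify("pasta", [(0.05, 0.05), (1.0, 0.1)])
print(result)  # brand_a
```

In this sketch a new product is enrolled by calling `add_instance` with its captured images (the paper reports 10 per product), which is why no retraining step appears anywhere.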

Key findings

The visual recognition module achieved high accuracy under challenging conditions: 97-100% for banknotes, cereals, cans, medicines, and deodorant sticks, and 92% for tomato sauces. Performance remained stable (above 95%) even when images were reduced to 60% resolution, demonstrating robustness to lower-quality cameras. The BoF approach also outperformed pre-trained CNN features in this specific use case.

Two proof-of-concept systems were evaluated: POC1 (smartphone-based) processed 5 frames per second, while POC2 (sensorized glasses from the GLASSENSE project) achieved 98% recognition in field tests with 14 BVI users aged 18-75. BVI users performed better with the glasses-based system than blindfolded sighted users did with the smartphone, suggesting that framing is more intuitive with a head-mounted camera.

In comparative testing against existing apps (Google Goggles, TapTapSee, LookTel Recognizer, Aipoly Vision), the proposed system achieved higher accuracy on instance recognition tasks. The existing apps either required precise framing (LookTel), relied on internet connectivity (TapTapSee, Google Goggles), or recognized only categories rather than specific instances (Aipoly).

User feedback was strongly positive, with high ratings for "easy to use" (median 5/5), "well integrated" (median 5/5), and "I would like to have it" (median 4.5/5). Users specifically appreciated the voice-only interaction and the ability to add their own personalized objects.

Relevance

This research demonstrates a practical approach to assistive computer vision that prioritizes the specific needs of BVI users: real-time local processing (no connectivity delays or privacy concerns), voice-based interaction (no precise framing required), and easy personalization (add new products without technical knowledge). The collaborative design, in which users provide context about the object category and the system handles fine-grained visual discrimination, is a valuable pattern for accessible AI systems. Rather than attempting fully automatic recognition of arbitrary objects, the system leverages the user's existing capabilities (tactile identification of shape and material) and supplements what they cannot perceive (brand-specific visual details).

For practitioners, the key insight is that off-the-shelf deep learning may not be optimal for accessibility applications requiring scalability and personalization. The image retrieval approach, while older, allows any user to add their own products without retraining, which is essential for a system that must accommodate individual preferences like favorite brands. The publicly available GLASSENSE-Vision dataset provides a benchmark specifically designed for the handheld, close-up acquisition conditions typical of BVI users.
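The collaborative pattern described here, where the user supplies the category and the system resolves the instance, can be sketched as a small orchestration function. This is only an illustration under stated assumptions: the function name `identify_product`, its parameters, and the spoken messages are hypothetical, and the speech, camera, and recognition modules are passed in as plain callables rather than tied to any real library.

```python
from typing import Callable, Optional, Set


def identify_product(
    listen: Callable[[], str],                          # voice input module
    grab_frame: Callable[[], object],                   # camera capture
    recognize: Callable[[str, object], Optional[str]],  # visual recognition
    speak: Callable[[str], None],                       # audio output module
    known_categories: Set[str],
) -> Optional[str]:
    """Run one voice -> vision -> audio cycle (hypothetical wiring)."""
    # The user contributes what touch already tells them: the category.
    category = listen().strip().lower()
    if category not in known_categories:
        speak(f"I do not know the category {category}")
        return None
    # The system contributes what touch cannot: the specific instance.
    label = recognize(category, grab_frame())
    speak(label if label is not None else "Product not recognized")
    return label


# Stub modules stand in for real speech, camera, and recognition components.
spoken = []
label = identify_product(
    listen=lambda: "Pasta",
    grab_frame=lambda: "frame-0",
    recognize=lambda category, frame: "brand_a" if category == "pasta" else None,
    speak=spoken.append,
    known_categories={"pasta", "cookies"},
)
```

Passing the modules as callables keeps the sketch independent of any particular speech or camera API, and it makes the division of labour explicit: the category narrows the search before any image is analysed.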

Tags: visual impairment · object recognition · computer vision · assistive technology · wearable technology · image retrieval · machine learning