AIDE: Automatic and Accessible Image Descriptions for Review Imagery in Online Retail

Rachana Sreedhar, Nicole Tan, Jingyue Zhang, Kim Jin, Spencer Gregson, Eli Moreta-Feliz, Niveditha Samudrala, Shrenik Sadalgi · 2022 · Proceedings of the 19th International Web for All Conference (W4A) · doi:10.1145/3493612.3520453

Summary

This paper from the Wayfair Next team presents AIDE (Automatic Image Description Engine), a multi-modal system that automatically generates alt-text for user-submitted review photos on e-commerce sites. While product images on retail sites sometimes have alt-text, customer review photos (which 75% of sighted shoppers prefer over staged imagery for authenticity) almost never do, leaving blind and low vision (BLV) shoppers unable to access this valuable content. The researchers first surveyed 54 visually impaired participants: 88.9% identified missing alt-text on images as their biggest challenge when shopping online, and 53.7% said they always browse reviews. A follow-up moderated study with 7 BLV participants identified the most desired features in review image descriptions: product color (rated 4.8/5), unique product features (4.71/5), room setting (3.75/5), and an image overview. AIDE combines two parallel modules: a scene description module that uses computer vision (VinVL object-attribute detection fed into an Oscar image captioning model) to describe the visual scene, and a review text parsing module that uses Named Entity Recognition (NER) models built on a bidirectional LSTM-CNN architecture to extract product-specific attributes such as color, style, material, and comfort from review comments. GPT-3 then combines these outputs into human-readable alt-text, prefixed with "Photo may be" to signal that accuracy is not guaranteed.
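
The data flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: every function and class name is hypothetical, the scene module returns a canned caption in place of the VinVL and Oscar models, a keyword lookup stands in for the NER models, and a plain string template stands in for the GPT-3 merging step. The two modules are also called one after the other here for simplicity, although the paper describes them as parallel. Only the overall shape (two inputs, one merged output, and the "Photo may be" prefix) comes from the summary.

```python
from dataclasses import dataclass, field

PREFIX = "Photo may be"  # hedging prefix reported in the paper


@dataclass
class SceneDescription:
    caption: str  # what an image captioning model would produce


@dataclass
class ReviewAttributes:
    # attribute label -> value, e.g. {"color": "gray"}
    attributes: dict[str, str] = field(default_factory=dict)


def describe_scene(image_id: str) -> SceneDescription:
    """Hypothetical stand-in for the scene description module.

    A real system would run object-attribute detection and captioning
    on the image; here a fixed caption is returned.
    """
    return SceneDescription(caption="a sofa in a living room next to a window")


def parse_review(review_text: str) -> ReviewAttributes:
    """Hypothetical stand-in for the review text parsing module.

    A keyword lookup replaces the NER models; the vocabulary is illustrative.
    """
    vocab = {
        "color": ["gray", "blue", "white"],
        "material": ["velvet", "leather", "wood"],
    }
    words = review_text.lower().replace(",", " ").replace(".", " ").split()
    found: dict[str, str] = {}
    for label, terms in vocab.items():
        for term in terms:
            if term in words:
                found[label] = term
                break  # keep the first match per attribute
    return ReviewAttributes(found)


def combine(scene: SceneDescription, attrs: ReviewAttributes) -> str:
    """Hypothetical stand-in for the language-model merging step (a template)."""
    body = scene.caption.rstrip(".")
    if attrs.attributes:
        details = ", ".join(f"{k}: {v}" for k, v in sorted(attrs.attributes.items()))
        body = f"{body} ({details})"
    return f"{PREFIX} {body}."


def generate_alt_text(image_id: str, review_text: str) -> str:
    """Run both modules and merge their outputs into one alt-text string."""
    return combine(describe_scene(image_id), parse_review(review_text))
```

Calling `generate_alt_text("img-1", "Love this gray velvet sofa.")` merges the canned caption with the attributes found in the review text. Keeping the merging step behind a single function makes it straightforward to swap the template for a language model without touching the two upstream modules.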

Key findings

In technical evaluation, 74% of AIDE-generated alt-text was approved by human raters as accurately describing the image contents — compared to only 67.4% for a similar system applied to social media content. Only 16% of AIDE descriptions included objects not present in the image (false positives), and the system captured 73.33% of distinct items in each image. AIDE was selected as a better descriptor than existing on-site alt-text 85.33% of the time. The evaluative user study with 18 visually impaired participants yielded strong results: 83.3% reported AIDE created a more inclusive shopping environment; participants rated their likelihood of recommending AIDE to other BLV people at 4.278 out of 5; and 75% felt they had access to all information needed to understand the product. Participants emphasized that AIDE increased feelings of independence — many described typically needing sighted assistance to shop online. A notable finding was that congenitally blind participants differed from those with acquired blindness in their alt-text preferences: congenitally blind users expressed less interest in color descriptions, while those who lost vision later specifically valued color and style information. Participants strongly preferred coherence between review text and image descriptions, finding inconsistencies confusing.
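
To make the three technical metrics concrete, the sketch below shows one plausible way to compute an approval rate, a false-positive rate, and a per-image item coverage from rater annotations. The summary does not say how the authors computed these figures, so the record layout (`Rating`), the function names, and the definitions used here (a description counts as a false positive if it names at least one item absent from the image; coverage is averaged per image) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rating:
    """One hypothetical rater annotation for one image/description pair."""
    approved: bool                  # rater judged the description accurate
    items_in_image: set[str]        # distinct items the rater saw in the image
    items_in_description: set[str]  # distinct items the description mentions


def approval_rate(ratings: list[Rating]) -> float:
    """Share of descriptions the raters approved."""
    return sum(r.approved for r in ratings) / len(ratings)


def false_positive_rate(ratings: list[Rating]) -> float:
    """Share of descriptions naming at least one item not in the image."""
    flagged = sum(bool(r.items_in_description - r.items_in_image) for r in ratings)
    return flagged / len(ratings)


def mean_item_coverage(ratings: list[Rating]) -> float:
    """Average, over images, of the fraction of image items the description names."""
    per_image = [
        len(r.items_in_image & r.items_in_description) / len(r.items_in_image)
        for r in ratings
        if r.items_in_image  # skip images with no annotated items
    ]
    return sum(per_image) / len(per_image)


# Toy annotations, invented purely to exercise the functions.
toy_ratings = [
    Rating(True, {"sofa", "rug", "lamp"}, {"sofa", "rug"}),
    Rating(True, {"chair", "table"}, {"chair", "table"}),
    Rating(False, {"bed"}, {"bed", "dog"}),
    Rating(True, {"desk", "chair"}, {"desk"}),
]
```

Averaging coverage per image, rather than pooling all items across images, keeps a single cluttered photo from dominating the score; whether the paper made the same choice is not stated in this summary.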

Relevance

This research addresses a practical and growing accessibility gap in e-commerce. As online shopping increasingly relies on visual user-generated content, BLV shoppers are excluded from information that sighted shoppers consider essential for purchasing decisions. AIDE demonstrates that combining computer vision with contextual text from reviews produces meaningfully better descriptions than either approach alone, a key insight for any organization generating automated alt-text. The finding that context-specific descriptions outperform generic image captions reinforces the broader principle that alt-text should be tailored to its purpose, not one-size-fits-all. The distinction between users with congenital blindness and users with acquired blindness is particularly valuable for practitioners, suggesting that descriptions personalized to user preferences may be more effective than universal ones. Limitations include the small sample size, the system's current restriction to a single product category (furniture), and the lack of counterbalanced conditions in the evaluative study. The reliance on GPT-3 for text generation also introduces variability and potential inaccuracies that the "Photo may be" prefix only partially addresses.

Tags: alternative text · image description · online shopping · blindness and low vision · computer vision · natural language processing · user-generated content · e-commerce accessibility

Standards referenced: WCAG