AIDE: An Automatic Image Description Engine for Review Imagery

Rachana Sreedhar, Nicole Tan, Jingyue Zhang, Kim Jin, Spencer Gregson, Eli Moreta-Feliz, Niveditha Samudrala, Shrenik Sadalgi · 2022 · Proceedings of the 19th International Web for All Conference (W4A) · doi:10.1145/3493612.3520465

Summary

This paper from Wayfair presents AIDE, a multimodal machine learning system that automatically generates contextual alt-text for user-submitted review images in e-commerce. Such imagery is particularly inaccessible because it is user-generated, unpredictable in quality, and rarely accompanied by meaningful descriptions. The authors note that 66% of sighted shoppers consider review photos critical for purchasing decisions and 75% prefer user-submitted images over staged product photos for authenticity, yet typical alt-text for review images reads simply "Customer Image." Surveys with blind and low-vision (BLV) shoppers revealed that product color and unique product features are the information they most want from review image descriptions, along with scene context. AIDE combines two parallel processing modules. The scene description module uses the VinVL object detection model and Oscar image captioning to identify objects and generate visual descriptions. The review text parsing module uses Bidirectional LSTM-CNN Named Entity Recognition models to extract product-specific keywords (color, features) from the reviewer's written comments. GPT-3 then synthesizes the outputs of both modules into coherent, human-readable alt-text. For example, a review photo of a green couch in a living room paired with a comment about "mint green color" produces: "Photo may be: There is a green couch in the living room. There is a table and pictures on the wall."
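The data flow described in this summary (a visual branch and a text branch feeding a language model) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every function name, the `COLOR_WORDS` vocabulary, and the prompt wording are hypothetical, the visual branch returns canned values instead of running VinVL or Oscar, a vocabulary lookup stands in for the NER models, and `echo_model` stands in for the GPT-3 call.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical color vocabulary; the paper uses trained NER models instead.
COLOR_WORDS = {"green", "blue", "red", "gray", "grey", "white", "black", "beige"}


@dataclass
class SceneDescription:
    """Output of the visual branch: detected objects plus a caption."""
    objects: List[str]
    caption: str


@dataclass
class ReviewKeywords:
    """Output of the text branch: product details named by the reviewer."""
    colors: List[str] = field(default_factory=list)


def describe_scene(image_id: str) -> SceneDescription:
    # Placeholder for object detection and captioning (VinVL and Oscar in the
    # paper). Returns canned values so the sketch runs without any model.
    return SceneDescription(
        objects=["couch", "table", "pictures"],
        caption="a couch in the living room",
    )


def parse_review(text: str) -> ReviewKeywords:
    # Placeholder for the NER models: a plain vocabulary lookup.
    tokens = [t.strip(".,!?\"'").lower() for t in text.split()]
    colors: List[str] = []
    for token in tokens:
        if token in COLOR_WORDS and token not in colors:
            colors.append(token)
    return ReviewKeywords(colors=colors)


def build_prompt(scene: SceneDescription, keywords: ReviewKeywords) -> str:
    # Merge both branches into one text prompt for the language model.
    return (
        f"Caption: {scene.caption}\n"
        f"Objects: {', '.join(scene.objects)}\n"
        f"Reviewer-mentioned colors: {', '.join(keywords.colors) or 'none'}\n"
        "Write a short image description."
    )


def generate_alt_text(image_id: str, review_text: str,
                      synthesize: Callable[[str], str]) -> str:
    scene = describe_scene(image_id)
    keywords = parse_review(review_text)
    # "Photo may be:" mirrors the hedged prefix in the example output above.
    return "Photo may be: " + synthesize(build_prompt(scene, keywords))


def echo_model(prompt: str) -> str:
    # Stand-in for the language model call; it only reports the prompt size.
    return f"(description generated from a {len(prompt.splitlines())}-line prompt)"


if __name__ == "__main__":
    print(generate_alt_text("review-123", "Love the mint green color!", echo_model))
```

Passing the synthesis step in as a callable keeps the two branches testable without a model. A real text-generation client could be swapped in for `echo_model` without touching the rest of the sketch.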

Key findings

In a human-in-the-loop evaluation, two sighted reviewers each rated 75 generated alt-texts. They approved 74% as accurate (all mentioned objects present and correctly described), and 16% included a non-existent object. AIDE captured 73.3% of the distinct items in the images and was selected as the better descriptor over pre-existing alt-text 85.3% of the time. Virtual interviews with 18 BLV shoppers indicated strong accessibility impact: 83.3% reported that AIDE created a more inclusive online shopping environment (statistically significant, p=.005), 75% felt they had access to all the information needed to understand the product (p<.001), participants wanted to continue engaging with AIDE descriptions (M=3.528/5), and they were very likely to recommend it to other BLV people (M=4.278/5). The multimodal approach, which combines visual analysis with textual review content, is key to generating context-specific rather than generic descriptions, addressing BLV shoppers' primary frustration with existing automated alt-text systems.
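As a rough illustration of how rates like those from the sighted-reviewer evaluation could be tallied from per-image ratings, here is a small sketch. The `Rating` fields and the `summarize` helper are hypothetical, the sample ratings are invented for the example and are not data from the paper, and pooling item counts across images is an assumption; this summary does not say how the paper aggregated its item coverage figure.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Rating:
    """One reviewer's judgment of one generated alt-text (hypothetical schema)."""
    approved: bool                 # every mentioned object present and correct
    has_nonexistent_object: bool   # description names something not in the image
    items_in_image: int            # distinct items the reviewer counted
    items_captured: int            # how many of those the alt-text mentions
    preferred_over_existing: bool  # judged better than the pre-existing alt-text


def summarize(ratings: List[Rating]) -> Dict[str, float]:
    if not ratings:
        raise ValueError("at least one rating is required")
    n = len(ratings)
    total_items = sum(r.items_in_image for r in ratings)
    captured = sum(r.items_captured for r in ratings)
    return {
        "approval_rate": sum(r.approved for r in ratings) / n,
        "nonexistent_object_rate": sum(r.has_nonexistent_object for r in ratings) / n,
        # Pooled over all images (an assumption); 0.0 if no items were counted.
        "item_coverage": captured / total_items if total_items else 0.0,
        "preference_rate": sum(r.preferred_over_existing for r in ratings) / n,
    }


if __name__ == "__main__":
    # Invented sample ratings, for illustration only.
    sample = [
        Rating(True, False, 4, 3, True),
        Rating(True, False, 2, 2, True),
        Rating(True, False, 3, 2, False),
        Rating(False, True, 3, 1, True),
    ]
    for name, value in summarize(sample).items():
        print(f"{name}: {value:.1%}")
```

Averaging per-image coverage instead of pooling would weight sparse and cluttered photos equally, so the two aggregation choices can give different figures on the same ratings.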

Relevance

This work tackles one of the most practically impactful alt-text challenges: user-generated content at scale. While product images on e-commerce sites can be manually described by the retailer, review images are uploaded by thousands of customers and cannot feasibly receive manual alt-text. For accessibility practitioners, AIDE demonstrates a useful principle: combining computer vision with available textual context (here, review comments) can produce more specific alt-text than image analysis alone. Mining reviewer text for product-specific details like color, which computer vision alone may struggle to convey accurately, is a multimodal strategy applicable to any context where images appear alongside descriptive text (social media, news, documentation). The use of GPT-3 to synthesize technical outputs into human-readable descriptions presages the now-widespread use of large language models for accessibility. The strong user feedback around independence and inclusion highlights that accessible shopping is not just a convenience issue but affects BLV people's autonomy and economic participation.

Tags: alt text · computer vision · blindness · visual impairment · screen readers · machine learning · natural language processing · image accessibility · e-commerce