"It's trained by non-disabled people": Evaluating How Image Quality Affects Product Captioning with Vision-Language Models

Kapil Garg, Xinru Tang, Jimin Heo, Dwayne R. Morgan, Darren Gergle, Erik B. Sudderth, Anne Marie Piper · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3791309

Summary

Garg and colleagues investigate how well Vision-Language Models (VLMs) caption product images taken by blind and low-vision (BLV) people, a high-stakes everyday task that increasingly depends on tools like Be My AI, Microsoft Seeing AI, and general-purpose assistants such as ChatGPT and Gemini. The paper combines two complementary studies. Study 1 is an online survey of 86 BLV adults in the United States who use AI captioning tools. It probes which products they caption (food, toiletries, medications, unknown household items), when they prefer AI over human assistance (Aira, Be My Eyes), how they perceive image-quality issues such as blur, framing, lighting, hand position, distance, and rotation, and what kinds of captioning errors they encounter. Study 2 is a systematic model evaluation. The team curated 1,859 real product images taken by BLV people (729 high-quality, 1,130 low-quality), derived from the VizWiz dataset, manually annotated each with ground truth for product type, brand, and variety, and generated captions with four VLMs (GPT-4.1, Gemini 2.5 Flash, Llama 3.2 90B, and Molmo 72B) using a fixed prompt. Four researchers then hand-coded the accuracy of all 7,436 captions (1,859 images × 4 models; Krippendorff's alpha 0.859) and fit logistic-regression models relating accuracy to image-quality dimensions (blur, framing, rotation) and product properties (rounded labels, text panels such as nutrition facts). The framing is explicitly disability-centered: the authors argue that VLM benchmarks built on clean datasets like ImageNet and MS COCO hide the failure modes that matter most to BLV users.
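The inter-coder agreement figure above is a Krippendorff's alpha. The summary does not say which variant was used or how many coders rated each caption, so the following is only a minimal sketch of the nominal-scale version, run on made-up correct/incorrect codes rather than the study's data. The function name and the toy labels are assumptions for illustration.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Nominal-scale Krippendorff's alpha.

    `units` is a list of lists; each inner list holds the labels that
    different coders assigned to one item. Items with fewer than two
    labels carry no agreement information and are skipped.
    """
    coincidences = Counter()  # (label_a, label_b) -> weighted pair count
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Every ordered pair of labels from different coders counts 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()  # label -> marginal count n_c
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least two pairable labels")

    # Observed and expected disagreement (off-diagonal mass only).
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


# Toy example (not study data): two coders, four captions, 1 = correct.
toy_codes = [[1, 1], [0, 0], [1, 0], [1, 1]]
print(round(krippendorff_alpha_nominal(toy_codes), 3))
```

An alpha of 1 means perfect agreement and 0 means agreement no better than chance, so the reported 0.859 sits well toward the reliable end of that range.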

Key findings

On high-quality BLV-taken photos, the closed-source VLMs perform very well (GPT-4.1 at 98.5% and Gemini 2.5 Flash at 95.7% product-identification accuracy), but on the low-quality subset accuracy collapses: GPT-4.1 to 74.9%, Gemini 2.5 Flash to 71.7%, Llama 3.2 90B to 44.1%, and Molmo 72B to 36.1%. When multiple issues co-occur (e.g., blur + framing + rotation), even GPT-4.1 falls to 69.4%. Regression analysis shows that each of the three quality dimensions significantly reduces the odds of correct identification (blur by 88.3%, framing by 84.5%, rotation by 79.5%). Text panels (nutrition labels, box-recipe backs) hurt accuracy more than rounded labels alone, and the combination of a rounded label plus a text panel is especially damaging for the models other than GPT-4.1. The error types reported by survey participants match the model audit. Captions frequently omit brand, variety, or allergen detail ("it told me it was a package of meat" rather than pork chops). Partially correct captions (agave nectar called maple syrup) can be more dangerous than obviously wrong ones. Hallucinations sometimes reach into life-safety territory (canned pears captioned as canned peaches, which risked an allergic reaction). Survey participants also reported that current photo-guidance features, such as Seeing AI's framing beeps and Be My AI's retake prompts, are inconsistent, and two-thirds said taking the photo is the hardest part of the workflow.
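The regression figures are percent changes in odds, not percentage-point drops in accuracy, and the two are easy to confuse. The sketch below converts each reported change into an odds ratio and a log-odds coefficient, then applies it to a baseline probability. The 90% baseline is a hypothetical value chosen only for illustration; the summary does not state the model's intercept, and the function names are likewise assumptions.

```python
import math


def odds_ratio_from_percent_change(percent_change):
    """Convert a percent change in odds (e.g. -88.3) to an odds ratio."""
    return 1.0 + percent_change / 100.0


def apply_odds_ratio(baseline_probability, odds_ratio):
    """Return the probability after scaling the baseline odds by odds_ratio."""
    baseline_odds = baseline_probability / (1.0 - baseline_probability)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Percent changes in odds as reported in the key findings.
reported_changes = {"blur": -88.3, "framing": -84.5, "rotation": -79.5}

# Hypothetical baseline: a 90% chance of a correct caption with no issue present.
ASSUMED_BASELINE = 0.90

for issue, change in reported_changes.items():
    ratio = odds_ratio_from_percent_change(change)
    coefficient = math.log(ratio)  # the matching logistic-regression coefficient
    degraded = apply_odds_ratio(ASSUMED_BASELINE, ratio)
    print(f"{issue}: OR={ratio:.3f}, log-odds={coefficient:.2f}, "
          f"P(correct) {ASSUMED_BASELINE:.0%} -> {degraded:.1%}")
```

Under that assumed baseline, an 88.3% reduction in odds takes a 90% chance of a correct caption to roughly 51%, which is a severe drop but not the near-zero accuracy a naive reading of "-88.3%" might suggest.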

Relevance

This is the first large-scale, disability-centered benchmark of product captioning for BLV users, and it provides a rare quantitative case that accessibility tooling built on general-purpose VLMs is far less reliable in the field than marketing or benchmark scores suggest. For practitioners, the findings argue against relying on a single AI caption for medication, allergen, or flavor identification and in favor of multi-step photo guidance, partial abstention, and honest uncertainty communication in the caption itself. For tool and model builders, the paper offers concrete recommendations: curate training data that includes realistically degraded BLV-taken photos, fine-tune on paired clean/distorted images, invest in image repair and inpainting at inference time, and use evaluation metrics that prioritize specific product details (brand, variety) rather than generic category match. Limitations include a U.S./English-only sample, a binary treatment of image-quality issues, and a focus on identification accuracy rather than overall caption quality, all of which future cross-cultural and multilingual work should address.
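To make the last recommendation concrete, here is one hypothetical way an evaluation metric could reward specific details over a generic category match. This is not the paper's scoring procedure (the authors used human coding). The weights, the `ProductLabel` type, the made-up brand "Acme", and the naive substring matching are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductLabel:
    """Ground-truth annotation for one image (fields mirror the summary)."""
    product_type: str
    brand: str
    variety: str


# Hypothetical weights: specific details count for more than the category.
ASSUMED_WEIGHTS = {"product_type": 0.2, "brand": 0.4, "variety": 0.4}


def detail_weighted_score(caption, truth, weights=ASSUMED_WEIGHTS):
    """Sum the weights of every ground-truth field mentioned in the caption.

    Matching is a naive case-insensitive substring test, which stands in
    for the human coding used in the paper.
    """
    text = caption.lower()
    score = 0.0
    for field_name, weight in weights.items():
        value = getattr(truth, field_name).lower()
        if value and value in text:
            score += weight
    return score


# Made-up example: the brand and product are invented for illustration.
truth = ProductLabel(product_type="syrup", brand="Acme", variety="agave nectar")
generic = detail_weighted_score("A bottle of syrup.", truth)
specific = detail_weighted_score("A bottle of Acme agave nectar syrup.", truth)
print(generic, specific)
```

With these weights, a caption that names only the category earns a small fraction of the score, so a model cannot look accurate by saying "a bottle of syrup" when the user needs to know which syrup it is.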

Tags: blind and low vision · vision-language models · image captioning · product identification · hallucinations · image quality · disability-centric evaluation · AI accessibility · assistive technology · Be My AI · Seeing AI