Surfacing Variations to Calibrate Perceived Reliability of MLLM-generated Image Descriptions

Meng Chen, Akhil Iyer, Amy Pavel · 2025 · ASSETS 2025: 27th International ACM SIGACCESS Conference on Computers and Accessibility · doi:10.1145/3663547.3746393

Summary

This paper addresses a critical safety problem in AI-powered visual access technology: multimodal large language models (MLLMs) such as GPT-4o, Gemini, and Claude produce fluent, confident image descriptions that can contain fabricated content, misinterpretations, and omissions, all of which are extremely difficult for blind and low vision (BLV) users to detect without sight. BLV users already employ creative workarounds such as cross-checking descriptions across multiple AI tools and consulting sighted people, but these are time-consuming and impractical. The researchers developed a systematic approach to surface variations across multiple MLLM responses, making inconsistencies and unreliable claims visible to BLV users. The work makes three contributions: a design space for eliciting and presenting MLLM variations (covering dimensions such as elicitation method, comparison support, granularity, support indicators, provenance indicators, and modality), a prototype system implementing three variation presentation styles, and findings from a controlled user study with 15 BLV participants. The prototype queries three MLLMs (GPT-4o, Claude 3.7 Sonnet, Gemini 1.5 Pro) three times each, generating nine descriptions per image. It then uses Gemini 2.5 Pro with chain-of-thought prompting to decompose the descriptions into atomic facts, group them, and annotate their sources. From this it produces three output formats: a list of multiple descriptions, a variation-aware description (hierarchical markdown with variations highlighted inline), and a variation summary (organized into agreements, disagreements, and unique mentions).
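
The last step of the prototype pipeline, sorting grouped atomic facts into agreements, disagreements, and unique mentions, can be illustrated with a minimal Python sketch. This is not the authors' implementation: in the paper, decomposition and grouping are done by Gemini 2.5 Pro, whereas the sketch assumes the facts have already been extracted and grouped. The `AtomicFact` type, its `topic` field, the `summarize_variations` function, and the bucketing rules are all hypothetical, chosen only to make the three categories concrete.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomicFact:
    topic: str           # hypothetical grouping key, e.g. "shirt color"
    claim: str           # what the description asserts about that topic
    description_id: str  # hypothetical source label, e.g. "GPT-4o #1"


def summarize_variations(facts):
    """Bucket already-grouped facts into three categories.

    Assumed rules (not taken from the paper):
      - disagreement:   one topic, two or more distinct claims
      - agreement:      one topic, one claim, made by several descriptions
      - unique mention: one topic, one claim, made by a single description
    """
    # topic -> claim -> set of descriptions that made the claim
    by_topic = defaultdict(lambda: defaultdict(set))
    for fact in facts:
        by_topic[fact.topic][fact.claim].add(fact.description_id)

    summary = {"agreements": [], "disagreements": [], "unique_mentions": []}
    for topic, claims in by_topic.items():
        if len(claims) > 1:
            detail = {claim: sorted(ids) for claim, ids in claims.items()}
            summary["disagreements"].append((topic, detail))
            continue
        claim, ids = next(iter(claims.items()))
        bucket = "agreements" if len(ids) > 1 else "unique_mentions"
        summary[bucket].append((topic, claim, sorted(ids)))
    return summary


if __name__ == "__main__":
    # Made-up facts for illustration only; not data from the study.
    demo = [
        AtomicFact("shirt color", "red", "GPT-4o #1"),
        AtomicFact("shirt color", "orange", "Gemini 1.5 Pro #2"),
        AtomicFact("setting", "kitchen", "GPT-4o #1"),
        AtomicFact("setting", "kitchen", "Claude 3.7 Sonnet #1"),
    ]
    for bucket, items in summarize_variations(demo).items():
        print(bucket, items)
```

Keeping the source labels attached to every claim is what would let a presentation layer later say which models, or how many descriptions, support a given statement.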

Key findings

The study's main quantitative result is that surfacing variations increased users' ability to identify unreliable claims by 4.9x compared to single descriptions (mean 2.62 unreliable claims identified with the variation approach vs. 0.53 with single descriptions, p < 0.001). Presenting variations also significantly decreased perceived reliability of MLLM responses from 5.78/7 for single descriptions to 3.93/7 for the variation approach (p < 0.01). Fourteen of 15 participants preferred seeing variations over single descriptions. The variation summary was the most preferred presentation style (ranked first by 11 of 15 participants), followed by variation-aware descriptions (ranked second by 9 of 15). Participants used inconsistency between descriptions as the primary signal for unreliability (94% of claims in the variation condition were flagged due to inconsistency), while with single descriptions, lack of detail was the main indicator (54% of claims). Regarding support indicators, model source was preferred by 5 participants, percentage by 4, no indicator by 4, and language indicators by only 2. Participants found variations most useful for high-stakes scenarios (healthcare, medication, navigation) and subjective tasks (outfit selection, social media posting, room aesthetics). The work also revealed that BLV users tend to over-trust AI descriptions, and that making uncertainty visible helped calibrate this trust.
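
To make the four support-indicator options concrete (model source, percentage, language, or none), here is a small Python sketch of how one claim might be rendered under each. It is an assumption-laden illustration, not the prototype's code: the function name `render_support`, the `"Model #run"` label format, the output wording, and especially the cut-offs used for the language indicator are all hypothetical.

```python
def render_support(claim, sources, total, style="model_source"):
    """Format one claim with a support indicator.

    claim:   the claim text
    sources: labels of the descriptions that made the claim,
             in an assumed "Model #run" format such as "GPT-4o #1"
    total:   number of descriptions generated for the image
    style:   "none", "model_source", "percentage", or "language"
    """
    if style == "none":
        return claim
    if style == "model_source":
        # Collapse repeated runs of the same model into one name.
        models = sorted({label.split(" #")[0] for label in sources})
        return f"{claim} (from {', '.join(models)})"

    share = len(sources) / total
    if style == "percentage":
        return f"{claim} ({share:.0%} of descriptions)"
    if style == "language":
        # Hypothetical cut-offs; the paper's wording rules are not given here.
        if share >= 0.67:
            hedge = "Most descriptions say"
        elif share >= 0.34:
            hedge = "Some descriptions say"
        else:
            hedge = "Few descriptions say"
        return f"{hedge}: {claim}"
    raise ValueError(f"unknown style: {style}")


if __name__ == "__main__":
    # Made-up claim and sources, with nine descriptions per image as in the prototype.
    sources = ["GPT-4o #1", "GPT-4o #2", "Gemini 1.5 Pro #1"]
    for style in ("none", "model_source", "percentage", "language"):
        print(render_support("The mug is blue", sources, 9, style))
```

The same underlying count drives all three non-empty styles, so a tool could plausibly let each user pick the indicator they prefer, which fits the split preferences reported above.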

Relevance

This research has immediate practical implications for the rapidly growing ecosystem of MLLM-powered visual access tools (Be My AI, Seeing AI, AccessAI, etc.) used by millions of BLV people daily. The finding that single AI descriptions create dangerous overreliance, and that surfacing variations can dramatically improve error detection, should inform how these tools present information. The design space and prototype offer a concrete blueprint for tool developers to implement variation-aware features. The work is particularly timely as MLLMs become the dominant interface between BLV users and visual information, replacing older computer vision approaches. The emphasis on trust calibration rather than simply improving accuracy acknowledges that AI errors cannot be eliminated, but users can be empowered to detect them. Future directions include extending variation surfacing to video descriptions, computer use agents, and other AI-mediated accessibility tools where reliability assessment is critical.

Tags: blindness · low vision · image descriptions · multimodal AI · large language models · AI reliability · trust calibration · screen readers · visual access technology

Standards referenced: WCAG 2.1