Going Beyond One-Size-Fits-All Image Descriptions to Satisfy the Information Wants of People Who are Blind or Have Low Vision
Abigale Stangl, Nitin Verma, Kenneth R. Fleischmann, Meredith Ringel Morris, Danna Gurari · 2021 · ASSETS '21: The 23rd International ACM SIGACCESS Conference on Computers and Accessibility · doi:10.1145/3441852.3471233
Summary
Current image description practices typically produce a single, one-size-fits-all description for each image, yet the same image can appear in very different contexts (news websites, e-commerce platforms, social media feeds, travel sites, and personal photo libraries) where users have fundamentally different information goals. Stangl et al. introduce the concept of "scenarios" as a contextual factor that should shape image descriptions, defining a scenario as the combination of an information goal (what the user wants to learn) and a source (where the image is encountered). In in-person interviews with 28 people who are blind or have low vision (BLV), the researchers presented five sample images, each under five scenarios: (A) learning about working conditions from a news site, (B) purchasing a gift from an e-commerce site, (C) finding out about a friend's activities on social media, (D) planning a trip from a travel site, and (E) sharing a personal photo. This yielded 700 responses (28 participants × 25 image-scenario combinations). The researchers used both inductive topic-based analysis and deductive term-based analysis to identify what content participants wanted described. The inductive analysis produced seven main information topic codes (identification of scene content, attributes of specific content, geographic details, activity, relationship, experience, and intent), while the deductive analysis categorized the specific terms participants used under four parent codes: people, environment, activity/interaction, and objects.
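The scenario definition and the study design described above can be sketched as a small data structure. This is a minimal illustration, not code from the paper: the `Scenario` class, its field names, and the placeholder image labels are assumptions made for this example, and the source for scenario E is inferred from the contexts listed in the summary.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Scenario:
    """A scenario pairs an information goal with the source of the image."""
    label: str
    information_goal: str  # what the user wants to learn
    source: str            # where the image is encountered


# The five scenarios, paraphrased from the summary.
SCENARIOS = [
    Scenario("A", "learn about working conditions", "news site"),
    Scenario("B", "purchase a gift", "e-commerce site"),
    Scenario("C", "find out about a friend's activities", "social media"),
    Scenario("D", "plan a trip", "travel site"),
    # Assumption: the summary does not name a source for scenario E.
    Scenario("E", "share a personal photo", "personal photo library"),
]

# Placeholder identifiers; the actual study images are not reproduced here.
IMAGES = [f"image_{i}" for i in range(1, 6)]
NUM_PARTICIPANTS = 28

# Every image is paired with every scenario.
combinations = list(product(IMAGES, SCENARIOS))
total_responses = NUM_PARTICIPANTS * len(combinations)

print(len(combinations))  # 25
print(total_responses)    # 700
```

Modeling the scenario as an immutable pair keeps the two parts of the definition explicit: changing either the goal or the source produces a different scenario, even for the same image.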
Key findings
The central finding is that scenarios significantly influence what BLV people want in image descriptions. The same image elicited markedly different content wants depending on the scenario. For the news website scenario, participants prioritized activities of people and attributes of the work setting. For e-commerce, they focused on attributes of purchasable objects (clothing details, furniture styling, food characteristics). For social media, they wanted to understand the poster's intent and relationships. For travel planning, they sought geographic details and landscape attributes. For photo sharing, they focused on identification of places and activities of people. The deductive term-based analysis revealed which content types were universal (wanted across all five scenarios) and which were scenario-specific. Universal content, which forms a "minimum viable description", includes identity or names of people, clothing style, gender, location type, name of place, scenery, climate, look and feel of environments, and general descriptions of what is happening and what food is present. Scenario-specific content includes detailed attributes such as height, body posture, and hair color (wanted in only one scenario), as well as profession, race/diversity, and building style. Participants also requested information not typically found in image descriptions, such as the taste of food, the intent of the photographer, and the emotional experience of people in the image.
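The split between universal and scenario-specific content can be illustrated with set operations. The sketch below is not the authors' analysis procedure, and the `wanted` table is toy data: it combines two of the universal content types with the per-scenario priorities named in the summary, purely to show the mechanics. The paper's actual term counts are not reproduced here.

```python
from collections import Counter

# Toy data (an assumption for illustration): content types wanted per scenario.
wanted = {
    "news": {"location type", "clothing style",
             "activities of people", "work setting attributes"},
    "e-commerce": {"location type", "clothing style", "object attributes"},
    "social media": {"location type", "clothing style",
                     "poster intent", "relationships"},
    "travel": {"location type", "clothing style",
               "geographic details", "landscape attributes"},
    "photo sharing": {"location type", "clothing style",
                      "place identification", "activities of people"},
}

# Universal content: wanted in every scenario (intersection of all sets).
universal = set.intersection(*wanted.values())

# Count the number of scenarios in which each content type appears.
counts = Counter(term for terms in wanted.values() for term in terms)

# Content wanted in exactly one scenario.
single_scenario = {term for term, n in counts.items() if n == 1}

print(sorted(universal))  # ['clothing style', 'location type']
print(counts["activities of people"])  # 2
```

In this toy table, "activities of people" falls between the two extremes: it is wanted in more than one scenario but not in all of them, so it is neither universal nor unique to a single scenario.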
Relevance
This paper has direct implications for anyone writing alternative text or building automated image description systems. The key practical insight is that a single alt text string cannot serve all purposes — content authors and AI systems should consider where an image appears and what the user likely wants to know. The minimum viable description framework provides a concrete starting point: certain content types (people identification, location, general activity) should always be included, while additional detail should be context-specific. For web developers, this means alt text for a product image on an e-commerce site should emphasize different attributes than the same image used in a news article. For AI researchers building image captioning systems, this motivates context-aware architectures that accept both the image and its deployment scenario as inputs. The research also raises important ethical questions about describing people's perceived gender, race, and ethnicity — information participants consistently wanted but that carries risks of misrepresentation and emotional harm. This work contributes to a growing body of evidence that accessibility is not just about providing descriptions, but about providing the right descriptions for the right contexts.
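For developers, the practical insight above can be sketched as a function that always includes the minimum viable content and then appends context-specific detail. Everything in this example is hypothetical: the `compose_description` function, the `CONTEXT_EXTRAS` mapping, the key names, and the sample facts are assumptions for illustration, not an API or dataset from the paper.

```python
from typing import Dict, List

# Content types that are always included (per the minimum viable description idea).
MINIMUM_VIABLE: List[str] = ["people", "location", "activity"]

# Hypothetical mapping from deployment context to extra content types.
CONTEXT_EXTRAS: Dict[str, List[str]] = {
    "news": ["work_setting"],
    "e-commerce": ["object_attributes"],
    "social_media": ["poster_intent", "relationships"],
    "travel": ["geographic_details", "landscape"],
    "photo_sharing": ["place_name"],
}


def compose_description(facts: Dict[str, str], context: str) -> str:
    """Join the facts relevant to a context into one description string.

    Keys missing from `facts` are skipped. An unknown context falls back
    to the minimum viable description only.
    """
    keys = MINIMUM_VIABLE + CONTEXT_EXTRAS.get(context, [])
    return " ".join(facts[k] for k in keys if k in facts)


# Made-up facts about one image, written by a human or produced by a model.
facts = {
    "people": "Two people stand at a workbench.",
    "location": "They are in a small pottery studio.",
    "activity": "One is shaping a clay bowl.",
    "object_attributes": "The bowl is glazed blue with a wide rim.",
    "work_setting": "The room is crowded with shelves and tools.",
}

news_alt = compose_description(facts, "news")
shop_alt = compose_description(facts, "e-commerce")
print(news_alt)
print(shop_alt)
```

The same image yields two different strings: both start with the shared baseline, but the news version ends with the work setting while the e-commerce version ends with the attributes of the object for sale.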
Tags: image description · alternative text · blind · low vision · context-aware · image accessibility · computer vision · image captioning
Standards referenced: WCAG