Context-Aware Image Descriptions for Web Accessibility
Ananya Gubbi Mohanbabu, Amy Pavel · 2024 · Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '24) · doi:10.1145/3663548.3675658
Summary
This paper addresses a fundamental limitation of current AI-generated image descriptions: they describe images in isolation, without considering the surrounding webpage. When blind and low-vision (BLV) users encounter images on the web, what they need to know about an image depends heavily on where it appears. The same photograph of a person might need a name and role on a company about page, a description of clothing on a fashion site, or a caption about the event in a news article. Existing tools built on vision-language models such as GPT-4V generate generic descriptions that may include irrelevant details while missing contextually important information. The researchers designed a Chrome extension that automatically extracts webpage context, including the page title, surrounding text, heading hierarchy, link text, and the image's structural role (hero image, product photo, article illustration, etc.), and uses this context to inform GPT-4V-generated descriptions. The system constructs a prompt that gives the model both the image and its webpage context and asks for a description tailored to what a user would need to know in that context. The researchers evaluated the system with 12 BLV participants, who compared context-free descriptions (image only) with context-aware descriptions (image plus webpage context) across five real-world webpage categories: news, shopping, social media, personal blogs, and professional/organisational sites.
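The extract-then-prompt flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `PageContext` fields, the `extract_context` and `build_prompt` helpers, and the prompt wording are all assumptions made for the example. It also simplifies "surrounding text" to the first few hundred characters of page text rather than text near the image, and it leaves out the actual request that would send the image and prompt to a model.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class PageContext:
    """Webpage context for one image (field names are hypothetical)."""
    title: str = ""
    headings: list[str] = field(default_factory=list)
    link_text: list[str] = field(default_factory=list)
    surrounding_text: str = ""
    image_role: str = "unknown"  # e.g. "hero image", "product photo"


class ContextExtractor(HTMLParser):
    """Collects title, headings, link text, and remaining text from HTML."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    TRACKED = HEADING_TAGS | {"title", "a"}

    def __init__(self) -> None:
        super().__init__()
        self.context = PageContext()
        self.body_text: list[str] = []
        self._open: list[str] = []  # stack of currently open tracked tags

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text:
            return
        current = self._open[-1] if self._open else None
        if current == "title":
            self.context.title = text
        elif current in self.HEADING_TAGS:
            self.context.headings.append(text)
        elif current == "a":
            self.context.link_text.append(text)
        else:
            self.body_text.append(text)


def extract_context(html: str, image_role: str = "unknown",
                    max_chars: int = 500) -> PageContext:
    """Parse an HTML string into a PageContext (simplified heuristic)."""
    parser = ContextExtractor()
    parser.feed(html)
    parser.close()
    ctx = parser.context
    ctx.surrounding_text = " ".join(parser.body_text)[:max_chars]
    ctx.image_role = image_role
    return ctx


def build_prompt(ctx: PageContext) -> str:
    """Assemble the text prompt; the wording is illustrative only."""
    return "\n".join([
        "Describe the attached image for a blind or low-vision reader.",
        "Use the page context to decide what to emphasise.",
        "Describe only what is actually visible in the image.",
        f"Page title: {ctx.title}",
        f"Headings: {' > '.join(ctx.headings)}",
        f"Link text: {'; '.join(ctx.link_text)}",
        f"Image role: {ctx.image_role}",
        f"Surrounding text: {ctx.surrounding_text}",
    ])
```

In this sketch, `build_prompt(extract_context(page_html, image_role="product photo"))` returns the text that would accompany the image in a model request. A browser extension would more likely read the live DOM than parse an HTML string.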
Key findings
BLV participants significantly preferred context-aware descriptions over context-free descriptions across all four quality measures: relevance (how well the description matched what they needed to know), plausibility (how believable the description was), quality (overall helpfulness), and imaginability (how well they could form a mental picture). In quantitative ratings on a 7-point Likert scale, context-aware descriptions scored significantly higher on all four dimensions, with the largest gap in relevance. Qualitatively, participants noted that context-free descriptions often included visual details irrelevant to the page's purpose (for example, the colour of a person's shirt in a news article about a policy announcement) while missing key contextual information such as the person's name or role. Context-aware descriptions more often identified people by name when the surrounding text provided that information, connected image content to the article's topic, and prioritised product-relevant details on shopping pages. The system also had problems. AI hallucinations occurred in both conditions, sometimes fabricating details not present in the image, and context-aware descriptions occasionally over-relied on webpage text, parroting surrounding captions rather than describing what was actually visible. All participants expressed interest in using the tool in their daily browsing, particularly for online shopping (where product images are critical but often poorly described), social media (where personal photos lack alt text), and news sites. Participants also wanted control over description length and level of detail, with some preferring concise summaries and others wanting rich detail.
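One way to expose the length and detail control participants asked for is a user-selectable detail level appended to the description prompt. The sketch below is illustrative only: the `DetailLevel` names, the instruction wording, and the `with_detail_level` helper are assumptions, not part of the system evaluated in the paper.

```python
from enum import Enum


class DetailLevel(Enum):
    """User-selectable verbosity (hypothetical setting names)."""
    CONCISE = "concise"
    STANDARD = "standard"
    DETAILED = "detailed"


# Hypothetical instruction wording; not taken from the paper.
LENGTH_INSTRUCTIONS = {
    DetailLevel.CONCISE: "Respond in one sentence covering only the most relevant content.",
    DetailLevel.STANDARD: "Respond in two or three sentences.",
    DetailLevel.DETAILED: "Respond with a full paragraph, including layout, colours, and any visible text.",
}


def with_detail_level(base_prompt: str, level: DetailLevel) -> str:
    """Append the length instruction for the chosen level to an existing prompt."""
    return f"{base_prompt.rstrip()}\n{LENGTH_INSTRUCTIONS[level]}"
```

Keeping the level as a separate, stored preference would let a user switch between a short summary and a rich description of the same image without re-extracting page context.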
Relevance
This paper directly addresses one of the most persistent accessibility problems on the web: the absence or inadequacy of image descriptions. While WCAG Success Criterion 1.1.1 (Non-text Content) requires text alternatives for non-text content, compliance rates remain low and many alt texts that do exist are generic or unhelpful. AI-generated descriptions are increasingly positioned as a solution, but this work demonstrates that context-free AI descriptions are insufficient: a description must be informed by the purpose the image serves on its specific page. For web developers and accessibility practitioners, the key takeaway is that good alt text is inherently contextual, so the same image requires different descriptions depending on where it appears. The Chrome extension architecture provides a practical model for how context can be extracted and used to improve AI descriptions. The hallucination findings are an important caution: AI-generated descriptions can be confidently wrong, and BLV users have no way to verify visual claims independently, which makes accuracy critical. Limitations include the reliance on GPT-4V (with its associated cost and latency), the controlled evaluation setting, and the relatively small participant pool, but the design principles are broadly applicable to any AI-assisted image description system.
Tags: alt text · image descriptions · blind and low vision · artificial intelligence · large language models · web accessibility · screen readers
Standards referenced: WCAG 2.1