Evaluating the Effectiveness of STEM Images Captioning

Maurizio Leotta, Marina Ribaudo · 2024 · Proceedings of the 21st International Web for All Conference (W4A '24) · doi:10.1145/3677846.3677863

Summary

This paper from the University of Genoa reports an experiment comparing human-written and AI-generated alternative text for complex STEM images: graphs, diagrams, scientific illustrations, and mathematical figures. The researchers recruited 52 undergraduate Computer Science students and split them into two groups. One group received basic training on image accessibility before the experiment, covering the POUR principles from WCAG, how to write quality STEM descriptions, and the concepts of correct/incorrect and useful/useless predicates; the other group received the same training only afterward. Students wrote descriptions for 12 images (9 STEM, 3 non-STEM) sourced from Wikimedia Commons, spanning science, technology, engineering, and mathematics categories. In parallel, the researchers generated descriptions using IDEFICS, an open-access visual language model based on DeepMind's Flamingo.

In a second phase, students evaluated two descriptions per image, one written by a peer and one generated by the AI, without knowing which was which. They rated each description on correctness, usefulness, and overall quality. The study frames alt text quality in terms of textual predicates: a description should contain statements that are both correct (accurately reflecting what is in the image) and useful (conveying the meaning or purpose of the image to someone who cannot see it). The experiment was conducted in Italian to avoid comprehension barriers.

Key findings

Human-written descriptions consistently outperformed AI-generated ones on all three measures (correctness, usefulness, and quality), with statistically significant differences (p < 0.01) and large effect sizes for STEM images. Among untrained participants, across all images, human descriptions received a mean correctness score of 4.17 versus 2.83 for AI descriptions. A key problem with the AI engine was hallucination: IDEFICS fabricated elements not present in the images, such as describing "lightning striking a volcano" in a physics trajectory diagram. STEM images were harder to describe than non-STEM images for both humans and AI, receiving lower scores on every metric regardless of who authored the description. Critically, even basic accessibility training made a measurable difference: trained students wrote better descriptions and became more critical evaluators, assigning lower scores overall because they could better identify deficiencies. This effect appeared even though the training consisted of just a short introductory lecture covering WCAG principles and examples of good STEM descriptions.

Relevance

This research has direct implications for accessibility practitioners and educators. It demonstrates that current AI visual language models are not yet reliable for generating alt text for complex STEM content — a finding particularly relevant as organisations increasingly look to automate accessibility remediation. The hallucination problem is especially concerning for screen reader users who have no way to verify AI-generated descriptions against the actual image. The study also makes a compelling case for integrating even minimal accessibility training into Computer Science curricula: a single introductory session measurably improved both the quality of descriptions students produced and their ability to evaluate others' work. For organisations producing STEM educational content, the findings reinforce that human review of image descriptions remains essential, and that investing in staff training on alt text best practices yields tangible improvements in content quality.

Tags: alt text · image accessibility · STEM accessibility · AI captioning · computer science education · visual language models

Standards referenced: WCAG 2.1