Improving Accessibility of HTML Documents by Generating Image-Tags in a Proxy

Daniel Keysers, Marius Renn, Thomas M. Breuel · 2007 · Proceedings of the 9th International ACM SIGACCESS Conference on Computers and Accessibility (Assets '07) · doi:10.1145/1296843.1296896

Summary

This paper from the German Research Center for Artificial Intelligence (DFKI) and the Technical University of Kaiserslautern presents a system that automatically generates ALT text (often called ALT tags) for images on web pages by analysing image content in a web proxy. It addresses a persistent problem: at the time of writing, only 39.6% of significant images on high-traffic websites had alternative text, even though WCAG 1.0 Guideline 1 requires text equivalents for all non-text content. The architecture is an HTTP proxy that intercepts web requests, fetches the embedded images, and sends them to an image tagger, which assigns each image a category (e.g., photo, icon, graphic) and descriptive tags based on its actual content. The approach rests on the assumption that many images in photo collections such as Flickr and in personal databases already carry human-created tags, and that these tags can be used to train the system.

The image tagger uses a modified version of the FIRE image search engine, which maintains a large database of images with known tags. For a new image, a k-nearest-neighbour search finds visually similar images using a weighted combination of Tamura texture features and RGB colour histograms, with Jensen-Shannon divergence as the distance measure. The tags and categories of the nearest matches are then combined by a voting scheme to produce the output tags.
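
The retrieval-and-voting step can be illustrated with a minimal Python sketch. It is not the FIRE implementation: the toy histograms, the equal feature weights, the plain texture histogram standing in for Tamura features, and the names `distance`, `suggest_tags`, `k`, and `min_votes` are all assumptions made for illustration.

```python
import math
from collections import Counter


def js_divergence(p, q):
    """Jensen-Shannon divergence of two normalised histograms (log base 2, range 0..1)."""
    def kl(a, b):
        # Bins with a == 0 contribute nothing; b is the mixture, so b > 0 wherever a > 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


def distance(a, b, w_colour=0.5, w_texture=0.5):
    """Weighted sum of per-feature divergences; the weights are illustrative only."""
    return (w_colour * js_divergence(a["colour"], b["colour"])
            + w_texture * js_divergence(a["texture"], b["texture"]))


def suggest_tags(query, database, k=3, min_votes=2):
    """Return the tags carried by at least min_votes of the k nearest database images."""
    nearest = sorted(database, key=lambda img: distance(query, img))[:k]
    votes = Counter(tag for img in nearest for tag in set(img["tags"]))
    return sorted(tag for tag, n in votes.items() if n >= min_votes)


# Hypothetical tagged database: each image has a colour and a texture histogram.
database = [
    {"colour": [0.1, 0.8, 0.1], "texture": [0.6, 0.4], "tags": ["photo", "horses"]},
    {"colour": [0.2, 0.7, 0.1], "texture": [0.5, 0.5], "tags": ["photo", "horses", "people"]},
    {"colour": [0.8, 0.1, 0.1], "texture": [0.1, 0.9], "tags": ["graphic", "lettering"]},
    {"colour": [0.1, 0.1, 0.8], "texture": [0.9, 0.1], "tags": ["photo", "sky"]},
]
query = {"colour": [0.15, 0.75, 0.1], "texture": [0.55, 0.45]}

print(suggest_tags(query, database))  # ['horses', 'photo']
```

Because the output is assembled only from tags already present among the nearest neighbours, a query can only ever receive labels that exist in the database, however visually close the match is.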

Key findings

The system demonstrated automatic generation of descriptive ALT text, shown as tooltips on mouse-hover, combining a category with a content description. Example outputs included "photo, lettering, sign" for a newspaper image and "photo, people, horses" for a sports photograph. The system also performs rule-based analysis to generate additional textual descriptions covering image size, dominant colours, focus, and brightness, so that screen reader users receive at least the basic properties they would otherwise miss entirely. One limitation was revealed: the system labelled an image of soccer players on green grass as "horses" because the tagged image database contained horses on grass but no soccer players, which illustrates the critical dependency on a comprehensive, well-labelled training database. The authors noted that, even without deep semantic analysis, the retrieval-based approach captures useful information about colour, location, and the presence of people, and that high-level properties such as emotion and atmosphere can be communicated through low-level features such as colour and brightness.
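
The rule-based descriptions mentioned above can be sketched as follows. The summary does not give the paper's actual rules, so the palette, the size and brightness thresholds, the use of Rec. 601 luma weights, and the names `nearest_colour` and `describe` are assumptions chosen for illustration; focus analysis is omitted.

```python
from collections import Counter

# Hypothetical coarse palette used to name the dominant colour.
PALETTE = {
    "black": (0, 0, 0),
    "white": (255, 255, 255),
    "red": (255, 0, 0),
    "green": (0, 128, 0),
    "blue": (0, 0, 255),
    "yellow": (255, 255, 0),
}


def nearest_colour(pixel):
    """Name of the palette colour closest to an (r, g, b) pixel in squared RGB distance."""
    return min(
        PALETTE,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(pixel, PALETTE[name])),
    )


def describe(pixels, width, height):
    """Build a short description from size, mean brightness, and dominant colour."""
    # Mean luma with Rec. 601 weights, on a 0..255 scale.
    luma = sum(0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels) / len(pixels)
    if luma < 85:
        brightness = "dark"
    elif luma > 170:
        brightness = "bright"
    else:
        brightness = "moderately bright"

    area = width * height
    if area < 64 * 64:
        size = "small"
    elif area > 640 * 480:
        size = "large"
    else:
        size = "medium-sized"

    dominant, _ = Counter(nearest_colour(p) for p in pixels).most_common(1)[0]
    return f"{size}, {brightness} image, mostly {dominant}"


# A 10x10 toy image: 90 grass-green pixels and 10 near-white pixels.
pixels = [(20, 120, 30)] * 90 + [(250, 250, 250)] * 10
print(describe(pixels, 10, 10))  # small, moderately bright image, mostly green
```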

Relevance

This 2007 paper is a historically significant early attempt at what would become a major area of AI-powered accessibility: automatic image description. While modern systems using deep learning and large language models now produce far more sophisticated alt text, this work established the proxy-based architecture pattern (no browser modification needed, transparent to the user) and the content-based image retrieval approach that influenced subsequent research. For practitioners, the paper highlights issues that remain relevant today: the dependency on training data quality (the "horses for soccer players" error), the distinction between image category (photo vs. icon vs. graphic) and image content description, and the value of even imperfect automatic descriptions over no description at all. The proxy approach also works without requiring website owners to take any action, which matters because much of the web remains inaccessible despite the guidelines.
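
The rewriting step of the proxy pattern can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' proxy: the function `add_alt_text` and the `tagger` callback are hypothetical, images that already have any alt attribute are left untouched, and a regular expression stands in for real HTML parsing, which a production proxy would need.

```python
import re
from html import escape

# Simplified patterns; real-world HTML needs a proper parser.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
ALT_ATTR = re.compile(r"(?<![\w-])alt\s*=", re.IGNORECASE)
SRC_ATTR = re.compile(r"""(?<![\w-])src\s*=\s*["']?([^"'\s>]+)""", re.IGNORECASE)


def add_alt_text(html, tagger):
    """Insert an alt attribute into every <img> that has a src but no alt.

    tagger is a hypothetical callback mapping an image URL to a list of tags;
    in a real proxy it would fetch the image and query the image tagger.
    """
    def fix(match):
        tag = match.group(0)
        src = SRC_ATTR.search(tag)
        if ALT_ATTR.search(tag) or src is None:
            return tag  # author-provided alt (even an empty one) is kept as-is
        tags = tagger(src.group(1))
        if not tags:
            return tag  # nothing useful to add
        alt = escape(", ".join(tags), quote=True)
        # tag[:4] is the literal "<img" in its original letter case.
        return f'{tag[:4]} alt="{alt}"{tag[4:]}'

    return IMG_TAG.sub(fix, html)


def toy_tagger(url):
    # Stand-in for the image tagger: knows one image only.
    return ["photo", "people"] if url == "a.jpg" else []


page = '<p><img src="a.jpg"><img src="b.png" alt="Logo"></p>'
print(add_alt_text(page, toy_tagger))
# <p><img alt="photo, people" src="a.jpg"><img src="b.png" alt="Logo"></p>
```

Because the rewrite happens on the response body before it reaches the browser, neither the site owner nor the user's browser needs to change, which is the property the proxy architecture relies on.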

Tags: automatic alt text · image accessibility · web accessibility · computer vision · image retrieval · web proxy · content-based image retrieval · screen readers · machine learning

Standards referenced: WCAG 1.0