Accessify: An ML Powered Application to Provide Accessible Images on Web Sites

Shivam Singh, Anurag Bhandari, Nishith Pathak · 2018 · Proceedings of the 15th International Web for All Conference (W4A 2018) · doi:10.1145/3192714.3192830

Summary

This demonstration paper presents Accessify, a browser plugin that uses machine learning to automatically generate alternative text descriptions for all images on a website, injecting them into the page’s DOM so screen readers can access them. The system addresses the persistent problem that most websites either lack image descriptions entirely or provide inadequate ones (such as file names in the alt attribute). Existing solutions either require developer or moderator action (CMS plugins for WordPress and Drupal) or require users to right-click each image individually (tools like Auto Alt Text); Accessify instead works automatically and unobtrusively once activated with a single click for a browser tab. The architecture is split between a client-side browser plugin and a server-side application. The plugin intercepts page load events via the browser’s Web Request API, sends image data to the Accessify server, and asynchronously injects generated descriptions into the page’s HTML image elements. Descriptions are cached in the browser’s local storage (persisting across sessions) and versioned, so revisiting a site is fast and descriptions improve as the model is retrained. The server uses Google’s Show and Tell model, a deep neural network combining an Inception-v3 CNN encoder (pre-trained on ImageNet) with an LSTM decoder, to generate natural language captions from images. The model is further trained using web-scraped images and their existing alt text/tags, and continually fine-tuned as new data is collected. A Node.js server with nginx handles the API, image hashing (to deduplicate images across domains), and model retraining.
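The versioned caching described above can be illustrated with a minimal sketch. The summary does not give the plugin's actual data structures, so everything here is an assumption made for illustration: the names `CachedDescription` and `DescriptionCache`, the integer version scheme, and the use of an in-memory dictionary as a stand-in for the browser's local storage.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CachedDescription:
    text: str
    version: int  # version of the model that produced this description (assumed integer)


class DescriptionCache:
    """Hypothetical stand-in for the plugin's local-storage cache."""

    def __init__(self, fetch_description: Callable[[str], CachedDescription]):
        # fetch_description stands in for a call to the Accessify server.
        self._entries: Dict[str, CachedDescription] = {}
        self._fetch = fetch_description
        self.server_calls = 0

    def get(self, image_url: str, current_version: int) -> str:
        """Return a description, contacting the server only on a miss or a stale entry."""
        entry = self._entries.get(image_url)
        if entry is not None and entry.version >= current_version:
            return entry.text  # cache hit: no server round trip
        self.server_calls += 1
        fresh = self._fetch(image_url)
        self._entries[image_url] = fresh
        return fresh.text
```

In this sketch a repeat visit is served from the cache, and a description is refetched only when the server reports a newer version than the cached one.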

Key findings

The system generates captions that describe visual content in natural language — examples shown include "a close up of a cake on a plate" and "a busy city street filled with lots of traffic" for photographs of a red velvet cake and a taxi-filled street respectively. Accessify supports multilingual output: descriptions are generated in English by the ML model and then translated to the user’s preferred language using open-source machine translation APIs, expanding accessibility to non-English-speaking users. The system handles several practical challenges: image hashing ensures that the same image used across multiple websites generates only one caption (saving processing time); local caching with version comparison reduces server calls on repeat visits; asynchronous processing prevents the plugin from blocking page rendering; and the RESTful API design (GetImageData, GetVersionInformation, PostImageData endpoints) enables scalability. The server architecture uses an in-memory database and non-blocking I/O for performance. New images encountered during browsing are fed back into the training pipeline, meaning the model continuously improves from real-world web images. The system works on any website regardless of the underlying technology (static or dynamic) because it operates on the rendered DOM rather than requiring server-side integration.
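The image-hashing step can be sketched as a content-addressed caption store. This is only an illustration under stated assumptions: the summary does not say which hash function Accessify uses, so SHA-256 over the raw image bytes is an assumption, and `CaptionStore` and `caption_model` are hypothetical names. Note that an exact-byte hash like this would not match resized or re-encoded copies of the same picture.

```python
import hashlib
from typing import Callable, Dict


class CaptionStore:
    """Hypothetical server-side store keyed by image content, not by URL."""

    def __init__(self, caption_model: Callable[[bytes], str]):
        # caption_model stands in for the captioning network.
        self._captions: Dict[str, str] = {}
        self._model = caption_model
        self.model_calls = 0

    @staticmethod
    def image_key(image_bytes: bytes) -> str:
        # Assumption: SHA-256 of the raw bytes serves as the image identity.
        return hashlib.sha256(image_bytes).hexdigest()

    def caption_for(self, image_bytes: bytes) -> str:
        """Caption an image once; identical bytes from any domain reuse the result."""
        key = self.image_key(image_bytes)
        if key not in self._captions:
            self.model_calls += 1
            self._captions[key] = self._model(image_bytes)
        return self._captions[key]
```

Keying on content rather than URL is what lets the same image served from two different sites share one caption, so the model runs only once for it.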

Relevance

Accessify represents an early example of using deep learning to retrofit image accessibility onto existing websites without requiring action from website owners or developers: a user-side approach that circumvents the persistent failure of content creators to provide alt text. For accessibility practitioners, this raises important questions about the trade-off between imperfect automated descriptions and no descriptions at all. The Show and Tell model generates generic visual descriptions ("a cake on a plate") rather than contextual ones ("the winning entry in the 2018 baking competition"), which may be useful for understanding what an image depicts but cannot convey its communicative purpose on the page. This limitation, highlighted in the W3C’s emphasis that alt text should serve the "equivalent purpose" of images and not merely describe their visual content, applies to all automated image captioning systems. The continuous retraining from web-scraped images and existing alt text is a pragmatic approach but risks learning from the same inadequate descriptions the system aims to replace. As a demonstration paper from 2018, Accessify predates the significant advances in vision-language models (GPT-4V, Gemini) that have since dramatically improved image captioning quality, making the underlying approach even more viable today.

Tags: alternative text · image accessibility · machine learning · browser extension · computer vision · deep learning · screen readers · web accessibility · image captioning · TensorFlow

Standards referenced: WCAG