IMAGE: A Deployment Framework for Creating Multimodal Experiences of Web Graphics
Juliette Regimbal, Jeffrey R. Blum, Jeremy R. Cooperstock · 2022 · Proceedings of the 19th International Web for All Conference (W4A) · doi:10.1145/3493612.3520460
Summary
This short paper from McGill University introduces IMAGE (Internet Multimodal Access to Graphical Exploration), an open-source framework that converts web graphics into accessible multimodal renderings (audio, haptic, and text) for blind and low vision users. The authors argue that screen readers currently convey graphical content only through text such as alt-text, which is inadequate for information-dense images like charts, maps, and photographs. Earlier standalone efforts to close this gap, such as Twitter A11y, VizWiz, and VoiceOver Recognition, have either become defunct or remain proprietary, and none has achieved lasting, widespread deployment. IMAGE responds by providing a common, modular platform that researchers and developers can build on rather than starting from scratch each time.
The system runs as a Chrome browser extension that collects image data from web pages and sends it to a server composed of Docker-based microservices. The architecture follows a three-step pipeline: data collection (via the browser extension), processing (via preprocessors that perform tasks such as object detection or machine learning classification), and synthesis (via handlers that produce the final renderings in formats such as spatialized audio, text descriptions, or haptic output). An orchestrator microservice coordinates communication among all components, running preprocessors serially and handlers in parallel.
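The orchestration pattern described above (preprocessors run one after another, handlers run concurrently) can be illustrated with a minimal Python sketch. This is not IMAGE's implementation: in the real system the components are Docker containers reached over the network, whereas here plain coroutines stand in for them, and every function name and data field (`detect_objects`, `caption_objects`, `text_handler`, `audio_handler`, `objects`, `caption`) is hypothetical.

```python
import asyncio

# Hypothetical preprocessors. Each receives the original request plus the
# output of every preprocessor that ran before it.
async def detect_objects(request, preprocessed):
    await asyncio.sleep(0)  # stands in for a network call to a container
    return {"objects": [{"label": "dog", "x": 0.25, "y": 0.5}]}

async def caption_objects(request, preprocessed):
    await asyncio.sleep(0)
    labels = [o["label"] for o in preprocessed["detect_objects"]["objects"]]
    return {"caption": "Image containing: " + ", ".join(labels)}

# Hypothetical handlers. Each turns the accumulated data into one rendering.
async def text_handler(request, preprocessed):
    await asyncio.sleep(0)
    return {"type": "text", "text": preprocessed["caption_objects"]["caption"]}

async def audio_handler(request, preprocessed):
    await asyncio.sleep(0)
    count = len(preprocessed["detect_objects"]["objects"])
    return {"type": "audio", "sources": count}

async def orchestrate(request, preprocessors, handlers):
    preprocessed = {}
    # Serial stage: later preprocessors may depend on earlier results.
    for preprocessor in preprocessors:
        preprocessed[preprocessor.__name__] = await preprocessor(request, preprocessed)
    # Parallel stage: handlers are independent, so run them concurrently.
    # asyncio.gather returns results in the order the handlers were given.
    renderings = await asyncio.gather(
        *(handler(request, preprocessed) for handler in handlers)
    )
    return list(renderings)

def run_example():
    request = {"image": "placeholder", "url": "https://example.com/photo"}
    return asyncio.run(
        orchestrate(
            request,
            [detect_objects, caption_objects],
            [text_handler, audio_handler],
        )
    )
```

Running preprocessors serially lets a later stage consume an earlier stage's output, as the caption step does here with the detection step. Handlers only read the accumulated data, which is what makes running them in parallel safe.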
Key findings
IMAGE's key technical contribution is its modular, extensible architecture. Preprocessors and handlers are independent Docker containers that communicate via well-defined JSON schemas, so developers can add new components without modifying the browser extension, other handlers, or the orchestrator. A designer who wants to create a new spatialized sound rendering of photographs only needs to build a single handler container that follows the existing data format; it can then be inserted into a running server by editing a configuration file. Similarly, a machine learning researcher can swap in an improved object detection model as a preprocessor container with matching inputs and outputs, making it immediately available to end users through the browser extension. The system supports multiple renderings per graphic. For example, a photograph could simultaneously yield a text description for a screen reader, a spatialized audio rendering using higher-order ambisonics (surround sound positioned around the listener based on object locations), and navigable audio segments with offsets for each semantic section. Common tasks such as text-to-speech and ambisonic audio generation are abstracted into shared services that multiple handlers can call. Response times for the complete pipeline are generally 3 to 10 seconds.
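To make the handler idea concrete, the following sketch shows how a handler might turn object detection output into a placement plan for spatialized audio. It is an illustration under stated assumptions, not IMAGE's actual schema or API: the request layout (`preprocessors`, `object_detection`, `objects`, `x_center`), the response layout (`renderings`, `spatial_audio_plan`), and the 120 degree field of view are all hypothetical.

```python
def object_to_azimuth(x_center, field_of_view_deg=120.0):
    """Map a normalized horizontal position (0 = left edge, 1 = right edge)
    to an azimuth in degrees, where negative values are to the listener's left."""
    if not 0.0 <= x_center <= 1.0:
        raise ValueError("x_center must be between 0 and 1")
    return (x_center - 0.5) * field_of_view_deg

def handle(request):
    """Hypothetical handler: build one audio source per detected object."""
    objects = (
        request.get("preprocessors", {})
        .get("object_detection", {})
        .get("objects")
    )
    # Without the preprocessor data this handler depends on, it returns
    # no rendering instead of failing the whole request.
    if not objects:
        return {"renderings": []}
    sources = [
        {"label": obj["label"], "azimuth_deg": object_to_azimuth(obj["x_center"])}
        for obj in objects
    ]
    return {"renderings": [{"type": "spatial_audio_plan", "sources": sources}]}
```

Because the handler depends only on the shape of the data it receives, a different object detection preprocessor that emits the same fields could be substituted without changing this code, which is the property the modular design relies on.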
Relevance
IMAGE addresses a real and persistent problem in accessibility research: the gap between promising lab prototypes and tools that blind and low vision users can actually use in their daily web browsing. By providing an open-source framework with Docker-based modularity, it lowers the barrier for researchers to deploy their work and for developers to combine multiple approaches into richer multimodal experiences. The emphasis on going beyond text-only descriptions toward haptic and spatialized audio representations reflects a growing understanding that different types of graphics require different accessible representations: a chart needs different treatment than a photograph or a map. However, as the authors acknowledge, IMAGE is still in pre-release and lacks end-user evaluation, so the paper presents an architecture rather than evidence of user impact. The reliance on a Chrome extension and server infrastructure also introduces dependencies that may limit adoption, and documentation and debugging tools need improvement before the platform is practical for external contributors.
Tags: multimodal · web graphics · image accessibility · haptic feedback · sonification · open source · blindness and low vision · screen readers · assistive technology