Visual question answering

Also known as: VQA, Visual QA

A task at the intersection of computer vision and natural language processing in which a system answers natural-language questions about the content of an image or video. In accessibility contexts, VQA lets blind and visually impaired users query visual content interactively, asking specific questions such as "What color is this bottle?" or "Is there a sign on the door?", rather than relying on pre-generated descriptions. Modern VQA systems built on large multimodal models can process live video feeds, enabling real-time conversational assistance. However, current VQA systems still pose significant accessibility challenges: inaccurate spatial information, assumptions about users' visual abilities, sycophantic responses, and an inability to describe dynamic scenes unless explicitly prompted.

Category: computer vision · assistive technology · artificial intelligence · visual impairment

Related: Large multimodal model · Image description · Visual assistance technology · Screen reader

Sources

https://doi.org/10.1145/3663547.3746319