MLLM

Also known as: Multimodal LLM, Multimodal Large Language Model

A large language model extended to accept and reason over multiple input modalities (typically images and text, sometimes audio or video) while still producing natural-language output. Examples include OpenAI's GPT-4o, Anthropic's Claude, and Google's Gemini. In accessibility contexts, MLLMs are used to generate image descriptions, narrate video, answer questions about visual content (visual question answering, VQA), and, as in SceneScout, describe street-level imagery for blind and low-vision users. MLLMs can hallucinate, add plausible but unverifiable details, and reproduce ableist assumptions present in their training data, so safety-critical accessibility tools that use them need uncertainty disclosure and human verification mechanisms.
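As an illustration of what uncertainty disclosure and human verification could look like in practice, the following is a minimal sketch, not a description of SceneScout or of any vendor's API. It assumes each statement in a model-generated description already carries a confidence score between 0 and 1; how that score is obtained is left open. The names `Claim`, `present`, and `REVIEW_THRESHOLD` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical cutoff; a real tool would need to calibrate this empirically.
REVIEW_THRESHOLD = 0.7


@dataclass
class Claim:
    text: str          # one statement taken from the model's description
    confidence: float  # assumed score in [0, 1]; its source is out of scope here


def present(claims: list[Claim], threshold: float = REVIEW_THRESHOLD) -> tuple[str, list[Claim]]:
    """Return the user-facing text and the claims queued for human review."""
    lines = []
    needs_review = []
    for claim in claims:
        if claim.confidence >= threshold:
            # Confident enough to show as-is.
            lines.append(claim.text)
        else:
            # Disclose the uncertainty to the user and flag for a human check.
            lines.append(f"Possibly: {claim.text} (unverified)")
            needs_review.append(claim)
    return "\n".join(lines), needs_review
```

Called with one high-confidence and one low-confidence claim, `present` would return the first statement unchanged, the second with a "Possibly: ... (unverified)" hedge, and a review queue containing only the second claim. A threshold-based split is only one possible design; a tool might instead hedge every statement or route all output through a human reviewer.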

Category: AI · technology

Related: Hallucination · Chain-of-Thought Prompting

Sources

https://doi.org/10.1145/3772318.3790449
https://platform.openai.com/docs/models/gpt-4o