MLLM
Also known as: Multimodal LLM, Multimodal Large Language Model
A large language model extended to accept and reason over multiple input modalities (typically images and text, sometimes audio or video) while producing natural-language output. Examples include OpenAI's GPT-4o, Anthropic's Claude, and Google's Gemini. In accessibility contexts, MLLMs are used to generate image descriptions, narrate video, answer questions about visual content (visual question answering, VQA), and, as in SceneScout, describe street-level imagery for blind and low-vision users. MLLMs can hallucinate, add plausible but unverifiable details, and reproduce ableist assumptions present in their training data, so their use in safety-critical accessibility tools requires disclosing uncertainty and providing mechanisms for human verification.
Category: AI · technology
Related: Hallucination · Chain-of-Thought Prompting