Multi-Modal LLM
Also known as: Multimodal Large Language Model, MLLM, Vision-Language Model
A large language model that can process and reason over more than one type of input modality, typically text combined with images, audio, or video. In accessibility research, multi-modal LLMs such as GPT-4o, CLIP, and BLIP-2 are increasingly used to analyse screenshots of web pages or mobile GUIs for issues such as missing alternative text, unreadable dynamic content, or visual-only feedback that fails blind users. Their ability to jointly interpret images and natural-language prompts enables tasks that previously required separate computer-vision and NLP pipelines.
Related: Large Language Model · Prompt Engineering · Zero-Shot Learning