Multi-Modal LLM

Also known as: Multimodal Large Language Model, MLLM, Vision-Language Model

A large language model that can process and reason over more than one type of input modality, typically text combined with images, audio, or video. In accessibility research, multi-modal LLMs such as GPT-4o, CLIP, and BLIP-2 are increasingly used to analyse screenshots of web pages or mobile GUIs for issues such as missing alternative text, unreadable dynamic content, or visual-only feedback that fails blind users. Their ability to jointly interpret images and natural-language prompts enables tasks that previously required separate computer-vision and NLP pipelines.

Category: ai · testing

Related: Large Language Model · Prompt Engineering · Zero-Shot Learning

Sources

https://openai.com/index/hello-gpt-4o/
https://doi.org/10.1145/3793673