Large multimodal model

Also known as: LMM, Multimodal AI, Vision-language model

An artificial intelligence model capable of processing and generating content across multiple modalities, such as text, images, and audio. Examples include GPT-4V and Gemini. In accessibility applications, large multimodal models support tasks such as generating image descriptions, verifying visual edits through natural language, interpreting visual scenes for blind users, and creating alternative representations of visual content. However, they can hallucinate, producing descriptions that contain inaccurate or invented details, so applications built on them require careful design around trust and verification.

Category: Machine Learning · Assistive Technology

Related: Image description · Alt text · Screen reader

Sources

https://en.wikipedia.org/wiki/Multimodal_learning