Vision Language Model

Also known as: VLM, Vision-Language Model, Multimodal Large Language Model

A machine-learning model trained to take both images and natural-language text as input and to produce natural-language output. Modern VLMs, such as GPT-4o, Gemini, and Claude, can describe a photo, read text inside an image, answer questions about a scene, identify objects, and compare multiple images. VLMs are increasingly embedded in accessibility tools: Be My Eyes uses GPT-4o for its Be My AI feature, Microsoft's Seeing AI integrates VLM-based scene description, and research navigation robots use VLMs to turn real-time camera feeds into descriptions of the surrounding environment for blind users. Earlier computer-vision pipelines required a separate task-specific model for each job; a VLM instead offers a single general-purpose interface. That generality introduces new accessibility concerns: hallucination (plausible-sounding descriptions of things that are not in the image), latency, and cost.

Category: AI · Artificial Intelligence · Machine Learning · Computer Vision · Assistive Technology

Related: Large Language Model · Computer Vision · Image Description · Artificial Intelligence

Sources