Large Vision Model

Also known as: LVM

A large vision model (LVM) is a foundation model trained on very large image (and often video) datasets to produce general-purpose visual representations. These representations can be reused for tasks such as object detection, segmentation, captioning, or feature extraction, often without task-specific retraining. Examples include SAM (Segment Anything), CLIP, DINOv2, and the vision backbones of GPT-4V and Moondream. In accessibility, LVMs are commonly paired with large language models (LLMs) to convert camera input into structured descriptions for blind and low-vision users, or into scene understanding for navigation and physical assistance.
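As a rough illustration of reuse without retraining, the sketch below labels an image by comparing its embedding to one embedding per candidate label and picking the closest match by cosine similarity, in the spirit of CLIP-style zero-shot classification. Everything here is an assumption made for illustration: the 3-dimensional vectors are toy values standing in for the output of a frozen encoder, and the names `cosine_similarity`, `classify`, and `LABEL_EMBEDDINGS` are hypothetical rather than part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(image_embedding, label_embeddings):
    """Return the best-matching label and the similarity score for every label.

    No weights are updated: adding a new label only requires adding
    one more reference embedding to the dictionary.
    """
    scores = {
        label: cosine_similarity(image_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy reference embeddings (assumed values, not real model output).
LABEL_EMBEDDINGS = {
    "crosswalk": [0.9, 0.1, 0.0],
    "staircase": [0.1, 0.9, 0.1],
    "door": [0.0, 0.2, 0.9],
}

# Toy embedding for one camera frame (assumed value).
image_embedding = [0.8, 0.2, 0.1]

label, scores = classify(image_embedding, LABEL_EMBEDDINGS)
print(label)
```

In a real system the vectors would come from a pretrained encoder and have hundreds of dimensions. In an accessibility pipeline, the resulting label could then be handed to a language model to phrase as a description.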

Category: Artificial Intelligence · Computer Vision · AI and accessibility · Machine Learning

Related: Foundation Model · Vision-Language Model · Large Language Model

Sources

https://arxiv.org/abs/2108.07258