Contrastive Decoding

Also known as: Visual Contrastive Decoding, VCD

Contrastive decoding is a technique for reducing hallucinations in large language model and multimodal AI outputs by comparing token probability distributions across different input conditions. The core principle is that tokens genuinely grounded in the input content will change significantly in probability when the input is transformed or degraded, while hallucinated tokens driven by language priors will remain relatively stable across conditions. In the visual domain, contrastive decoding compares model outputs for original and transformed versions of an image—such as noise-injected, cropped, or retrieved reference images—to identify which generated words are genuinely supported by visual evidence versus inferred from statistical patterns. This approach is particularly relevant for assistive technology applications where BLV users rely on AI-generated descriptions and cannot independently verify their accuracy.

Category: multimodal AI · machine learning · assistive technology

Related: Visual Hallucination · Multimodal AI · Image Captioning · Large Language Model

Sources

https://doi.org/10.1145/3785360