SigLIP

Also known as: Sigmoid Loss for Language Image Pre-Training

A vision-language model that aligns images with text descriptions using a pairwise sigmoid loss instead of the softmax-based contrastive loss used by CLIP. Each image-text pair is scored independently as a match or a non-match, so the loss needs no normalization across the whole batch and trains effectively without very large batch sizes. In accessibility research, SigLIP has performed well on tasks such as matching cooking video frames to recipe steps, achieving higher baseline accuracy than CLIP in object status recognition experiments.
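The pairwise objective described above can be illustrated with a minimal sketch in plain Python. It is not the reference implementation: the function names are hypothetical, and the default `temperature` and `bias` values are assumptions chosen for illustration (in practice both are learned during training). Matching pairs (same index) get label +1 and every other pairing gets label -1.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))

def sigmoid_pairwise_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of image and text embeddings.

    image_embs[i] and text_embs[i] are assumed to be a matching pair.
    temperature and bias are illustrative defaults, not tuned values.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    batch_size = len(images)
    total = 0.0
    for i in range(batch_size):
        for j in range(batch_size):
            similarity = sum(a * b for a, b in zip(images[i], texts[j]))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0  # +1 for the true pair, -1 otherwise
            total += log_sigmoid(label * logit)
    # Each pair contributes independently; no softmax over the batch is needed.
    return -total / batch_size

# Toy usage: aligned pairs should score a lower loss than shuffled pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = sigmoid_pairwise_loss(images, aligned_texts)
shuffled_loss = sigmoid_pairwise_loss(images, shuffled_texts)
```

Because every term in the double loop depends only on one image and one text, the loss can be computed in chunks, which is one way to see why it does not depend on a batch-wide normalization.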

Category: artificial intelligence · computer vision

Related: CLIP · Vision-Language Model · Object Status Recognition

Sources

https://doi.org/10.1145/3663547.3746318