SigLIP 2 Base



SigLIP 2 extends the pretraining objective of SigLIP with prior, independently developed techniques into a unified recipe for improved semantic understanding, localization, and dense features.

Intended uses



You can use the raw model for tasks like zero-shot image classification and image-text retrieval, or as a vision encoder for VLMs (and other vision tasks).

Here is how to use this model to perform zero-shot image classification:

```python
from transformers import pipeline
from transformers.image_utils import load_image

# load pipeline
ckpt = "google/siglip2-base-patch16-256"
image_classifier = pipeline(model=ckpt, task="zero-shot-image-classification")

# load image and candidate labels
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(url)
candidate_labels = ["2 cats", "a plane", "a remote"]

# run inference
outputs = image_classifier(image, candidate_labels=candidate_labels)
print(outputs)
```
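
Under the hood, the pipeline pairs each candidate label with the image and scores every pair with the model's sigmoid head. If you want those scores directly, here is a minimal sketch using the full model (same image URL and labels as above; `padding="max_length"` mirrors how SigLIP-family text towers are trained):

```python
import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

# load model and processor (same checkpoint as above)
ckpt = "google/siglip2-base-patch16-256"
model = AutoModel.from_pretrained(ckpt).eval()
processor = AutoProcessor.from_pretrained(ckpt)

# prepare the image together with the candidate texts
image = load_image("http://images.cocodataset.org/val2017/000000039769.jpg")
candidate_labels = ["2 cats", "a plane", "a remote"]
inputs = processor(text=candidate_labels, images=image, padding="max_length", return_tensors="pt")

# each image-text pair gets an independent sigmoid probability
with torch.no_grad():
    outputs = model(**inputs)
probs = torch.sigmoid(outputs.logits_per_image)
print(probs)  # shape (1, 3): one score per candidate label
```

Note that, unlike softmax-based CLIP scoring, these probabilities are independent per pair and need not sum to 1.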


You can encode an image using the Vision Tower like so:

```python
import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

# load the model and processor
ckpt = "google/siglip2-base-patch16-256"
model = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
processor = AutoProcessor.from_pretrained(ckpt)

# load the image
image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")
inputs = processor(images=[image], return_tensors="pt").to(model.device)

# run inference
with torch.no_grad():
    image_embeddings = model.get_image_features(**inputs)

print(image_embeddings.shape)
```
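
The text tower works analogously, e.g. for building a retrieval index of caption embeddings. A minimal sketch, reusing the `model` and `processor` from the snippet above (the example captions are arbitrary placeholders):

```python
# encode candidate captions with the text tower
texts = ["a photo of a bear", "a photo of 2 cats"]
text_inputs = processor(text=texts, padding="max_length", return_tensors="pt").to(model.device)

with torch.no_grad():
    text_embeddings = model.get_text_features(**text_inputs)

print(text_embeddings.shape)  # (2, hidden_dim); normalize and compare against image embeddings for retrieval
```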


For more code examples, we refer to the SigLIP documentation.

Training procedure



SigLIP 2 adds some clever training objectives on top of SigLIP:

1. Decoder loss
2. Global-local and masked prediction loss
3. Aspect ratio and resolution adaptability
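
These objectives are added on top of SigLIP's pairwise sigmoid loss, which treats every image-text pair in a batch as an independent binary classification problem (positives on the diagonal, negatives elsewhere). For context, here is a minimal sketch of that base objective, with learnable log-temperature `t` and bias `b` as in the SigLIP paper; it does not include the SigLIP 2 additions themselves:

```python
import torch
import torch.nn.functional as F

def siglip_sigmoid_loss(img_emb, txt_emb, t, b):
    """Base SigLIP loss: binary classification over all N*N pairwise logits."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T * t.exp() + b  # (N, N) pairwise similarities
    labels = 2.0 * torch.eye(logits.size(0), device=logits.device) - 1.0  # +1 on diagonal, -1 off
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```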

Training data



SigLIP 2 is pre-trained on the WebLI dataset (Chen et al., 2023).

Compute



The model was trained on up to 2048 TPU-v5e chips.

Evaluation results



Evaluation of SigLIP 2 is shown below (taken from the paper).

[Evaluation table from the SigLIP 2 paper]

BibTeX entry and citation info



```bibtex
@misc{tschannen2025siglip2multilingualvisionlanguage,
      title={SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features},
      author={Michael Tschannen and Alexey Gritsenko and Xiao Wang and Muhammad Ferjad Naeem and Ibrahim Alabdulmohsin and Nikhil Parthasarathy and Talfan Evans and Lucas Beyer and Ye Xia and Basil Mustafa and Olivier Hénaff and Jeremiah Harmsen and Andreas Steiner and Xiaohua Zhai},
      year={2025},
      eprint={2502.14786},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.14786},
}
```
