nomic-ai

nomic-ai/colnomic-embed-multimodal-3b

No description available.

Model Documentation

ColNomic Embed Multimodal 3B: State-of-the-Art Visual Document Retrieval

colnomic-embed-multimodal-3b is a multi-vector state-of-the-art multimodal embedding model that excels at visual document retrieval tasks:

High Performance: Achieves 61.2 NDCG@5 on Vidore-v2, outperforming all other models except ColNomic Embed Multimodal 7B

Unified Text-Image Encoding: Directly encodes interleaved text and images without complex preprocessing

Advanced Architecture: 3B parameter multimodal embedding model

Open-Weights: Model weights available for research use

Performance

| Model | Avg. | ESG Restaurant Human | Econ Macro Multi. | AXA Multi. | MIT Bio | ESG Restaurant Synth. | ESG Restaurant Synth. Multi. | MIT Bio Multi. | AXA | Econ. Macro | |-------|------|----------------------|-------------------|------------|---------|----------------------|----------------------------|---------------|-----|------------| | ColNomic Embed Multimodal 7B| 62.7 | 73.9 | 54.7 | 61.3 | 66.1 | 57.3 | 56.7 | 64.2 | 68.3 | 61.6 | | ColNomic Embed Multimodal 3B | 61.2 | 65.8 | 55.4 | 61.0 | 63.5 | 56.6 | 57.2 | 62.5 | 68.8 | 60.2 | | T-Systems ColQwen2.5-3B | 59.9 | 72.1 | 51.2 | 60.0 | 65.3 | 51.7 | 53.3 | 61.7 | 69.3 | 54.8 | | Nomic Embed Multimodal 7B | 59.7 | 65.7 | 57.7 | 59.3 | 64.0 | 49.2 | 51.9 | 61.2 | 66.3 | 63.1 | | GME Qwen2 7B | 59.0 | 65.8 | 56.2 | 55.4 | 64.0 | 54.3 | 56.7 | 55.1 | 60.7 | 62.9 | | Nomic Embed Multimodal 3B | 58.8 | 59.8 | 57.5 | 58.8 | 62.5 | 49.4 | 49.4 | 58.6 | 69.6 | 63.5 | | Llama Index vdr-2b-multi-v1 | 58.4 | 63.1 | 52.8 | 61.0 | 60.6 | 50.3 | 51.2 | 56.9 | 68.8 | 61.2 | | Voyage Multimodal 3 | 55.0 | 56.1 | 55.0 | 59.5 | 56.4 | 47.2 | 46.2 | 51.5 | 64.1 | 58.8 |

Getting Started

To use colnomic-embed-multimodal-3b, please install colpali from source

bash
pip install git+https://github.com/illuin-tech/colpali.git

python
import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor

model_name = "nomic-ai/colnomic-embed-multimodal-3b"

model = ColQwen2_5.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  or "mps" if on Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()

processor = ColQwen2_5_Processor.from_pretrained(model_name)

Your inputs
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year’s financial performance?",
]

Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)

Model Architecture

Total Parameters: 3B

Training Approach: Fine-tuned from Qwen2.5-VL 3B Instruct

Architecture Type: Vision-Language Model with unified text and image input processing

Key Innovations:

Same-source sampling to create harder in-batch negatives

Multi-vector output option for enhanced performance

Integration with RAG Workflows

Nomic Embed Multimodal 3B seamlessly integrates with Retrieval Augmented Generation (RAG) workflows:

1. Direct Document Embedding: Skip OCR and complex processing by directly embedding document page images 2. Faster Processing: Eliminate preprocessing steps for quicker indexing 3. More Complete Information: Capture both textual and visual cues in a single embedding 4. Simple Implementation: Use the same API for both text and images

Recommended Use Cases

The model excels at handling real-world document retrieval scenarios that challenge traditional text-only systems:

Research Papers: Capture equations, diagrams, and tables

Technical Documentation: Encode code blocks, flowcharts, and screenshots

Product Catalogs: Represent images, specifications, and pricing tables

Financial Reports: Embed charts, graphs, and numerical data

Visually Rich Content: Where layout and visual information are important

Multilingual Documents: Where visual context provides important cues

Training Details

ColNomic Embed Multimodal 3B was developed through several key innovations:

1. Sampling From the Same Source: Forcing sampling from the same dataset source creates harder in-batch negatives, preventing the model from learning dataset artifacts.

2. Multi-Vector Configuration: Providing a multi-vector variant that achieves higher performance than the dense variant.

Limitations

Performance may vary when processing documents with unconventional layouts or unusual visual elements

While it handles multiple languages, performance is strongest on English content

Processing very large or complex documents may require dividing them into smaller chunks

Performance on documents with handwriting or heavily stylized fonts may be reduced

Join the Nomic Community

Nomic Embed Ecosystem: https://www.nomic.ai/embed

Website: https://nomic.ai

Twitter: https://twitter.com/nomic_ai

Discord: https://discord.gg/myY5YDR8z8

Citation

If you find this model useful in your research or applications, please consider citing:

bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models}, 
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449}, 
}
@misc{ma2024unifyingmultimodalretrievaldocument,
      title={Unifying Multimodal Retrieval via Document Screenshot Embedding}, 
      author={Xueguang Ma and Sheng-Chieh Lin and Minghan Li and Wenhu Chen and Jimmy Lin},
      year={2024},
      eprint={2406.11251},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2406.11251}, 
}
@misc{nomicembedmultimodal2025,
  title={Nomic Embed Multimodal: Interleaved Text, Image, and Screenshots for Visual Document Retrieval},
  author={Nomic Team},
  year={2025},
  publisher={Nomic AI},
  url={https://nomic.ai/blog/posts/nomic-embed-multimodal},
}

Files & Weights

Filename	Size	Action
adapter_model.safetensors	0.22 GB	Download