TIGER-AI-Lab/VLM2Vec

This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR 2025]

Quality score: 52 / 100 (Established)

This project helps you understand and compare content across different formats like images, videos, and complex visual documents by converting them into a unified numerical representation. You input various visual materials, and it outputs a consistent 'embedding' for each, allowing for easier analysis and search. This tool is ideal for researchers, data scientists, or analysts working with large, diverse collections of multimedia.


Use this if you need to find similarities, classify, or retrieve information across a massive collection of images, videos, and visual documents like reports or scanned forms.

Not ideal if your primary need is solely text-based analysis or if you only deal with a single, simple visual modality.
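The similarity search described above reduces to comparing embedding vectors, typically with cosine similarity. A minimal sketch of that comparison, using toy 4-dimensional vectors as hypothetical stand-ins for real VLM2Vec outputs (which are much higher-dimensional):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for model outputs; values are illustrative only.
image_emb = np.array([0.1, 0.8, 0.3, 0.2])
video_emb = np.array([0.2, 0.7, 0.4, 0.1])
doc_emb = np.array([0.9, 0.1, 0.0, 0.5])

# Semantically closer content yields a higher score.
print(cosine_similarity(image_emb, video_emb))
print(cosine_similarity(image_emb, doc_emb))
```

Because all modalities map into the same vector space, the same comparison works for image-to-video, text-to-document, or any cross-modal pair.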

multimedia-analysis information-retrieval document-intelligence data-science content-management
No package published · No dependents

Score breakdown:
- Maintenance: 10 / 25
- Adoption: 10 / 25
- Maturity: 16 / 25
- Community: 16 / 25


- Stars: 592
- Forks: 51
- Language: Python
- License: Apache-2.0
- Last pushed: Mar 09, 2026
- Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/TIGER-AI-Lab/VLM2Vec"

Open to everyone: 100 requests/day with no API key required; a free key raises the limit to 1,000/day.
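The same endpoint can be called from Python. A minimal sketch, assuming the URL layout shown in the curl example above and that the endpoint returns JSON (the response field names are not documented here, so the result is returned as a raw dict):

```python
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(ecosystem: str, owner: str, repo: str) -> str:
    """Build the quality endpoint URL; path segments follow the curl example."""
    return f"{API_BASE}/{ecosystem}/{owner}/{repo}"

def fetch_quality(ecosystem: str, owner: str, repo: str) -> dict:
    """Fetch and decode the quality report (assumes a JSON response body)."""
    with urllib.request.urlopen(quality_url(ecosystem, owner, repo), timeout=10) as resp:
        return json.load(resp)

print(quality_url("transformers", "TIGER-AI-Lab", "VLM2Vec"))
```

The `ecosystem` segment (`transformers` in the example) is taken verbatim from the curl command; check the API's own documentation for the accepted values.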