TIGER-AI-Lab/VLM2Vec
This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR 2025]
VLM2Vec converts images, videos, and complex visual documents into a unified embedding space: you input diverse visual materials, and it outputs a consistent embedding for each, enabling comparison, search, and analysis across formats. It is aimed at researchers, data scientists, and analysts working with large, heterogeneous multimedia collections.
Use this if you need to find similarities, classify, or retrieve information across a large collection of images, videos, and visual documents such as reports or scanned forms.
Not ideal if your needs are purely text-based or limited to a single, simple visual modality.
Stars: 592
Forks: 51
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 09, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/TIGER-AI-Lab/VLM2Vec"
Open to everyone: 100 requests/day with no key required; a free key raises the limit to 1,000 requests/day.
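The same endpoint can be called from code. A minimal Python sketch, assuming the URL pattern shown in the `curl` example above (`/api/v1/quality/<ecosystem>/<owner>/<repo>`); the JSON response schema is not documented here, so `fetch_quality` simply returns the parsed payload as-is:

```python
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(ecosystem: str, owner: str, repo: str) -> str:
    """Build the quality-endpoint URL for a repository.

    The path segments mirror the curl example above; `ecosystem`
    (e.g. "transformers") is assumed from that example.
    """
    return f"{API_BASE}/{ecosystem}/{owner}/{repo}"


def fetch_quality(ecosystem: str, owner: str, repo: str) -> dict:
    """Fetch the quality record and parse it as JSON.

    The response fields are not specified on this page, so the raw
    dict is returned for the caller to inspect.
    """
    with urllib.request.urlopen(quality_url(ecosystem, owner, repo)) as resp:
        return json.load(resp)


# Example: the URL for this repo's record.
url = quality_url("transformers", "TIGER-AI-Lab", "VLM2Vec")
```

Unauthenticated calls count against the 100-requests/day limit noted above.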