TIGER-AI-Lab/VLM2Vec
This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR 2025]
VLM2Vec converts images, videos, and complex visual documents into a unified embedding space: you input diverse visual materials, and it outputs a consistent embedding for each, enabling comparison, search, and analysis across formats. It is aimed at researchers, data scientists, and analysts working with large, heterogeneous multimedia collections.
Use this if you need to find similarities, classify, or retrieve information across a large collection of images, videos, and visual documents such as reports or scanned forms.
Not ideal if your needs are purely text-based or limited to a single, simple visual modality.
Stars: 592
Forks: 51
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 09, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/TIGER-AI-Lab/VLM2Vec"
Open to everyone: 100 requests/day with no key required; a free key raises the limit to 1,000 requests/day.
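The same endpoint can be called from code. A minimal Python sketch, assuming the URL pattern shown in the `curl` example above (`/api/v1/quality/<ecosystem>/<owner>/<repo>`); the JSON response schema is not documented here, so `fetch_quality` simply returns the parsed payload as-is:

```python
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(ecosystem: str, owner: str, repo: str) -> str:
    """Build the quality-endpoint URL for a repository.

    The path segments mirror the curl example above; `ecosystem`
    (e.g. "transformers") is assumed from that example.
    """
    return f"{API_BASE}/{ecosystem}/{owner}/{repo}"


def fetch_quality(ecosystem: str, owner: str, repo: str) -> dict:
    """Fetch the quality record and parse it as JSON.

    The response fields are not specified on this page, so the raw
    dict is returned for the caller to inspect.
    """
    with urllib.request.urlopen(quality_url(ecosystem, owner, repo)) as resp:
        return json.load(resp)


# Example: the URL for this repo's record.
url = quality_url("transformers", "TIGER-AI-Lab", "VLM2Vec")
```

Unauthenticated calls count against the 100-requests/day limit noted above.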