shikiw/Modality-Integration-Rate
[ICCV 2025] The official code of the paper "Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate".
This project helps AI researchers and practitioners evaluate and improve Large Vision-Language Models (LVLMs). Given a pre-trained LVLM together with text and image data, it computes a 'Modality Integration Rate' (MIR) score. This score indicates how well the model fuses visual and textual information, guiding developers in refining their model's cross-modal alignment.
111 stars. No commits in the last 6 months.
Use this if you are developing or fine-tuning Large Vision-Language Models and need a quantitative metric to assess how effectively your model integrates visual and textual information during pre-training.
Not ideal if you are an end-user simply looking to apply an existing Vision-Language Model for tasks like image captioning or visual question answering, as this is a developer tool for model analysis.
Stars: 111
Forks: 2
Language: Python
License: MIT
Category:
Last pushed: Jul 09, 2025
Commits (last 30 days): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/shikiw/Modality-Integration-Rate"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
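If you prefer to fetch the data programmatically, here is a minimal Python sketch of the same request using the requests library. The endpoint URL is taken from the curl example above; the response format is assumed to be JSON and its fields are not documented here, so the sketch simply prints the raw payload.

import requests

# Same endpoint as the curl example above
URL = (
    "https://pt-edge.onrender.com/api/v1/quality/"
    "transformers/shikiw/Modality-Integration-Rate"
)

resp = requests.get(URL, timeout=10)
resp.raise_for_status()  # fail loudly on HTTP errors (e.g. rate limiting)
data = resp.json()       # assumed JSON payload; inspect it to see the actual fields
print(data)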
Higher-rated alternatives
TinyLLaVA/TinyLLaVA_Factory
A Framework of Small-scale Large Multimodal Models
zjunlp/EasyInstruct
[ACL 2024] An Easy-to-use Instruction Processing Framework for LLMs.
rese1f/MovieChat
[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
haotian-liu/LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
NVlabs/Eagle
Eagle: Frontier Vision-Language Models with Data-Centric Strategies