MMStar-Benchmark/MMStar
[NeurIPS 2024] This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models?"
This project helps AI researchers and developers accurately assess the true capabilities of Large Vision-Language Models (LVLMs). It takes evaluation results from your LVLM with and without visual input and produces metrics that reveal how much the visual component genuinely contributes, flagging cases where performance is overestimated because the image is not actually needed. It's designed for those building and refining multimodal AI models.
204 stars. No commits in the last 6 months.
Use this if you are developing or evaluating Large Vision-Language Models and want to accurately measure how much visual information contributes to their performance, verifying that the visual input is truly indispensable for the task.
Not ideal if you are evaluating models that only process text or only process images, as it specifically focuses on the interplay between vision and language.
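To make the with/without-image comparison described above concrete, here is a minimal sketch. The metric names and formulas are assumptions loosely following the paper's multi-modal gain/leakage idea, not the repo's exact implementation; consult the repo's code for the real definitions.

```python
def visual_contribution(acc_with_image: float,
                        acc_without_image: float,
                        acc_text_only_llm: float) -> dict:
    """Summarize how much the visual input genuinely contributes.

    acc_with_image     -- LVLM accuracy when the image is provided
    acc_without_image  -- LVLM accuracy when the image is withheld
    acc_text_only_llm  -- accuracy of the underlying LLM on text alone
    """
    # Benefit attributable to actually seeing the image.
    gain = acc_with_image - acc_without_image
    # Suspicious accuracy gained without vision (e.g., leaked or text-solvable samples).
    leakage = max(0.0, acc_without_image - acc_text_only_llm)
    return {"multi_modal_gain": round(gain, 4),
            "multi_modal_leakage": round(leakage, 4)}

print(visual_contribution(0.57, 0.38, 0.32))
# {'multi_modal_gain': 0.19, 'multi_modal_leakage': 0.06}
```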
Stars: 204
Forks: 5
Language: Python
License: —
Category:
Last pushed: Sep 26, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/MMStar-Benchmark/MMStar"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
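If you prefer Python over curl, a minimal sketch against the same public endpoint (only the URL comes from the listing above; that the response is JSON is an assumption):

```python
import json
import urllib.request

# Same endpoint as the curl example above; no API key needed on the free tier.
URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/MMStar-Benchmark/MMStar"

with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)            # assumes the endpoint returns JSON

print(json.dumps(data, indent=2))     # inspect whatever fields come back
```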
Higher-rated alternatives
ExtensityAI/symbolicai
A neurosymbolic perspective on LLMs
TIGER-AI-Lab/MMLU-Pro
The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding...
deep-symbolic-mathematics/LLM-SR
[ICLR 2025 Oral] This is the official repo for the paper "LLM-SR" on Scientific Equation...
microsoft/interwhen
A framework for verifiable reasoning with language models.
zhudotexe/fanoutqa
Companion code for FanOutQA: Multi-Hop, Multi-Document Question Answering for Large Language...