RaptorMai/MLLM-CompBench
[NeurIPS'25] MLLM-CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.
This project provides a ready-made dataset of image pairs and questions for testing how well AI models can compare two images. Each sample pairs images from diverse categories such as fashion, animals, and scenes with a question about relative visual traits, states, emotions, or quantities. The results help evaluate a model's ability to identify subtle differences or similarities, which is useful for AI researchers and developers working on visual intelligence.
No commits in the last 6 months.
Use this if you are a researcher or AI developer who needs to rigorously benchmark the comparative reasoning capabilities of your multimodal AI models using a large, human-annotated dataset.
Not ideal if you are looking for a tool to train models from scratch or to perform image comparisons directly without a focus on AI model evaluation.
Stars
44
Forks
2
Language
Jupyter Notebook
License
—
Category
—
Last pushed
Apr 21, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/RaptorMai/MLLM-CompBench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
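The same endpoint can also be queried from Python. The sketch below is a minimal example using only the standard library; the assumption that the endpoint returns JSON, the PT_EDGE_API_KEY environment-variable name, and the Authorization: Bearer header scheme for keyed access are illustrative assumptions, not documented behavior of the API.

# Minimal sketch: fetch the quality data for this repo and print it.
# Assumes the endpoint returns JSON; the PT_EDGE_API_KEY variable and the
# "Authorization: Bearer <key>" header are hypothetical, not documented.
import json
import os
import urllib.request

URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/RaptorMai/MLLM-CompBench"

request = urllib.request.Request(URL)
api_key = os.environ.get("PT_EDGE_API_KEY")  # optional key for the higher rate limit
if api_key:
    request.add_header("Authorization", f"Bearer {api_key}")  # assumed header scheme

with urllib.request.urlopen(request) as response:
    data = json.loads(response.read().decode("utf-8"))

print(json.dumps(data, indent=2))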
Higher-rated alternatives
stanfordnlp/axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
aidatatools/ollama-benchmark
LLM Benchmark for Throughput via Ollama (Local LLMs)
LarHope/ollama-benchmark
Ollama-based benchmark with detailed I/O tokens-per-second reporting; Python, with a DeepSeek R1 example.
qcri/LLMeBench
Benchmarking Large Language Models
THUDM/LongBench
LongBench v2 and LongBench (ACL '25 & '24)