RaptorMai/MLLM-CompBench
[NeurIPS'25] MLLM-CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.
This project provides a ready-made dataset of image pairs and questions for testing how well AI models can compare two images. Each sample pairs images from diverse categories such as fashion, animals, and scenes with a question about relative visual traits, states, emotions, or quantities. The results help evaluate a model's ability to identify subtle differences or similarities, which is useful for AI researchers and developers working on visual intelligence.
No commits in the last 6 months.
Use this if you are a researcher or AI developer who needs to rigorously benchmark the comparative reasoning capabilities of your multimodal AI models using a large, human-annotated dataset.
Not ideal if you are looking for a tool to train models from scratch or to perform image comparisons directly without a focus on AI model evaluation.
Stars
44
Forks
2
Language
Jupyter Notebook
License
—
Category
—
Last pushed
Apr 21, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/RaptorMai/MLLM-CompBench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
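The same endpoint can also be queried from Python. The sketch below is a minimal example using only the standard library; the assumption that the endpoint returns JSON, the PT_EDGE_API_KEY environment-variable name, and the Authorization: Bearer header scheme for keyed access are illustrative assumptions, not documented behavior of the API.

# Minimal sketch: fetch the quality data for this repo and print it.
# Assumes the endpoint returns JSON; the PT_EDGE_API_KEY variable and the
# "Authorization: Bearer <key>" header are hypothetical, not documented.
import json
import os
import urllib.request

URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/RaptorMai/MLLM-CompBench"

request = urllib.request.Request(URL)
api_key = os.environ.get("PT_EDGE_API_KEY")  # optional key for the higher rate limit
if api_key:
    request.add_header("Authorization", f"Bearer {api_key}")  # assumed header scheme

with urllib.request.urlopen(request) as response:
    data = json.loads(response.read().decode("utf-8"))

print(json.dumps(data, indent=2))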
Higher-rated alternatives
stanfordnlp/axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
aidatatools/ollama-benchmark
LLM Benchmark for Throughput via Ollama (Local LLMs)
LarHope/ollama-benchmark
Ollama-based benchmark with detailed I/O tokens-per-second reporting; Python, with a DeepSeek R1 example.
qcri/LLMeBench
Benchmarking Large Language Models
THUDM/LongBench
LongBench v2 and LongBench (ACL '25 & '24)