Q-Future/Q-Bench
①[ICLR2024 Spotlight] (GPT-4V/Gemini-Pro/Qwen-VL-Plus+16 OS MLLMs) A benchmark for multi-modality LLMs (MLLMs) on low-level vision and visual quality assessment.
This project provides a standardized way to test how well multimodal large language models (MLLMs) understand and interpret visual content, focusing on 'low-level' image qualities such as brightness, blurriness, or overall visual appeal. It takes images (single or pairs) and questions about their visual characteristics as input, then scores how accurately the model answers. This is useful for researchers and developers who are building or evaluating visual AI systems and need to rigorously assess performance on fine-grained visual details.
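To make that evaluation flow concrete, here is a minimal sketch of scoring a model on multiple-choice questions of the kind described above. It is an illustration under stated assumptions, not the repository's actual harness: the record fields (image, question, choices, answer) and the ask_mllm callable are hypothetical stand-ins for whatever model and data loader you use.

# Minimal sketch of a multiple-choice visual-quality evaluation loop.
# The record fields and ask_mllm() are hypothetical stand-ins, not the
# repository's actual data schema or API.
from typing import Callable

def evaluate(records: list[dict], ask_mllm: Callable[[str, str], str]) -> float:
    """Return the model's accuracy over multiple-choice visual questions."""
    correct = 0
    for rec in records:
        # Present the question with lettered choices, e.g. "(A) bright (B) dark".
        options = " ".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(rec["choices"]))
        prompt = f"{rec['question']} Choose one: {options}"
        reply = ask_mllm(rec["image"], prompt)  # model answers from the image
        # Count a hit when the reply contains the ground-truth choice.
        if rec["answer"].lower() in reply.lower():
            correct += 1
    return correct / max(len(records), 1)

In practice you would plug your own model wrapper in as ask_mllm and load the benchmark's real question files in place of records.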
282 stars. No commits in the last 6 months.
Use this if you are a researcher or engineer working with multimodal LLMs and need a benchmark to evaluate their ability to perceive, describe, and assess image quality and other low-level visual attributes.
Not ideal if you want to apply LLMs to high-level image understanding tasks such as object recognition or scene description, where low-level visual quality and fine-grained perception are not the focus.
Stars: 282
Forks: 13
Language: Jupyter Notebook
License: —
Category: —
Last pushed: Aug 12, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/Q-Future/Q-Bench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
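For scripted access, a minimal Python equivalent of the curl call follows. Only the URL comes from this page; the X-API-Key header name is an assumption used to illustrate keyed access, so check the service's docs for the real authentication scheme.

import requests

# Fetch this repo's quality data from the endpoint shown above.
url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/Q-Future/Q-Bench"

headers = {}  # keyless access: 100 requests/day
# headers = {"X-API-Key": "YOUR_KEY"}  # hypothetical header for keyed access (1,000/day)

resp = requests.get(url, headers=headers, timeout=10)
resp.raise_for_status()  # fail loudly on HTTP errors
print(resp.json())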
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)