Q-Future/Q-Bench
①[ICLR2024 Spotlight] (GPT-4V/Gemini-Pro/Qwen-VL-Plus+16 OS MLLMs) A benchmark for multi-modality LLMs (MLLMs) on low-level vision and visual quality assessment.
This project provides a standardized way to test how well multimodal large language models (MLLMs) understand and interpret visual content, focusing on 'low-level' image qualities such as brightness, blurriness, or overall visual appeal. It takes images (single or pairs) and questions about their visual characteristics as input, then scores how accurately the model answers. This is useful for researchers and developers who are building or evaluating visual AI systems and need to rigorously assess performance on fine-grained visual details.
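To make that evaluation flow concrete, here is a minimal sketch of scoring a model on multiple-choice questions of the kind described above. It is an illustration under stated assumptions, not the repository's actual harness: the record fields (image, question, choices, answer) and the ask_mllm callable are hypothetical stand-ins for whatever model and data loader you use.

# Minimal sketch of a multiple-choice visual-quality evaluation loop.
# The record fields and ask_mllm() are hypothetical stand-ins, not the
# repository's actual data schema or API.
from typing import Callable

def evaluate(records: list[dict], ask_mllm: Callable[[str, str], str]) -> float:
    """Return the model's accuracy over multiple-choice visual questions."""
    correct = 0
    for rec in records:
        # Present the question with lettered choices, e.g. "(A) bright (B) dark".
        options = " ".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(rec["choices"]))
        prompt = f"{rec['question']} Choose one: {options}"
        reply = ask_mllm(rec["image"], prompt)  # model answers from the image
        # Count a hit when the reply contains the ground-truth choice.
        if rec["answer"].lower() in reply.lower():
            correct += 1
    return correct / max(len(records), 1)

In practice you would plug your own model wrapper in as ask_mllm and load the benchmark's real question files in place of records.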
282 stars. No commits in the last 6 months.
Use this if you are a researcher or engineer working with multimodal LLMs and need a benchmark to evaluate their ability to perceive, describe, and assess image quality and other low-level visual attributes.
Not ideal if you want to apply LLMs to high-level image understanding tasks such as object recognition or scene description, where low-level visual quality and fine-grained perception are not the focus.
Stars: 282
Forks: 13
Language: Jupyter Notebook
License: —
Category: —
Last pushed: Aug 12, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/Q-Future/Q-Bench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
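For scripted access, a minimal Python equivalent of the curl call follows. Only the URL comes from this page; the X-API-Key header name is an assumption used to illustrate keyed access, so check the service's docs for the real authentication scheme.

import requests

# Fetch this repo's quality data from the endpoint shown above.
url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/Q-Future/Q-Bench"

headers = {}  # keyless access: 100 requests/day
# headers = {"X-API-Key": "YOUR_KEY"}  # hypothetical header for keyed access (1,000/day)

resp = requests.get(url, headers=headers, timeout=10)
resp.raise_for_status()  # fail loudly on HTTP errors
print(resp.json())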
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)