zeyofu/BLINK_Benchmark
This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.org/abs/2404.12390 [ECCV 2024]
This project provides a benchmark for evaluating how well multimodal large language models (MLLMs) perform core visual perception tasks. It takes classic computer vision problems, such as relative depth estimation and forensic detection, reformats them into image-based multiple-choice questions, and measures each model's accuracy. It is aimed at researchers and developers working to improve the visual intelligence of multimodal AI models.
164 stars. No commits in the last 6 months.
Use this if you are a researcher or developer who wants to rigorously test and compare the visual perception capabilities of multimodal LLMs against human performance and other AI models.
Not ideal if you are looking for a tool to directly apply multimodal LLMs to solve real-world visual tasks, as this is an evaluation benchmark rather than an application.
Stars: 164
Forks: 8
Language: Python
License: Apache-2.0
Category:
Last pushed: Sep 27, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/zeyofu/BLINK_Benchmark"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
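The same endpoint can be called from Python using only the standard library. This is a minimal sketch assuming the endpoint returns JSON (the response schema is not documented here, so the example simply fetches and pretty-prints whatever comes back); the `quality_url` helper is illustrative, not part of the API.

```python
import json
import urllib.request

# Base path taken from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/quality/nlp"


def quality_url(owner: str, repo: str) -> str:
    """Build the per-repository quality endpoint URL."""
    return f"{BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch and decode the JSON payload (rate-limited without a key)."""
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Pretty-print the raw response for this repository.
    print(json.dumps(fetch_quality("zeyofu", "BLINK_Benchmark"), indent=2))
```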
Higher-rated alternatives
TheShadow29/awesome-grounding
awesome grounding: A curated list of research papers in visual grounding
microsoft/XPretrain
Multi-modality pre-training
TheShadow29/zsgnet-pytorch
Official implementation of ICCV19 oral paper Zero-Shot grounding of Objects from Natural...
TheShadow29/VidSitu
[CVPR21] Visual Semantic Role Labeling for Video Understanding (https://arxiv.org/abs/2104.00990)
gicheonkang/sglkt-visdial
🌈 PyTorch Implementation for EMNLP'21 Findings "Reasoning Visual Dialog with Sparse Graph...