zeyofu/BLINK_Benchmark
This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.org/abs/2404.12390 [ECCV 2024]
This project provides a benchmark for evaluating how well multimodal large language models (MLLMs) perform core visual perception tasks. It takes classic computer vision problems, such as relative depth estimation and forensic detection, reformats them into image-based multiple-choice questions, and measures each model's accuracy. It is aimed at researchers and developers working to improve the visual intelligence of multimodal AI models.
164 stars. No commits in the last 6 months.
Use this if you are a researcher or developer who wants to rigorously test and compare the visual perception capabilities of multimodal LLMs against human performance and other AI models.
Not ideal if you are looking for a tool to directly apply multimodal LLMs to solve real-world visual tasks, as this is an evaluation benchmark rather than an application.
Stars: 164
Forks: 8
Language: Python
License: Apache-2.0
Category:
Last pushed: Sep 27, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/zeyofu/BLINK_Benchmark"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
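The same endpoint can be called from Python using only the standard library. This is a minimal sketch assuming the endpoint returns JSON (the response schema is not documented here, so the example simply fetches and pretty-prints whatever comes back); the `quality_url` helper is illustrative, not part of the API.

```python
import json
import urllib.request

# Base path taken from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/quality/nlp"


def quality_url(owner: str, repo: str) -> str:
    """Build the per-repository quality endpoint URL."""
    return f"{BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch and decode the JSON payload (rate-limited without a key)."""
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Pretty-print the raw response for this repository.
    print(json.dumps(fetch_quality("zeyofu", "BLINK_Benchmark"), indent=2))
```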
Higher-rated alternatives
TheShadow29/awesome-grounding
awesome grounding: A curated list of research papers in visual grounding
microsoft/XPretrain
Multi-modality pre-training
TheShadow29/zsgnet-pytorch
Official implementation of ICCV19 oral paper Zero-Shot grounding of Objects from Natural...
TheShadow29/VidSitu
[CVPR21] Visual Semantic Role Labeling for Video Understanding (https://arxiv.org/abs/2104.00990)
gicheonkang/sglkt-visdial
🌈 PyTorch Implementation for EMNLP'21 Findings "Reasoning Visual Dialog with Sparse Graph...