PKU-YuanGroup/Video-Bench
A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models!
This project provides a systematic way to assess how well large language models (LLMs) understand and reason about video content. It takes video datasets and their associated question-and-answer pairs as input, then produces an accuracy-based evaluation of different video-based LLMs (see the sketch below). It is aimed at researchers and developers building or improving LLMs designed to interpret and make decisions based on video.
138 stars. No commits in the last 6 months.
Use this if you need to rigorously test and compare the capabilities of different video-based large language models across a range of understanding and decision-making tasks.
Not ideal if you are looking for an off-the-shelf video analysis tool for end-user applications rather than an LLM evaluation benchmark.
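A minimal sketch of the evaluation flow described above, assuming a hypothetical model.answer(video, question) interface and pre-loaded QA triples; this is illustrative only, not Video-Bench's actual API:

# Illustrative sketch only; `qa_pairs` and `model.answer` are
# hypothetical names, not Video-Bench's real interfaces.
def evaluate(model, qa_pairs):
    """Score a video LLM on (video, question, reference_answer) triples."""
    correct = 0
    for video, question, reference in qa_pairs:
        prediction = model.answer(video, question)  # hypothetical call
        if prediction.strip().lower() == reference.strip().lower():
            correct += 1
    return correct / len(qa_pairs)  # accuracy over the QA set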
Stars
138
Forks
3
Language
Python
License
—
Category
—
Last pushed
Dec 31, 2023
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/PKU-YuanGroup/Video-Bench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
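For programmatic access, here is a minimal Python sketch of the same request. The endpoint URL is taken from the curl example above; because the response schema is not documented here, the script simply pretty-prints the raw JSON rather than assuming any field names.

import json
import urllib.request

# Same endpoint as the curl example above.
URL = ("https://pt-edge.onrender.com/api/v1/quality/"
       "transformers/PKU-YuanGroup/Video-Bench")

# No API key is required for up to 100 requests/day.
with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

# The response schema is undocumented here, so just pretty-print it.
print(json.dumps(data, indent=2))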
Higher-rated alternatives
stanfordnlp/axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
aidatatools/ollama-benchmark
LLM Benchmark for Throughput via Ollama (Local LLMs)
LarHope/ollama-benchmark
Ollama-based benchmark with detailed I/O tokens-per-second reporting; Python, with a DeepSeek R1 example.
qcri/LLMeBench
Benchmarking Large Language Models
THUDM/LongBench
LongBench v2 and LongBench (ACL '25 & '24)