PKU-YuanGroup/Video-Bench
A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models!
This project provides a systematic way to assess how well large language models (LLMs) understand and reason about video content. It takes video datasets and their associated question-and-answer pairs as input, then produces an accuracy-based evaluation of different video-based LLMs (see the sketch below). It is aimed at researchers and developers building or improving LLMs designed to interpret and make decisions based on video.
138 stars. No commits in the last 6 months.
Use this if you need to rigorously test and compare the capabilities of different video-based large language models across a range of understanding and decision-making tasks.
Not ideal if you are looking for an off-the-shelf video analysis tool for end-user applications rather than an LLM evaluation benchmark.
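A minimal sketch of the evaluation flow described above, assuming a hypothetical model.answer(video, question) interface and pre-loaded QA triples; this is illustrative only, not Video-Bench's actual API:

# Illustrative sketch only; `qa_pairs` and `model.answer` are
# hypothetical names, not Video-Bench's real interfaces.
def evaluate(model, qa_pairs):
    """Score a video LLM on (video, question, reference_answer) triples."""
    correct = 0
    for video, question, reference in qa_pairs:
        prediction = model.answer(video, question)  # hypothetical call
        if prediction.strip().lower() == reference.strip().lower():
            correct += 1
    return correct / len(qa_pairs)  # accuracy over the QA set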
Stars
138
Forks
3
Language
Python
License
—
Category
—
Last pushed
Dec 31, 2023
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/PKU-YuanGroup/Video-Bench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
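For programmatic access, here is a minimal Python sketch of the same request. The endpoint URL is taken from the curl example above; because the response schema is not documented here, the script simply pretty-prints the raw JSON rather than assuming any field names.

import json
import urllib.request

# Same endpoint as the curl example above.
URL = ("https://pt-edge.onrender.com/api/v1/quality/"
       "transformers/PKU-YuanGroup/Video-Bench")

# No API key is required for up to 100 requests/day.
with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

# The response schema is undocumented here, so just pretty-print it.
print(json.dumps(data, indent=2))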
Higher-rated alternatives
stanfordnlp/axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
aidatatools/ollama-benchmark
LLM Benchmark for Throughput via Ollama (Local LLMs)
LarHope/ollama-benchmark
Ollama-based benchmark with detailed I/O tokens-per-second reporting; Python, with a DeepSeek R1 example.
qcri/LLMeBench
Benchmarking Large Language Models
THUDM/LongBench
LongBench v2 and LongBench (ACL '25 & '24)