terryyz/llm-benchmark
A list of LLM benchmark frameworks.
This is a curated list of tools for evaluating Large Language Models (LLMs). It helps AI researchers, machine learning engineers, and data scientists choose the right benchmark for assessing an LLM's capabilities. You can compare various evaluation frameworks, understand their datasets, and select the most suitable one for your specific LLM project.
No commits in the last 6 months.
Use this if you are a researcher or engineer who needs to systematically compare the performance of different Large Language Models across various tasks and datasets.
Not ideal if you are looking for a tool to develop or fine-tune LLMs, as this focuses solely on evaluating existing models.
Stars: 73
Forks: 6
Language: —
License: Apache-2.0
Category: —
Last pushed: Feb 17, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/terryyz/llm-benchmark"
Open to everyone: 100 requests/day with no key needed, or 1,000/day with a free key.
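For scripted use, here is a minimal Python sketch of the same request, assuming the endpoint returns a JSON body (the response schema is not documented on this page):

import json
import urllib.request

# Same endpoint as the curl command above; no key is required
# for up to 100 requests/day.
url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/terryyz/llm-benchmark"

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)  # assumes the response body is JSON

print(json.dumps(data, indent=2))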
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems