open-compass/opencompass

OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama 3, Mistral, InternLM2, GPT-4, Llama 2, Qwen, GLM, Claude, etc.) across 100+ datasets.

Score: 73 / 100 (Verified)

This platform helps you understand how well different large language models (LLMs) perform on various tasks. You provide specific LLMs and datasets, and it produces detailed evaluation scores and benchmarks. It's designed for researchers, developers, and anyone building LLM applications who needs to compare models and select the best one for their use case.

6,752 stars. Actively maintained with 12 commits in the last 30 days. Available on PyPI.

Use this if you need to systematically evaluate the performance of different large language models across a wide range of datasets and benchmarks to make informed decisions.

Not ideal if you're looking for a simple tool to fine-tune an LLM or just want to run a quick test on a single model without comprehensive comparison.

large-language-models ai-model-evaluation natural-language-processing model-benchmarking ai-research
Maintenance 17 / 25
Adoption 10 / 25
Maturity 25 / 25
Community 21 / 25

How are scores calculated?
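The exact scoring formula isn't documented here, but the numbers shown suggest the overall score is simply the sum of the four category scores, each out of 25 (an inference from the displayed values, not a documented rule):

```python
# Assumption: overall score = sum of the four category scores (each /25).
# Values taken from the card above; the formula itself is inferred.
scores = {"Maintenance": 17, "Adoption": 10, "Maturity": 25, "Community": 21}
total = sum(scores.values())
print(total)  # 73, matching the 73/100 shown above
```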

Stars: 6,752
Forks: 743
Language: Python
License: Apache-2.0
Last pushed: Mar 13, 2026
Commits (30d): 12
Dependencies: 49

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/open-compass/opencompass"

Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
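For programmatic use, the documented endpoint pattern can be generalized to any owner/repo pair. A minimal Python sketch, assuming the path layout matches the curl example above (the JSON response schema is not specified here, so it is decoded generically):

```python
import json
import urllib.request

# Base path taken from the documented curl example; treating the final two
# path segments as owner/repo is an assumption based on that one example.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"

def quality_url(owner: str, repo: str) -> str:
    """Build the quality-API URL for a given GitHub owner/repo pair."""
    return f"{API_BASE}/{owner}/{repo}"

def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch and decode the JSON quality record (makes a network call)."""
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)

print(quality_url("open-compass", "opencompass"))
```

Within the free tier, `fetch_quality` can be called up to 100 times per day without a key.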