open-compass/opencompass

OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama 3, Mistral, InternLM2, GPT-4, Llama 2, Qwen, GLM, Claude, etc.) across 100+ datasets.

Score: 73 / 100 (Verified)

This platform helps you understand how well different large language models (LLMs) perform on various tasks. You provide specific LLMs and datasets, and it produces detailed evaluation scores and benchmarks. It's designed for researchers, developers, and anyone building LLM applications who needs to compare models and select the best one for their use case.

6,752 stars. Actively maintained with 12 commits in the last 30 days. Available on PyPI.

Use this if you need to systematically evaluate the performance of different large language models across a wide range of datasets and benchmarks to make informed decisions.

Not ideal if you're looking for a simple tool to fine-tune an LLM or just want to run a quick test on a single model without comprehensive comparison.

large-language-models ai-model-evaluation natural-language-processing model-benchmarking ai-research
Maintenance 17 / 25
Adoption 10 / 25
Maturity 25 / 25
Community 21 / 25

How are scores calculated?
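The exact scoring formula isn't documented here, but the numbers shown suggest the overall score is simply the sum of the four category scores, each out of 25 (an inference from the displayed values, not a documented rule):

```python
# Assumption: overall score = sum of the four category scores (each /25).
# Values taken from the card above; the formula itself is inferred.
scores = {"Maintenance": 17, "Adoption": 10, "Maturity": 25, "Community": 21}
total = sum(scores.values())
print(total)  # 73, matching the 73/100 shown above
```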

Stars: 6,752
Forks: 743
Language: Python
License: Apache-2.0
Last pushed: Mar 13, 2026
Commits (30d): 12
Dependencies: 49

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/open-compass/opencompass"

Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
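For programmatic use, the documented endpoint pattern can be generalized to any owner/repo pair. A minimal Python sketch, assuming the path layout matches the curl example above (the JSON response schema is not specified here, so it is decoded generically):

```python
import json
import urllib.request

# Base path taken from the documented curl example; treating the final two
# path segments as owner/repo is an assumption based on that one example.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"

def quality_url(owner: str, repo: str) -> str:
    """Build the quality-API URL for a given GitHub owner/repo pair."""
    return f"{API_BASE}/{owner}/{repo}"

def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch and decode the JSON quality record (makes a network call)."""
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)

print(quality_url("open-compass", "opencompass"))
```

Within the free tier, `fetch_quality` can be called up to 100 times per day without a key.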