heilcheng/openevals
Benchmarking suite for open-weight language models
This tool helps AI researchers and practitioners systematically compare open-weight language models. You supply the models you want to test and the academic benchmarks to run them on (such as MMLU or HumanEval), and it outputs detailed performance scores, computational-efficiency metrics, and publication-ready reports. It's designed for anyone who needs to rigorously assess both the capabilities and the resource usage of large language models.
Use this if you need to objectively benchmark open-weight large language models on standard academic tasks and compare their performance and efficiency.
Not ideal if you want to fine-tune or develop new models; this tool is solely for evaluation and benchmarking.
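To make the workflow concrete, here is a minimal sketch of what a tool like this automates: run every (model, benchmark) pair and collect one row of metrics per pair. All names below (the evaluate stub, the model IDs) are illustrative assumptions, not openevals' actual API; see the repository for real usage.

# Illustrative sketch only: these names are NOT the openevals API.
from itertools import product

def evaluate(model: str, benchmark: str) -> dict:
    # Stub: a real runner would load the model, execute the benchmark's
    # tasks, and measure accuracy and throughput.
    return {"model": model, "benchmark": benchmark,
            "accuracy": None, "tokens_per_sec": None}

models = ["meta-llama/Llama-3.1-8B", "mistralai/Mistral-7B-v0.3"]  # assumed IDs
benchmarks = ["mmlu", "humaneval"]

for row in (evaluate(m, b) for m, b in product(models, benchmarks)):
    print(row)  # one metrics row per (model, benchmark) pair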
Stars
133
Forks
12
Language
Python
License
MIT
Category
ml-frameworks
Last pushed
Dec 29, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/heilcheng/openevals"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
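The same request from Python, using only the standard library. The URL is copied from the curl example above; nothing is assumed about the response beyond it being JSON, so the sketch simply pretty-prints whatever comes back.

import json
import urllib.request

URL = ("https://pt-edge.onrender.com/api/v1/quality/"
       "ml-frameworks/heilcheng/openevals")

with urllib.request.urlopen(URL) as resp:   # anonymous tier: 100 requests/day
    data = json.load(resp)                  # assumes a JSON payload

print(json.dumps(data, indent=2))           # inspect the returned fields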
Higher-rated alternatives
opentensor/bittensor
Internet-scale Neural Networks
trailofbits/fickling
A Python pickling decompiler and static analyzer
benchopt/benchopt
A framework for reproducible, comparable benchmarks
BiomedSciAI/fuse-med-ml
A Python framework accelerating ML-based discovery in the medical field by encouraging code...
mosaicml/streaming
A Data Streaming Library for Efficient Neural Network Training