heilcheng/openevals
Benchmarking suite for open-weight language models
This tool helps AI researchers and practitioners systematically compare open-weight language models. You supply the models you want to test and the academic benchmarks to run them on (such as MMLU or HumanEval), and it outputs detailed performance scores, computational-efficiency metrics, and publication-ready reports. It's designed for anyone who needs to rigorously assess both the capabilities and the resource usage of large language models.
Use this if you need to objectively benchmark open-weight large language models on standard academic tasks and compare their performance and efficiency.
Not ideal if you want to fine-tune or develop new models; this tool is solely for evaluation and benchmarking.
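To make the workflow concrete, here is a minimal sketch of what a tool like this automates: run every (model, benchmark) pair and collect one row of metrics per pair. All names below (the evaluate stub, the model IDs) are illustrative assumptions, not openevals' actual API; see the repository for real usage.

# Illustrative sketch only: these names are NOT the openevals API.
from itertools import product

def evaluate(model: str, benchmark: str) -> dict:
    # Stub: a real runner would load the model, execute the benchmark's
    # tasks, and measure accuracy and throughput.
    return {"model": model, "benchmark": benchmark,
            "accuracy": None, "tokens_per_sec": None}

models = ["meta-llama/Llama-3.1-8B", "mistralai/Mistral-7B-v0.3"]  # assumed IDs
benchmarks = ["mmlu", "humaneval"]

for row in (evaluate(m, b) for m, b in product(models, benchmarks)):
    print(row)  # one metrics row per (model, benchmark) pair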
Stars
133
Forks
12
Language
Python
License
MIT
Category
ml-frameworks
Last pushed
Dec 29, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/heilcheng/openevals"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
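The same request from Python, using only the standard library. The URL is copied from the curl example above; nothing is assumed about the response beyond it being JSON, so the sketch simply pretty-prints whatever comes back.

import json
import urllib.request

URL = ("https://pt-edge.onrender.com/api/v1/quality/"
       "ml-frameworks/heilcheng/openevals")

with urllib.request.urlopen(URL) as resp:   # anonymous tier: 100 requests/day
    data = json.load(resp)                  # assumes a JSON payload

print(json.dumps(data, indent=2))           # inspect the returned fields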
Higher-rated alternatives
opentensor/bittensor
Internet-scale Neural Networks
trailofbits/fickling
A Python pickling decompiler and static analyzer
benchopt/benchopt
A framework for reproducible, comparable benchmarks
BiomedSciAI/fuse-med-ml
A Python framework accelerating ML-based discovery in the medical field by encouraging code...
mosaicml/streaming
A Data Streaming Library for Efficient Neural Network Training