humanlaya/OneMillion-Bench

Official evals for $OneMillion-Bench

Score: 37 / 100 (Emerging)

This tool helps researchers and AI developers evaluate how well large language models (LLMs) perform on complex, real-world tasks across various professional fields like finance, law, healthcare, and engineering. You provide a list of models and specific questions with detailed rubrics. The system then generates responses from the chosen LLMs, grades them against your rubrics using another 'judge' model, and produces clear reports in Excel or JSON format.

Use this if you need to objectively compare the performance of multiple language models on industry-specific questions, using a standardized, rubric-based evaluation system.

Not ideal if you're looking for a simple, quick way to test a single model's general conversational ability, as this tool is designed for rigorous, detailed, comparative professional evaluations.
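To make the workflow described above concrete, here is a minimal, hypothetical sketch in Python of what a rubric-based eval input and grading pass could look like. The field names, the placeholder model names, and the keyword-matching grader are illustrative assumptions, not OneMillion-Bench's actual schema or API.

# Hypothetical sketch of the kind of input a rubric-based eval run consumes.
# Field names, model names, and the grading stub are illustrative assumptions,
# not OneMillion-Bench's actual schema or API.
import json

eval_spec = {
    "models": ["model-a", "model-b"],   # LLMs under test (placeholders)
    "judge": "judge-model",             # model that grades the responses
    "questions": [
        {
            "id": "fin-001",
            "domain": "finance",
            "prompt": "Explain how rising interest rates affect bond prices.",
            "rubric": [
                {"criterion": "states the inverse price/yield relationship", "points": 2},
                {"criterion": "mentions duration as a sensitivity measure", "points": 2},
                {"criterion": "gives a concrete numeric example", "points": 1},
            ],
        }
    ],
}

def score_response(response: str, rubric: list) -> int:
    """Placeholder grader: in the real workflow a judge LLM decides whether each
    rubric criterion is satisfied; here points are awarded only when the
    criterion text appears verbatim in the response."""
    return sum(
        item["points"]
        for item in rubric
        if item["criterion"].lower() in response.lower()
    )

if __name__ == "__main__":
    print(json.dumps(eval_spec, indent=2))

In the real tool, each question would be sent to every listed model, the judge model would score each response against the rubric, and the per-criterion results would be aggregated into the Excel or JSON report.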

Tags: AI-evaluation, language-model-benchmarking, professional-domain-AI, LLM-assessment, rubric-based-grading
No package · No dependents
Maintenance: 13 / 25
Adoption: 7 / 25
Maturity: 11 / 25
Community: 6 / 25

How are scores calculated? Per the breakdown above, the overall 37 / 100 is the sum of the four component scores, each out of 25 (13 + 7 + 11 + 6 = 37).

Stars: 32
Forks: 2
Language: Python
License: Apache-2.0
Last pushed: Mar 16, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/humanlaya/OneMillion-Bench"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
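For scripted access, the same endpoint can be fetched from Python. A minimal sketch, assuming the requests package is installed; the response JSON layout is not documented here, so the script simply pretty-prints whatever the API returns.

# Fetch the quality data for this repository and pretty-print the JSON response.
import json
import requests

URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/humanlaya/OneMillion-Bench"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))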