humanlaya/OneMillion-Bench
Official evals for OneMillion-Bench
This tool helps researchers and AI developers evaluate how well large language models (LLMs) perform on complex, real-world tasks across professional fields such as finance, law, healthcare, and engineering. You provide a list of models and a set of questions, each with a detailed rubric. The system then generates responses from the chosen LLMs, grades them against your rubrics with a separate 'judge' model, and produces clear reports in Excel or JSON format.
Use this if you need to objectively compare the performance of multiple language models on industry-specific questions, using a standardized, rubric-based evaluation system.
Not ideal if you're looking for a simple, quick way to test a single model's general conversational ability, as this tool is designed for rigorous, detailed, comparative professional evaluations.
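The evaluation loop this describes is: generate one response per (model, question) pair, then have the judge model score that response against the question's rubric, and collect the scores into a report. The sketch below illustrates that flow only; every name in it (the config fields, generate, judge) is hypothetical and is not taken from the OneMillion-Bench codebase.

# Illustrative sketch only -- field and function names below are
# hypothetical, not the repo's actual API.
import json

# An evaluation run pairs candidate models with domain questions and rubrics.
config = {
    "models": ["model-a", "model-b"],          # candidate LLMs to compare
    "judge_model": "judge-model",              # model that scores responses
    "questions": [
        {
            "domain": "finance",
            "prompt": "Explain the impact of rising interest rates on bond prices.",
            "rubric": [
                "Mentions the inverse price/yield relationship",
                "Uses duration as a sensitivity measure",
            ],
        }
    ],
}

def evaluate(config, generate, judge):
    """Generate a response per (model, question), then score it against the rubric.

    generate(model, prompt) and judge(judge_model, response, rubric) are
    placeholders for whatever client the harness actually uses.
    """
    results = []
    for model in config["models"]:
        for q in config["questions"]:
            response = generate(model, q["prompt"])
            score = judge(config["judge_model"], response, q["rubric"])
            results.append({"model": model, "domain": q["domain"], "score": score})
    return results

# Results could then be serialized to JSON (or exported to Excel):
# print(json.dumps(evaluate(config, generate, judge), indent=2))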
Stars
32
Forks
2
Language
Python
License
Apache-2.0
Category
Last pushed
Mar 16, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/humanlaya/OneMillion-Bench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
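If you prefer Python over curl, the same endpoint can be fetched with the requests library. The response schema is not documented here, so this minimal sketch simply pretty-prints whatever JSON the API returns; keyless access is limited to 100 requests per day.

# Minimal sketch, assuming only the public endpoint shown above.
import json
import requests

URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/humanlaya/OneMillion-Bench"

resp = requests.get(URL, timeout=30)   # keyless access: 100 requests/day
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))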
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)