humanlaya/OneMillion-Bench
Official evals for OneMillion-Bench
This tool helps researchers and AI developers evaluate how well large language models (LLMs) perform on complex, real-world tasks across professional fields such as finance, law, healthcare, and engineering. You provide a list of models and a set of questions, each with a detailed rubric. The system then generates responses from the chosen LLMs, grades them against your rubrics with a separate 'judge' model, and produces clear reports in Excel or JSON format.
Use this if you need to objectively compare the performance of multiple language models on industry-specific questions, using a standardized, rubric-based evaluation system.
Not ideal if you're looking for a simple, quick way to test a single model's general conversational ability, as this tool is designed for rigorous, detailed, comparative professional evaluations.
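The evaluation loop this describes is: generate one response per (model, question) pair, then have the judge model score that response against the question's rubric, and collect the scores into a report. The sketch below illustrates that flow only; every name in it (the config fields, generate, judge) is hypothetical and is not taken from the OneMillion-Bench codebase.

# Illustrative sketch only -- field and function names below are
# hypothetical, not the repo's actual API.
import json

# An evaluation run pairs candidate models with domain questions and rubrics.
config = {
    "models": ["model-a", "model-b"],          # candidate LLMs to compare
    "judge_model": "judge-model",              # model that scores responses
    "questions": [
        {
            "domain": "finance",
            "prompt": "Explain the impact of rising interest rates on bond prices.",
            "rubric": [
                "Mentions the inverse price/yield relationship",
                "Uses duration as a sensitivity measure",
            ],
        }
    ],
}

def evaluate(config, generate, judge):
    """Generate a response per (model, question), then score it against the rubric.

    generate(model, prompt) and judge(judge_model, response, rubric) are
    placeholders for whatever client the harness actually uses.
    """
    results = []
    for model in config["models"]:
        for q in config["questions"]:
            response = generate(model, q["prompt"])
            score = judge(config["judge_model"], response, q["rubric"])
            results.append({"model": model, "domain": q["domain"], "score": score})
    return results

# Results could then be serialized to JSON (or exported to Excel):
# print(json.dumps(evaluate(config, generate, judge), indent=2))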
Stars
32
Forks
2
Language
Python
License
Apache-2.0
Category
Last pushed
Mar 16, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/humanlaya/OneMillion-Bench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
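If you prefer Python over curl, the same endpoint can be fetched with the requests library. The response schema is not documented here, so this minimal sketch simply pretty-prints whatever JSON the API returns; keyless access is limited to 100 requests per day.

# Minimal sketch, assuming only the public endpoint shown above.
import json
import requests

URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/humanlaya/OneMillion-Bench"

resp = requests.get(URL, timeout=30)   # keyless access: 100 requests/day
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))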
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)