rentruewang/bocoel
Bayesian Optimization as a Coverage Tool for Evaluating LLMs. Accurate evaluation (benchmarking) that's 10 times faster with just a few lines of modular code.
bocoel helps AI researchers and machine learning engineers evaluate quickly and accurately how well large language models (LLMs) perform on a task. You provide a large evaluation dataset, and it uses Bayesian optimization to select a small, representative subset to test the LLM on, giving you fast and reliable performance estimates. It is aimed at anyone who needs to benchmark LLMs without spending excessive time or compute.
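A minimal sketch of the underlying idea, not bocoel's actual API: embed a corpus, then let Bayesian optimization probe only a handful of items instead of scoring the whole dataset. It assumes numpy and scikit-optimize are installed; the embeddings and the scoring function below are toy stand-ins.

# Sketch of Bayesian-optimization-driven evaluation; NOT bocoel's API.
import numpy as np
from skopt import gp_minimize

rng = np.random.default_rng(0)

# Toy corpus: 10,000 "questions" embedded into a 2-D space.
embeddings = rng.normal(size=(10_000, 2))

def score_llm_on(index: int) -> float:
    """Stand-in for running the LLM on one dataset item and scoring it (0..1)."""
    x, y = embeddings[index]
    return float(1.0 / (1.0 + np.exp(-(x - 0.5 * y))))  # fake accuracy surface

def objective(point):
    """Probe the corpus at a query point by evaluating the nearest real item.
    gp_minimize minimizes, so low-scoring (hard) regions attract more probes."""
    nearest = int(np.argmin(np.linalg.norm(embeddings - np.asarray(point), axis=1)))
    return score_llm_on(nearest)

bounds = [(-3.0, 3.0), (-3.0, 3.0)]  # search space = the embedding space
result = gp_minimize(objective, bounds, n_calls=30, random_state=0)

# Only 30 model calls instead of 10,000; the probed scores give a rough
# picture of where in the corpus the model is weakest.
print("lowest observed score:", result.fun)
print("mean score over probed items:", float(np.mean(result.func_vals)))

The point of the sketch is the budget: a few dozen carefully chosen probes stand in for exhaustively evaluating every item in the dataset.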
Use this if you need to benchmark the accuracy of large language models on extensive datasets but want to drastically reduce the time and cost involved in the evaluation process.
Not ideal if your evaluation needs are for small datasets or if you are not working with large language models.
Stars
289
Forks
16
Language
Python
License
BSD-3-Clause
Category
LLM tools
Last pushed
Jan 18, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/rentruewang/bocoel"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
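If you prefer Python over curl, a quick sketch of the same request is below; the JSON field names mentioned in the comment are assumptions, not documented here.

import requests

url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/rentruewang/bocoel"
resp = requests.get(url, timeout=10)
resp.raise_for_status()

data = resp.json()
print(data)  # inspect the payload; keys such as "stars" or "license" are assumptions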
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)