rentruewang/bocoel
Bayesian Optimization as a Coverage Tool for Evaluating LLMs. Accurate evaluation (benchmarking) that's 10 times faster with just a few lines of modular code.
bocoel helps AI researchers and machine learning engineers evaluate quickly and accurately how well large language models (LLMs) perform on a task. You provide a large evaluation dataset, and it uses Bayesian optimization to select a small, representative subset to test the LLM on, giving you fast and reliable performance estimates. It is aimed at anyone who needs to benchmark LLMs without spending excessive time or compute.
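A minimal sketch of the underlying idea, not bocoel's actual API: embed a corpus, then let Bayesian optimization probe only a handful of items instead of scoring the whole dataset. It assumes numpy and scikit-optimize are installed; the embeddings and the scoring function below are toy stand-ins.

# Sketch of Bayesian-optimization-driven evaluation; NOT bocoel's API.
import numpy as np
from skopt import gp_minimize

rng = np.random.default_rng(0)

# Toy corpus: 10,000 "questions" embedded into a 2-D space.
embeddings = rng.normal(size=(10_000, 2))

def score_llm_on(index: int) -> float:
    """Stand-in for running the LLM on one dataset item and scoring it (0..1)."""
    x, y = embeddings[index]
    return float(1.0 / (1.0 + np.exp(-(x - 0.5 * y))))  # fake accuracy surface

def objective(point):
    """Probe the corpus at a query point by evaluating the nearest real item.
    gp_minimize minimizes, so low-scoring (hard) regions attract more probes."""
    nearest = int(np.argmin(np.linalg.norm(embeddings - np.asarray(point), axis=1)))
    return score_llm_on(nearest)

bounds = [(-3.0, 3.0), (-3.0, 3.0)]  # search space = the embedding space
result = gp_minimize(objective, bounds, n_calls=30, random_state=0)

# Only 30 model calls instead of 10,000; the probed scores give a rough
# picture of where in the corpus the model is weakest.
print("lowest observed score:", result.fun)
print("mean score over probed items:", float(np.mean(result.func_vals)))

The point of the sketch is the budget: a few dozen carefully chosen probes stand in for exhaustively evaluating every item in the dataset.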
Use this if you need to benchmark the accuracy of large language models on extensive datasets but want to drastically reduce the time and cost involved in the evaluation process.
Not ideal if your evaluation needs are for small datasets or if you are not working with large language models.
Stars
289
Forks
16
Language
Python
License
BSD-3-Clause
Category
LLM tools
Last pushed
Jan 18, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/rentruewang/bocoel"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
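If you prefer Python over curl, a quick sketch of the same request is below; the JSON field names mentioned in the comment are assumptions, not documented here.

import requests

url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/rentruewang/bocoel"
resp = requests.get(url, timeout=10)
resp.raise_for_status()

data = resp.json()
print(data)  # inspect the payload; keys such as "stars" or "license" are assumptions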
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)