hyeonsangjeon/gdpval-realworks

Benchmark LLMs on real professional tasks, not academic puzzles. YAML-driven experiment pipeline + live React dashboard for GDPVal Gold Subset (220 tasks across 11 industries).

36 / 100 · Emerging

This project evaluates how well large language models (LLMs) perform on real professional tasks, such as generating Excel reports, legal documents, or sales decks, rather than academic puzzles. It takes your chosen LLM configuration as input and produces a comprehensive evaluation and leaderboard on a live dashboard. It is aimed at anyone responsible for selecting or deploying LLMs for real-world business applications.

Use this if you need to rigorously benchmark different LLMs on practical, professional tasks to ensure they meet the demands of your specific industry or job function.

Not ideal if you are looking to benchmark LLMs on traditional academic metrics like coding challenges or mathematical reasoning, or if you don't need a structured experiment pipeline and live dashboard.

LLM evaluation · AI model selection · business automation · performance benchmarking · industry-specific AI
No Package · No Dependents
Maintenance 13 / 25
Adoption 5 / 25
Maturity 11 / 25
Community 7 / 25


Stars: 11
Forks: 1
Language: Python
License: MIT
Category: mlops-end-to-end
Last pushed: Mar 28, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/mlops/hyeonsangjeon/gdpval-realworks"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
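
If you prefer calling the endpoint from Python, here is a minimal sketch using the requests library. The response schema is not documented on this page, so the payload is printed as-is, and no API-key header is included because its name is not given here.

import requests

# Same endpoint as the curl example above.
url = "https://pt-edge.onrender.com/api/v1/quality/mlops/hyeonsangjeon/gdpval-realworks"

# The free tier (100 requests/day) needs no key; the header for passing a
# free/paid key is not documented here, so it is omitted from this sketch.
response = requests.get(url, timeout=10)
response.raise_for_status()

# Field names are not documented on this page, so just print the JSON payload.
print(response.json())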