hyeonsangjeon/gdpval-realworks
Benchmark LLMs on real professional tasks, not academic puzzles. YAML-driven experiment pipeline + live React dashboard for GDPVal Gold Subset (220 tasks across 11 industries).
This project evaluates how well large language models (LLMs) perform on practical professional tasks such as generating Excel reports, legal documents, or sales decks, rather than academic puzzles. It takes your chosen LLM configuration as input and produces a comprehensive evaluation and leaderboard displayed on a live dashboard. It is aimed at anyone responsible for selecting or deploying LLMs for real-world business applications.
Use this if you need to rigorously benchmark different LLMs on practical, professional tasks to ensure they meet the demands of your specific industry or job function.
Not ideal if you want to benchmark LLMs on traditional academic benchmarks such as coding challenges or mathematical reasoning, or if you don't need a structured experiment pipeline and live dashboard.
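As a rough illustration of the kind of YAML-driven experiment configuration such a pipeline consumes, the Python sketch below defines and loads a hypothetical config with PyYAML. All field names (provider, model, temperature, subset, and so on) are assumptions for illustration only and do not reflect this repository's actual schema.

# Hypothetical sketch only: field names are illustrative, not the repo's real schema.
import yaml  # pip install pyyaml

EXAMPLE_CONFIG = """
experiment:
  name: gold-subset-comparison
  models:
    - provider: anthropic
      model: claude-sonnet      # assumed fields
      temperature: 0.2
    - provider: openai
      model: gpt-4o
      temperature: 0.2
  tasks:
    subset: gdpval-gold         # 220 tasks across 11 industries
  output:
    leaderboard: true
"""

config = yaml.safe_load(EXAMPLE_CONFIG)
print(config["experiment"]["name"])
for m in config["experiment"]["models"]:
    print(m["provider"], m["model"])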
Stars
11
Forks
1
Language
Python
License
MIT
Category
MLOps
Last pushed
Mar 28, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/mlops/hyeonsangjeon/gdpval-realworks"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000 requests/day.
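If you prefer to call the endpoint from Python rather than curl, a minimal sketch is below. The response is assumed to be a JSON object; its exact fields are not documented here.

import requests  # pip install requests

URL = "https://pt-edge.onrender.com/api/v1/quality/mlops/hyeonsangjeon/gdpval-realworks"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()
data = resp.json()  # assumed JSON payload; schema not documented on this page
print(data)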
Higher-rated alternatives
mlflow/mlflow
The open source AI engineering platform. MLflow enables teams of all sizes to debug, evaluate,...
kitops-ml/kitops
An open source DevOps tool from the CNCF for packaging and versioning AI/ML models, datasets,...
aws-samples/mlops-e2e
MLOps End-to-End Example using Amazon SageMaker Pipeline, AWS CodePipeline and AWS CDK
tensorchord/envd
🏕️ Reproducible development environment for humans and agents
techiescamp/mlops-for-devops
MLOps for DevOps Engineers - A hands-on, project-based guide to Machine Learning Operations