hyeonsangjeon/gdpval-realworks
Benchmark LLMs on real professional tasks, not academic puzzles. YAML-driven experiment pipeline + live React dashboard for GDPVal Gold Subset (220 tasks across 11 industries).
This project evaluates how well large language models (LLMs) perform on practical professional tasks such as generating Excel reports, legal documents, or sales decks, rather than academic puzzles. It takes your chosen LLM configuration as input and produces a comprehensive evaluation and leaderboard displayed on a live dashboard. It is aimed at anyone responsible for selecting or deploying LLMs for real-world business applications.
Use this if you need to rigorously benchmark different LLMs on practical, professional tasks to ensure they meet the demands of your specific industry or job function.
Not ideal if you want to benchmark LLMs on traditional academic benchmarks such as coding challenges or mathematical reasoning, or if you don't need a structured experiment pipeline and live dashboard.
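As a rough illustration of the kind of YAML-driven experiment configuration such a pipeline consumes, the Python sketch below defines and loads a hypothetical config with PyYAML. All field names (provider, model, temperature, subset, and so on) are assumptions for illustration only and do not reflect this repository's actual schema.

# Hypothetical sketch only: field names are illustrative, not the repo's real schema.
import yaml  # pip install pyyaml

EXAMPLE_CONFIG = """
experiment:
  name: gold-subset-comparison
  models:
    - provider: anthropic
      model: claude-sonnet      # assumed fields
      temperature: 0.2
    - provider: openai
      model: gpt-4o
      temperature: 0.2
  tasks:
    subset: gdpval-gold         # 220 tasks across 11 industries
  output:
    leaderboard: true
"""

config = yaml.safe_load(EXAMPLE_CONFIG)
print(config["experiment"]["name"])
for m in config["experiment"]["models"]:
    print(m["provider"], m["model"])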
Stars
11
Forks
1
Language
Python
License
MIT
Category
MLOps
Last pushed
Mar 28, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/mlops/hyeonsangjeon/gdpval-realworks"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000 requests/day.
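If you prefer to call the endpoint from Python rather than curl, a minimal sketch is below. The response is assumed to be a JSON object; its exact fields are not documented here.

import requests  # pip install requests

URL = "https://pt-edge.onrender.com/api/v1/quality/mlops/hyeonsangjeon/gdpval-realworks"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()
data = resp.json()  # assumed JSON payload; schema not documented on this page
print(data)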
Higher-rated alternatives
mlflow/mlflow
The open source AI engineering platform. MLflow enables teams of all sizes to debug, evaluate,...
kitops-ml/kitops
An open source DevOps tool from the CNCF for packaging and versioning AI/ML models, datasets,...
aws-samples/mlops-e2e
MLOps End-to-End Example using Amazon SageMaker Pipeline, AWS CodePipeline and AWS CDK
tensorchord/envd
🏕️ Reproducible development environment for humans and agents
techiescamp/mlops-for-devops
MLOps for DevOps Engineers - A hands-on, project-based guide to Machine Learning Operations