SAP-samples/llm-agents-eval-tutorial

Tutorial materials for the paper "Evaluation & Benchmarking of LLM Agents: A Survey," published at the KDD 2025 conference.

Score: 38 / 100 (Emerging)

This tutorial helps data scientists and machine learning engineers systematically evaluate their large language model (LLM) agents. It provides a framework to assess agent behavior, capabilities, reliability, and safety, along with practical methods for setting up evaluations, choosing datasets, and computing metrics. If you're building or deploying LLM agents in production, this resource offers guidance on ensuring their performance and trustworthiness.

No commits in the last 6 months.

Use this if you are an applied or industry data scientist, machine learning engineer, or enterprise AI practitioner who needs a structured approach to evaluate LLM-based agents in production systems.

Not ideal if you are looking for a pre-built evaluation tool or a simple plug-and-play solution, as this focuses on providing a conceptual framework and hands-on guidance rather than a finished product.

Tags: LLM-agent-evaluation, AI-system-benchmarking, enterprise-AI-deployment, machine-learning-operations, AI-governance
Flags: Stale (6 months), No Package, No Dependents
Maintenance: 2 / 25
Adoption: 6 / 25
Maturity: 15 / 25
Community: 15 / 25


Stars: 16
Forks: 5
Language: Jupyter Notebook
License: Apache-2.0
Last pushed: Aug 05, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/SAP-samples/llm-agents-eval-tutorial"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
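For programmatic use, the same endpoint can be queried from Python. A minimal sketch, using only the standard library: the URL pattern is taken from the curl example above, but the helper names (`quality_url`, `fetch_quality`) and the assumption that the endpoint returns JSON are mine, not part of the documented API.

```python
import json
import urllib.request

# Base path taken from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"

def quality_url(owner: str, repo: str) -> str:
    """Build the quality-API URL for a given GitHub owner/repo."""
    return f"{API_BASE}/{owner}/{repo}"

def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the quality record; assumes the endpoint returns JSON.

    Requires network access and counts against the daily request quota.
    """
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)

# Example: the URL for the repository on this page.
print(quality_url("SAP-samples", "llm-agents-eval-tutorial"))
```

Without an API key this stays within the 100-requests/day anonymous limit, so cache responses rather than re-fetching on every call.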