SAP-samples/llm-agents-eval-tutorial
Tutorial materials for the paper "Evaluation & Benchmarking of LLM Agents: A Survey," published at the KDD 2025 conference.
This tutorial helps data scientists and machine learning engineers systematically evaluate their large language model (LLM) agents. It provides a framework to assess agent behavior, capabilities, reliability, and safety, along with practical methods for setting up evaluations, choosing datasets, and computing metrics. If you're building or deploying LLM agents in production, this resource offers guidance on ensuring their performance and trustworthiness.
No commits in the last 6 months.
Use this if you are an applied or industry data scientist, machine learning engineer, or enterprise AI practitioner who needs a structured approach to evaluate LLM-based agents in production systems.
Not ideal if you are looking for a pre-built evaluation tool or a simple plug-and-play solution, as the tutorial focuses on a conceptual framework and hands-on guidance rather than a finished product.
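To illustrate the kind of metric computation the description above refers to, here is a minimal, hypothetical Python sketch of scoring an agent's final answers against gold labels to get a task success rate. The record structure and the exact-match criterion are illustrative assumptions and are not taken from the tutorial notebooks.

from dataclasses import dataclass

@dataclass
class TaskResult:
    # Hypothetical record of one evaluated task: the agent's answer and the reference.
    task_id: str
    agent_answer: str
    gold_answer: str

def exact_match(prediction: str, reference: str) -> bool:
    """Deliberately simple success criterion: case/whitespace-insensitive string equality."""
    return prediction.strip().lower() == reference.strip().lower()

def success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks where the agent's answer matches the gold answer."""
    if not results:
        return 0.0
    passed = sum(exact_match(r.agent_answer, r.gold_answer) for r in results)
    return passed / len(results)

# Toy usage with made-up data
results = [
    TaskResult("t1", "Paris", "paris"),
    TaskResult("t2", "42", "41"),
]
print(f"success rate: {success_rate(results):.2f}")  # prints 0.50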
Stars: 16
Forks: 5
Language: Jupyter Notebook
License: Apache-2.0
Category:
Last pushed: Aug 05, 2025
Commits (last 30 days): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/SAP-samples/llm-agents-eval-tutorial"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
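For programmatic use, here is a minimal Python sketch that calls the same endpoint as the curl command above. The response schema is not documented on this page, so the JSON is simply printed as returned.

import json
import requests

URL = ("https://pt-edge.onrender.com/api/v1/quality/"
       "llm-tools/SAP-samples/llm-agents-eval-tutorial")

resp = requests.get(URL, timeout=30)  # no API key needed within the free daily limit
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))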
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems