SAP-samples/llm-agents-eval-tutorial
Tutorial materials for the paper "Evaluation & Benchmarking of LLM Agents: A Survey," published at the KDD 2025 conference.
This tutorial helps data scientists and machine learning engineers systematically evaluate their large language model (LLM) agents. It provides a framework to assess agent behavior, capabilities, reliability, and safety, along with practical methods for setting up evaluations, choosing datasets, and computing metrics. If you're building or deploying LLM agents in production, this resource offers guidance on ensuring their performance and trustworthiness.
No commits in the last 6 months.
Use this if you are an applied or industry data scientist, machine learning engineer, or enterprise AI practitioner who needs a structured approach to evaluate LLM-based agents in production systems.
Not ideal if you are looking for a pre-built evaluation tool or a simple plug-and-play solution, as the tutorial focuses on a conceptual framework and hands-on guidance rather than a finished product.
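To illustrate the kind of metric computation the description above refers to, here is a minimal, hypothetical Python sketch of scoring an agent's final answers against gold labels to get a task success rate. The record structure and the exact-match criterion are illustrative assumptions and are not taken from the tutorial notebooks.

from dataclasses import dataclass

@dataclass
class TaskResult:
    # Hypothetical record of one evaluated task: the agent's answer and the reference.
    task_id: str
    agent_answer: str
    gold_answer: str

def exact_match(prediction: str, reference: str) -> bool:
    """Deliberately simple success criterion: case/whitespace-insensitive string equality."""
    return prediction.strip().lower() == reference.strip().lower()

def success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks where the agent's answer matches the gold answer."""
    if not results:
        return 0.0
    passed = sum(exact_match(r.agent_answer, r.gold_answer) for r in results)
    return passed / len(results)

# Toy usage with made-up data
results = [
    TaskResult("t1", "Paris", "paris"),
    TaskResult("t2", "42", "41"),
]
print(f"success rate: {success_rate(results):.2f}")  # prints 0.50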
Stars: 16
Forks: 5
Language: Jupyter Notebook
License: Apache-2.0
Category:
Last pushed: Aug 05, 2025
Commits (last 30 days): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/SAP-samples/llm-agents-eval-tutorial"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
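For programmatic use, here is a minimal Python sketch that calls the same endpoint as the curl command above. The response schema is not documented on this page, so the JSON is simply printed as returned.

import json
import requests

URL = ("https://pt-edge.onrender.com/api/v1/quality/"
       "llm-tools/SAP-samples/llm-agents-eval-tutorial")

resp = requests.get(URL, timeout=30)  # no API key needed within the free daily limit
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))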
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems