microsoft/LLF-Bench
A benchmark for evaluating learning agents based on just language feedback
This project provides a set of standardized interactive tasks for evaluating how well artificial intelligence agents learn from natural language feedback rather than from traditional numerical rewards or direct action demonstrations. Each task takes in an agent's actions and returns rich language descriptions of the environment along with feedback on the agent's progress; the output is a measure of how well the agent solves the task. This makes it useful for AI researchers and developers focused on building more human-like learning systems. A sketch of this interaction loop appears below.
No commits in the last 6 months.
Use this if you are developing or evaluating AI agents that need to learn complex tasks by understanding and responding to human-like linguistic guidance and explanations.
Not ideal if your AI agent learns primarily through numerical rewards or by imitating exact action sequences, with no need to process natural language feedback.
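As a rough illustration of that loop, the sketch below assumes LLF-Bench exposes Gym-style environments through an llfbench.make helper and that each observation bundles the instruction, observation text, and language feedback into a dict; the task name and dict keys are illustrative assumptions, not confirmed from the repo.

    import llfbench

    # Hypothetical task name; real LLF-Bench tasks have their own IDs.
    env = llfbench.make("llf-gridworld-v0")
    obs, info = env.reset()

    done = False
    while not done:
        # A real agent would pick its action by reading the instruction,
        # the observation text, and feedback from earlier steps; sampling
        # randomly here is just a placeholder policy.
        action = env.action_space.sample()
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        # The natural language feedback the benchmark centers on
        # (key name assumed):
        print(obs.get("feedback"))

If the environments do follow the standard Gym API, common tooling such as wrappers and episode runners should apply unchanged.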
Stars: 95
Forks: 18
Language: Python
License: MIT
Category:
Last pushed: Jun 10, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/microsoft/LLF-Bench"
Open to everyone: 100 requests/day with no key needed; a free key raises the limit to 1,000 requests/day.
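For programmatic access, a minimal Python sketch using the requests library is shown below; the shape of the JSON response is assumed, since the API's schema is not documented here.

    import requests

    url = ("https://pt-edge.onrender.com/api/v1/quality/"
           "transformers/microsoft/LLF-Bench")
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    data = resp.json()
    # Field names depend on the API's schema, which is not given above.
    print(data)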
Higher-rated alternatives
stanfordnlp/axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
aidatatools/ollama-benchmark
LLM Benchmark for Throughput via Ollama (Local LLMs)
LarHope/ollama-benchmark
Ollama-based benchmark with detailed I/O tokens-per-second reporting; Python, with a DeepSeek R1 example.
qcri/LLMeBench
Benchmarking Large Language Models
THUDM/LongBench
LongBench v2 and LongBench (ACL 25'&24')