microsoft/LLF-Bench
A benchmark for evaluating learning agents based on just language feedback
This project provides a set of standardized interactive tasks for evaluating how well artificial intelligence agents learn from natural language feedback rather than from traditional numerical rewards or direct action demonstrations. Each task takes in an agent's actions and returns rich language descriptions of the environment along with feedback on the agent's progress; the output is a measure of how well the agent solves the task. This makes it useful for AI researchers and developers focused on building more human-like learning systems. A sketch of this interaction loop appears below.
No commits in the last 6 months.
Use this if you are developing or evaluating AI agents that need to learn complex tasks by understanding and responding to human-like linguistic guidance and explanations.
Not ideal if your AI agent learns primarily through numerical rewards or by imitating exact action sequences, with no need to process natural language feedback.
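As a rough illustration of that loop, the sketch below assumes LLF-Bench exposes Gym-style environments through an llfbench.make helper and that each observation bundles the instruction, observation text, and language feedback into a dict; the task name and dict keys are illustrative assumptions, not confirmed from the repo.

    import llfbench

    # Hypothetical task name; real LLF-Bench tasks have their own IDs.
    env = llfbench.make("llf-gridworld-v0")
    obs, info = env.reset()

    done = False
    while not done:
        # A real agent would pick its action by reading the instruction,
        # the observation text, and feedback from earlier steps; sampling
        # randomly here is just a placeholder policy.
        action = env.action_space.sample()
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        # The natural language feedback the benchmark centers on
        # (key name assumed):
        print(obs.get("feedback"))

If the environments do follow the standard Gym API, common tooling such as wrappers and episode runners should apply unchanged.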
Stars: 95
Forks: 18
Language: Python
License: MIT
Category:
Last pushed: Jun 10, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/microsoft/LLF-Bench"
Open to everyone: 100 requests/day with no key needed; a free key raises the limit to 1,000 requests/day.
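For programmatic access, a minimal Python sketch using the requests library is shown below; the shape of the JSON response is assumed, since the API's schema is not documented here.

    import requests

    url = ("https://pt-edge.onrender.com/api/v1/quality/"
           "transformers/microsoft/LLF-Bench")
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    data = resp.json()
    # Field names depend on the API's schema, which is not given above.
    print(data)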
Higher-rated alternatives
stanfordnlp/axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
aidatatools/ollama-benchmark
LLM Benchmark for Throughput via Ollama (Local LLMs)
LarHope/ollama-benchmark
Ollama-based benchmark with detailed I/O tokens-per-second reporting; Python, with a DeepSeek R1 example.
qcri/LLMeBench
Benchmarking Large Language Models
THUDM/LongBench
LongBench v2 and LongBench (ACL 25'&24')