LLMeBench and LLF-Bench

Both projects are direct competitors, offering distinct benchmarks for evaluating large language models: LLMeBench measures general LLM performance across language tasks, while LLF-Bench specializes in agents that learn from natural-language feedback.

| | LLMeBench | LLF-Bench |
|---|---|---|
| Overall score | 47 (Emerging) | 45 (Emerging) |
| Maintenance | 2/25 | 2/25 |
| Adoption | 9/25 | 9/25 |
| Maturity | 17/25 | 16/25 |
| Community | 19/25 | 18/25 |
| Stars | 105 | 95 |
| Forks | 21 | 18 |
| Downloads | n/a | n/a |
| Commits (30d) | 0 | 0 |
| Language | Python | Python |
| License | None | MIT |
| Flags | No License, Stale 6m | Stale 6m, No Package, No Dependents |

About LLMeBench

qcri/LLMeBench

Benchmarking Large Language Models

This framework helps you objectively compare how well different large language models (LLMs) perform on specific language tasks, regardless of provider (e.g., OpenAI or Hugging Face). You provide a dataset and a task (such as sentiment analysis or question answering), and it outputs a detailed report on each model's accuracy and behavior. It's designed for AI researchers, data scientists, and language model evaluators who need to rigorously test and select the best LLM for their application.

LLM evaluation · NLP benchmarking · AI model comparison · language model testing · computational linguistics
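The core idea, evaluating several models on the same labeled dataset and reporting per-model accuracy, can be sketched in a few lines. This is an illustrative toy, not LLMeBench's actual API: the `evaluate` function, the mock models, and the tiny sentiment dataset are all hypothetical stand-ins for the framework's dataset/task/model assets.

```python
# Illustrative sketch of a benchmarking loop in the spirit of LLMeBench.
# All names here (evaluate, the mock models) are hypothetical, not the real API.

def evaluate(models, dataset):
    """Score each model on (text, label) pairs; return per-model accuracy."""
    report = {}
    for name, predict in models.items():
        correct = sum(1 for text, label in dataset if predict(text) == label)
        report[name] = correct / len(dataset)
    return report

# Toy sentiment-analysis task with two mock "models" standing in for LLM calls.
dataset = [("great movie", "pos"), ("terrible plot", "neg"), ("loved it", "pos")]
models = {
    "always_pos": lambda text: "pos",
    "keyword": lambda text: "neg" if "terrible" in text else "pos",
}
print(evaluate(models, dataset))
```

In the real framework, each mock lambda would be replaced by a call to a hosted or local LLM, and the report would include richer behavioral metrics than plain accuracy.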

About LLF-Bench

microsoft/LLF-Bench

A benchmark for evaluating learning agents based on just language feedback

This project provides a set of standardized interactive tasks designed to evaluate how well artificial intelligence agents learn from natural language feedback, rather than traditional numerical rewards or direct action demonstrations. It takes in an agent's actions and provides rich language descriptions of the environment and feedback on its progress. The output is a measure of the agent's performance in solving various tasks, making it useful for AI researchers and developers focused on building more human-like learning systems.

AI-evaluation · interactive-learning · language-understanding · agent-development · human-AI-interaction
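The interaction pattern LLF-Bench evaluates, an agent acting in an environment that replies with sentences instead of numeric rewards, can be sketched with a toy task. The `GuessNumber` environment and the binary-search agent below are hypothetical illustrations of the idea, not LLF-Bench's actual task suite or interface.

```python
# Illustrative sketch of learning from language feedback (not LLF-Bench's API).

class GuessNumber:
    """Agent must find a hidden integer; feedback is a sentence, not a reward."""
    def __init__(self, target):
        self.target = target

    def step(self, action):
        if action == self.target:
            return "Correct, task solved.", True
        hint = "higher" if action < self.target else "lower"
        return f"Wrong. Try a {hint} number.", False

def run_agent(env, low=0, high=100, max_steps=20):
    """Binary-search agent that adapts its next action to the language feedback."""
    for step in range(1, max_steps + 1):
        guess = (low + high) // 2
        feedback, done = env.step(guess)
        if done:
            return step  # number of interactions needed to solve the task
        if "higher" in feedback:
            low = guess + 1
        else:
            high = guess - 1
    return None  # failed within the step budget

print(run_agent(GuessNumber(target=37)))
```

The benchmark's score would be the analogue of the returned step count: how efficiently the agent solves tasks given only such verbal guidance.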

Scores updated daily from GitHub, PyPI, and npm data.