LLMeBench and LLF-Bench
The two projects occupy the same space but serve distinct goals: LLMeBench benchmarks general LLM performance on standard NLP tasks, while LLF-Bench evaluates agents that learn from natural-language feedback.
About LLMeBench
qcri/LLMeBench
Benchmarking Large Language Models
This framework helps you objectively compare how well different large language models (LLMs) perform on specific language tasks, regardless of provider (e.g., OpenAI or Hugging Face). You provide a dataset and a task (such as sentiment analysis or question answering), and it outputs a detailed report on each model's accuracy and behavior. It's designed for AI researchers, data scientists, and language model evaluators who need to rigorously test and select the best LLM for their application.
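The workflow described above — run every model over a labeled dataset and report per-model accuracy — can be sketched in a few lines. This is a hypothetical, self-contained illustration; `models`, `dataset`, and `evaluate` are stand-ins for the idea, not LLMeBench's actual API.

```python
def evaluate(models, dataset):
    """Score each model on (text, label) pairs; return accuracy per model."""
    report = {}
    for name, predict in models.items():
        correct = sum(1 for text, label in dataset if predict(text) == label)
        report[name] = correct / len(dataset)
    return report

# Toy sentiment-analysis task with two mock "models" (plain callables
# standing in for real LLM calls).
dataset = [("great movie", "pos"), ("terrible plot", "neg"), ("loved it", "pos")]
models = {
    "always-pos": lambda text: "pos",
    "keyword": lambda text: "neg" if "terrible" in text else "pos",
}

print(evaluate(models, dataset))  # {'always-pos': 0.6666666666666666, 'keyword': 1.0}
```

In the real framework, the callables would wrap API or local-model calls, and the report would cover richer metrics than plain accuracy.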
About LLF-Bench
microsoft/LLF-Bench
A benchmark for evaluating learning agents based on just language feedback
This project provides a set of standardized interactive tasks designed to evaluate how well artificial intelligence agents learn from natural language feedback, rather than traditional numerical rewards or direct action demonstrations. It takes in an agent's actions and provides rich language descriptions of the environment and feedback on its progress. The output is a measure of the agent's performance in solving various tasks, making it useful for AI researchers and developers focused on building more human-like learning systems.
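The interaction pattern described above — an agent acts, and the environment replies with a language description rather than a numeric reward — can be sketched with a toy task. `GuessEnv` and the loop below are illustrative stand-ins under that assumption, not LLF-Bench's actual classes or interface.

```python
class GuessEnv:
    """Toy task: guess a hidden number; feedback is text, not a scalar reward."""

    def __init__(self, target=7):
        self.target = target

    def reset(self):
        return "Guess an integer between 1 and 10."

    def step(self, action):
        # Return (language feedback, done) instead of (observation, reward, ...).
        if action == self.target:
            return "Correct, task solved.", True
        hint = "higher" if action < self.target else "lower"
        return f"{action} is wrong; try something {hint}.", False


env = GuessEnv(target=7)
feedback = env.reset()
low, high = 1, 10
done = False
for _ in range(10):
    guess = (low + high) // 2          # agent's action
    feedback, done = env.step(guess)   # environment answers in language
    if done:
        break
    # The agent must parse the feedback text to improve its next action.
    if "higher" in feedback:
        low = guess + 1
    else:
        high = guess - 1

print(guess)  # 7
```

The point of the benchmark is exactly this last step: measuring how well an agent can extract a learning signal from free-form text where a classic RL agent would read a number.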