LLMeBench and llm-optimizer-benchmark

These two tools are **complements**: LLMeBench evaluates already-trained Large Language Models (LLMs) across tasks and capabilities, while llm-optimizer-benchmark benchmarks the optimizers used during LLM *pretraining*. They address different stages of the LLM lifecycle.

| Metric | LLMeBench | llm-optimizer-benchmark |
| --- | --- | --- |
| Overall | 47 (Emerging) | — |
| Maintenance | 2/25 | 6/25 |
| Adoption | 9/25 | 8/25 |
| Maturity | 17/25 | 15/25 |
| Community | 19/25 | 9/25 |
| Stars | 105 | 56 |
| Forks | 21 | 4 |
| Downloads | — | — |
| Commits (30d) | 0 | 0 |
| Language | Python | Python |
| License | None | Apache-2.0 |

Badges: No License · Stale 6m · No Package · No Dependents

About LLMeBench

qcri/LLMeBench

Benchmarking Large Language Models

This framework helps you objectively compare how well different large language models (LLMs) perform on specific language tasks, regardless of their source (like OpenAI or HuggingFace). You provide a dataset and a task (such as sentiment analysis or question answering), and it outputs a detailed report on each model's accuracy and behavior. It's designed for AI researchers, data scientists, and language model evaluators who need to rigorously test and select the best LLM for their application.
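To make the workflow concrete, here is a minimal conceptual sketch of the kind of comparison such a framework performs — gold labels and per-model predictions in, a ranked accuracy report out. This is an illustration only, not LLMeBench's actual API; the function names and the mocked sentiment-analysis data are assumptions for the example.

```python
# Conceptual sketch of task-based LLM comparison (NOT LLMeBench's actual API):
# given gold labels and each model's predictions on a task, report per-model accuracy.

def accuracy(gold, predictions):
    """Fraction of predictions matching the gold labels."""
    correct = sum(g == p for g, p in zip(gold, predictions))
    return correct / len(gold)

def compare_models(gold, model_outputs):
    """Return a report mapping model name -> accuracy, best model first."""
    report = {name: accuracy(gold, preds) for name, preds in model_outputs.items()}
    return dict(sorted(report.items(), key=lambda kv: kv[1], reverse=True))

if __name__ == "__main__":
    # Hypothetical sentiment-analysis run with mocked model outputs.
    gold = ["pos", "neg", "neg", "pos"]
    outputs = {
        "model-a": ["pos", "neg", "pos", "pos"],  # 3 of 4 correct
        "model-b": ["pos", "neg", "neg", "pos"],  # 4 of 4 correct
    }
    print(compare_models(gold, outputs))
```

The real framework adds what this sketch omits: dataset loaders, provider-agnostic model connectors (e.g. OpenAI or HuggingFace backends), and task-specific metrics beyond plain accuracy.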

Tags: LLM evaluation, NLP benchmarking, AI model comparison, language model testing, computational linguistics

About llm-optimizer-benchmark

epfml/llm-optimizer-benchmark

Benchmarking Optimizers for LLM Pretraining

This project offers a standardized way to compare different optimization techniques used in training Large Language Models (LLMs). It takes various optimizer configurations, model sizes, and training durations as input and produces benchmark results showing which optimizer performs best under specific conditions. LLM researchers and practitioners would use this to inform their choice of optimization methods for pretraining LLMs.
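The idea of such a benchmark — run each optimizer under the same conditions and compare the resulting loss — can be sketched as follows. This is a toy illustration on a one-dimensional quadratic objective, not the project's actual harness; the optimizer implementations, hyperparameters, and step counts are assumptions for the example.

```python
# Conceptual sketch of an optimizer benchmark (not this project's actual code):
# run each optimizer on the same toy objective and compare final losses.

def sgd(grad, lr=0.1):
    """Plain gradient descent update rule."""
    def step(w, state):
        return w - lr * grad(w), state
    return step

def momentum(grad, lr=0.1, beta=0.9):
    """Gradient descent with momentum (velocity carried in `state`)."""
    def step(w, v):
        v = beta * v + grad(w)
        return w - lr * v, v
    return step

def benchmark(optimizers, loss, grad, w0=5.0, steps=200):
    """Return the final loss per optimizer after `steps` updates from w0."""
    results = {}
    for name, make in optimizers.items():
        step, w, state = make(grad), w0, 0.0
        for _ in range(steps):
            w, state = step(w, state)
        results[name] = loss(w)
    return results

if __name__ == "__main__":
    loss = lambda w: (w - 2.0) ** 2       # toy quadratic objective, minimum at w = 2
    grad = lambda w: 2.0 * (w - 2.0)
    print(benchmark({"sgd": sgd, "momentum": momentum}, loss, grad))
```

The real project replaces the toy objective with LLM pretraining runs and sweeps over optimizer configurations, model sizes, and training durations, but the benchmarking loop follows the same shape: identical conditions, one score per optimizer.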

Tags: LLM pretraining, deep learning optimization, model development, AI research, language model engineering

Scores updated daily from GitHub, PyPI, and npm data.