LLM Evaluation Benchmarking NLP Tools
Tools and frameworks for evaluating, benchmarking, and scoring large language model outputs across various dimensions (accuracy, reasoning, semantic understanding, consistency). Includes automated metrics, evaluation harnesses, and comparative testing frameworks. Does NOT include model training, fine-tuning, adaptation, or general NLP task evaluation unrelated to LLM assessment.
There are 114 LLM evaluation benchmarking tools tracked. One scores above 70 (Verified tier). The highest-rated is google/langfun at 72/100 with 900 stars. One of the top 10 is actively maintained.
Get all 114 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=llm-evaluation-benchmarking&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
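The same request can be issued from Python. This is a minimal sketch, assuming the endpoint returns a JSON body; the response schema is not described here, so the code only decodes the body and assumes no field names. `build_url` and `fetch_projects` are hypothetical helper names, not part of any published client.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="nlp", subcategory="llm-evaluation-benchmarking", limit=20):
    """Build the query URL with the same parameters as the curl example."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Fetch the URL and decode the body as JSON (assumed response format)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (performs a network request, counted against the daily quota):
#   data = fetch_projects(build_url())
```

Whether a larger `limit` returns all 114 projects in one response is not stated, so treat the value as something to verify against the API rather than a guarantee.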
| # | Tool | Score | Tier |
|---|---|---|---|
| 1 | google/langfun<br>OO for LLMs | 72 | Verified |
| 2 | tanaos/artifex<br>Small Language Model Inference, Fine-Tuning and Observability. No GPU, no... | | Established |
| 3 | preligens-lab/textnoisr<br>Adding random noise to a text dataset, and controlling very accurately the... | | Established |
| 4 | vulnerability-lookup/VulnTrain<br>A tool to generate datasets and models based on vulnerabilities descriptions... | | Established |
| 5 | masakhane-io/masakhane-mt<br>Machine Translation for Africa | | Established |
| 6 | DataScienceUIBK/HintEval<br>HintEval: A Comprehensive Framework for Hint Generation and Evaluation for Questions | | Established |
| 7 | microsoft/LMChallenge<br>A library & tools to evaluate predictive language models. | | Established |
| 8 | Maluuba/nlg-eval<br>Evaluation code for various unsupervised automated metrics for Natural... | | Emerging |
| 9 | feralvam/easse<br>Easier Automatic Sentence Simplification Evaluation | | Emerging |
| 10 | EleanorJiang/BlonDe<br>Official implementations for (1) BlonDe: An Automatic Evaluation Metric for... | | Emerging |
| 11 | wasiahmad/PLBART<br>Official code of our work, Unified Pre-training for Program Understanding... | | Emerging |
| 12 | LIAAD/tieval<br>An Evaluation Framework for Temporal Information Extraction Systems | | Emerging |
| 13 | abasirat/llm-adapter<br>A plug-and-play adapter architecture that efficiently adapts large language... | | Emerging |
| 14 | dataset-sh/slambda<br>We turn instruction and examples into plain python function powered by LLM. | | Emerging |
| 15 | olivettigroup/materials-synthesis-generative-models<br>Public release of data and code for materials synthesis generation | | Emerging |
| 16 | gcunhase/NLPMetrics<br>Python code for various NLP metrics | | Emerging |
| 17 | Lambda-3/DiscourseSimplification<br>Extension of the SentenceSimplification project | | Emerging |
| 18 | disi-unibo-nlp/nlg-metricverse<br>[COLING22] An End-to-End Library for Evaluating Natural Language Generation | | Emerging |
| 19 | Joinn99/RocketEval-ICLR<br>[ICLR '25] RocketEval: Efficient Automated LLM Evaluation via Grading Checklist | | Emerging |
| 20 | Kyle-Ross/glyphdeck<br>The glyphdeck library is a comprehensive toolkit designed to streamline &... | | Emerging |
| 21 | microsoft/Litmus<br>AI Assistant for Building Reliable, High-performing and Fair Multilingual NLP Systems | | Emerging |
| 22 | namwonss/Math-Solver<br>Classifier for math word problems using deep learning | | Emerging |
| 23 | VityaVitalich/TaxoLLaMA<br>[ACL 2024] TaxoLLaMA: WordNet-based Model for Solving Multiple Lexical Semantic Tasks | | Emerging |
| 24 | SapienzaNLP/guardians-mt-eval<br>Official repository of the ACL 2024 paper "Guardians of the Machine... | | Emerging |
| 25 | rashad101/RoMe<br>PyTorch code for ACL 2022 paper: RoMe: A Robust Metric for Evaluating... | | Emerging |
| 26 | 4AI/langml<br>A Keras-based and TensorFlow-backend NLP Models Toolkit. | | Emerging |
| 27 | IIIIQIIII/DramaBench<br>A six-dimensional evaluation framework for drama script continuation with... | | Emerging |
| 28 | zjunlp/MemBase<br>A Comprehensive Benchmarking Framework for Long-Term Conversational Memory Layers | | Emerging |
| 29 | golsun/SpaceFusion<br>NAACL'19: "Jointly Optimizing Diversity and Relevance in Neural Response Generation" | | Emerging |
| 30 | bassrehab/spark-llm-eval<br>Spark-native LLM evaluation framework with confidence intervals,... | | Emerging |
| 31 | Sanqiang/text_simplification<br>Text Simplification Model based on Encoder-Decoder (includes Transformer and... | | Emerging |
| 32 | Living-with-machines/lwm_ARTIDIGH_2020_OCR_impact_downstream_NLP_tasks<br>Repository for code underlying the paper 'Assessing the Impact of OCR... | | Emerging |
| 33 | ksanu1998/static_analysis_codegen_llms<br>This repository contains code base for project titled Leveraging static... | | Emerging |
| 34 | psunlpgroup/ReaLMistake<br>This repository includes a benchmark and code for the paper "Evaluating LLMs... | | Emerging |
| 35 | USC-FORTIS/NLP-ADBench<br>[EMNLP Findings 2025]. NLP-ADBench is a comprehensive benchmarking tool... | | Emerging |
| 36 | davidheineman/salsa<br>Success and Failure Linguistic Simplification Annotation | | Emerging |
| 37 | OSU-NLP-Group/SELM<br>Symmetric Encryption with Language Models | | Emerging |
| 38 | language-brainscore/langbrainscore<br>[Marked for Deprecation. please visit... | | Emerging |
| 39 | explosion/prodigy-openai-recipes<br>Bootstrap annotation with zero- & few-shot learning via OpenAI GPT-3 | | Emerging |
| 40 | feralvam/metaeval-simplification<br>Meta-evaluation of automatic metrics in Text Simplification | | Emerging |
| 41 | lmvasque/ts-explore<br>Source code for Text Simplification Evaluation papers at ACL findings and... | | Emerging |
| 42 | subramanya1997/Novel-T5<br>We propose to use a mode that favors sentiment understanding and empathetic... | | Emerging |
| 43 | JonnoB/scrambledtext<br>A python library for creating synthetic corrupted OCR text using a markov process | | Emerging |
| 44 | greg2451/aggregating-text-similarity-metrics<br>This repository consists of a benchmark of various text similarity measures... | | Emerging |
| 45 | sileod/Discovery<br>Mining Discourse Markers for Unsupervised Sentence Representation Learning | | Emerging |
| 46 | liamcripwell/control_simp<br>Code and resources for controllable simplification via operation classification. | | Emerging |
| 47 | saarus72/text_normalization<br>T5-based (russian) text normalization | | Emerging |
| 48 | Lambda-3/SentenceSimplification<br>Tool to simplify english sentences into their core and context sentences | | Emerging |
| 49 | BM-K/KoMiniLM<br>Korean Light Weight Language Model | | Experimental |
| 50 | doheejin/HiPAMA<br>This repository is the implementation of the HiPAMA architecture, introduced... | | Experimental |
| 51 | 2030NLP/SpaCE2021<br>Chinese Spatial Semantic Understanding Evaluation | | Experimental |
| 52 | zy-liu/POSSCORE<br>This repo is for POSSCORE, an automatic evaluation metric for the... | | Experimental |
| 53 | roboalchemist/dynamic-baml<br>Python library for dynamic BAML schema generation and LLM structured data... | | Experimental |
| 54 | sileod/DiscSense<br>Automated Semantic Analysis of Discourse Markers | | Experimental |
| 55 | megagonlabs/holobench<br>Code for Holistic Reasoning with Long-Context LMs: A Benchmark for... | | Experimental |
| 56 | civillibertarian-stressincontinence617/llm-autoeval<br>Simplify LLM evaluation with our Colab notebook; just name your model,... | | Experimental |
| 57 | 11NOel11/ChaosBench-Logic<br>Benchmark dataset and tooling for evaluating LLM logical reasoning and... | | Experimental |
| 58 | doheejin/SB_loss_PA<br>This repository is the implementation of the paper, "Score-balanced Loss for... | | Experimental |
| 59 | lancopku/meSimp<br>Codes for "Training Simplification and Model Simplification for Deep... | | Experimental |
| 60 | doc-analysis/ReadingBank<br>ReadingBank: A Benchmark Dataset for Reading Order Detection | | Experimental |
| 61 | chrischenhub/OnlySportsLM<br>SOTA Sports-domain Language Model under Billion Parameters | | Experimental |
| 62 | miserytale/Little_Language_Model<br>LittleLM: A tiny character-level n-gram language model for local corpus... | | Experimental |
| 63 | JINO-ROHIT/tachyon<br>a LLM inference engine to run on consumer hardware | | Experimental |
| 64 | princeton-nlp/blindfold-textgame<br>[NAACL 2021] Reading and Acting while Blindfolded: The Need for Semantics in... | | Experimental |
| 65 | rafaelsandroni/gpt3-data-labeling<br>Data labeling using few shot learning GPT-3. | | Experimental |
| 66 | idramalab/quantify-llm-explanations<br>Evaluating Large Language Models for Detecting Antisemitism | | Experimental |
| 67 | yancong222/ClinicalNLP2024<br>Python code for LLMs surprisals and linear machine learning models | | Experimental |
| 68 | yancong222/LMs-discourse-connectives-Surprisals<br>On the Influence of Discourse Connectives on the Predictions of Humans and... | | Experimental |
| 69 | dsdanielpark/all-about-llm<br>dsdanielpark's curation and categorization of resources on large language... | | Experimental |
| 70 | balajeekalyan/figureout<br>FigureOut is a Python package allows developers to easily integrate LLM into... | | Experimental |
| 71 | devxiongmao/llm-scorecaster<br>LLM-Scorecaster is a Python-based system designed to evaluate and analyze... | | Experimental |
| 72 | licphel/LLMe<br>LLM trainer for personal computers. | | Experimental |
| 73 | ylkhayat/cocolex<br>[ACL 2025] Codebase for CoCoLex | | Experimental |
| 74 | Omg1221/search_evals<br>Evaluate web search APIs with our framework, testing accuracy and... | | Experimental |
| 75 | BramVanroy/mai-simplification-nl-2023<br>Sentence-Level Text Simplification for Dutch | | Experimental |
| 76 | Kaito1999-script/ULMEvalKit<br>Evaluate unified models effortlessly with ULMEvalKit, your open-source... | | Experimental |
| 77 | somsubhra04/LLM_Legal_Prompt_Generation<br>Data and codes for the EMNLP 2023 paper 'LLMs – the Good, the Bad or the... | | Experimental |
| 78 | audreycs/ImpScore<br>A repository for paper ImpScore: A Learnable Metric For Quantifying The... | | Experimental |
| 79 | soldni/tokreate<br>A minimal library to create tokens using LLMs. | | Experimental |
| 80 | alphadl/EasyBLEU<br>An effective and simple tool to calculate SacreBLEU, Token-BLEU, BLEU w/... | | Experimental |
| 81 | kaganhitit11/mergeval<br>mergeval is a unified tool that lets you merge and evaluate large language... | | Experimental |
| 82 | baojunshan/nlg-metrics<br>Natural language generation evaluation metrics | | Experimental |
| 83 | YecanLee/2BeOETG<br>[ACL 2025 Workshop] Official PyTorch Implementation of "Towards Better... | | Experimental |
| 84 | ehs9nino/traffic-ocr-llm-benchmark<br>Benchmark dataset for OCR + LLM document understanding in traffic and... | | Experimental |
| 85 | BetterAndBetterII/effimemo<br>A Python package for managing large language model (LLM) context windows,... | | Experimental |
| 86 | alexfdez1010/ner-llm<br>A system for doing NER using LLMs and LRMs | | Experimental |
| 87 | bionlplab/isimp<br>A sentence simplification system | | Experimental |
| 88 | orionw/LM-expansions<br>When do Generative Query and Document Expansions Fail? A Comprehensive Study... | | Experimental |
| 89 | harvey-fin/absence-bench<br>Code implementation for paper AbsenceBench: Language Models Can't Tell What's Missing | | Experimental |
| 90 | cx0/llm-typos<br>Impact of typos and common misspellings on LLM task performance. | | Experimental |
| 91 | DFKI-NLP/LLMCheckup<br>Code for the NAACL 2024 HCI+NLP Workshop paper "LLMCheckup: Conversational... | | Experimental |
| 92 | JonnoB/scrambledtext_analysis<br>Can synthetic corrupted data be used to train LLM's to correct OCR text? | | Experimental |
| 93 | gsbm/minilm<br>A lightweight toolkit for experimenting with compact language models | | Experimental |
| 94 | sileod/pragmeval<br>Discourse Based Evaluation of Language Understanding | | Experimental |
| 95 | codingClaire/Structural-Code-Understanding<br>A Survey of Deep Learning Models for Structural Code Understanding | | Experimental |
| 96 | Kseymur/eltex-sheets-addon<br>Google Sheets add-on for domain-driven synthetic data generation using LLMs. | | Experimental |
| 97 | erayyap/lats-for-ollama<br>A primitive and an inefficient implementation of LATS for usage alongside... | | Experimental |
| 98 | inteldict/CatEval<br>tool for constituency parsing evaluation | | Experimental |
| 99 | daskol/lsp-lm<br>Language Model as a Language Server | | Experimental |
| 100 | zircote/oolong-pairs<br>Benchmark harness for A/B testing Claude Code plugins against OOLONG... | | Experimental |
| 101 | cluebbers/Reproducibility-METEOR-NLP<br>Investigates the reproducibility of METEOR scores in scientific papers.... | | Experimental |
| 102 | OasisSimpDataset/OasisSimpDataset.github.io<br>OasisSimp: An Open-source Asian-English Sentence Simplification Dataset | | Experimental |
| 103 | soualahmohammedzakaria/Fuzzy-LM<br>Minimal implementation of a language model with fuzzy word matching. | | Experimental |
| 104 | rahmanidashti/SyntheticTestCollections<br>[Official Codes] Synthetic Test Collections for Retrieval Evaluation (SIGIR 2024) | | Experimental |
| 105 | imohammad12/GRS<br>An unsupervised approach to sentence simplification that combines text... | | Experimental |
| 106 | adianliusie/comparative-assessment<br>Framework for using LLMs to grade texts by using pairwise comparisons. | | Experimental |
| 107 | ppsmk388/MoraBench<br>MoraBench (Model Ranking Benchmark) | | Experimental |
| 108 | icecola12/AgenticPOIBench-A-Realistic-Benchmark-for-Agentic-Spatiotemporal-Constrained-POI-Search<br>AgenticPOIBench: A Realistic Benchmark for Agentic... | | Experimental |
| 109 | pthompson8594/SemanticUTF8<br>UTF-8 language model compression achieving ~66% token reduction while... | | Experimental |
| 110 | Haiku-Legal/legaleval<br>LegalEval, high level framework for evaluation of legal LLMs and reasoning... | | Experimental |
| 111 | SangeethaVenkatesan/DecodeTheDiscourse<br>Conversational Analyzer empowered with Cohere Language Models to enhance the... | | Experimental |
| 112 | avnCode/Topics_in_AI<br>We propose a novel evaluation technique for LLMs which surpasses BeRT based... | | Experimental |
| 113 | hipe-eval/HIPE-2022-eval<br>HIPE-2022 Evaluation Toolkit: All you need to reproduce the HIPE-2022... | | Experimental |
| 114 | u-hyszk/japanese-speculative-decoding<br>Verification of the effect of speculative decoding in Japanese. | | Experimental |