LLM Evaluation Benchmarking NLP Tools

Tools and frameworks for evaluating, benchmarking, and scoring large language model outputs across various dimensions (accuracy, reasoning, semantic understanding, consistency). Includes automated metrics, evaluation harnesses, and comparative testing frameworks. Does NOT include model training, fine-tuning, adaptation, or general NLP task evaluation unrelated to LLM assessment.

There are 114 LLM evaluation benchmarking tools tracked. One scores above 70 (Verified tier). The highest-rated is google/langfun at 72/100 with 900 stars. One of the top 10 is actively maintained.

Get the 114 projects as JSON (the example request sets limit=20)

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=llm-evaluation-benchmarking&limit=20"

Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
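
The same request can be made from Python. The following is a minimal sketch using only the standard library. The helper names `build_url` and `fetch` are hypothetical, and the layout of the JSON response is not described on this page, so the code decodes the body without assuming any field names.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="nlp", subcategory="llm-evaluation-benchmarking", limit=20):
    """Build the request URL with the same query parameters as the curl command."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch(url, timeout=30):
    """Fetch the URL and decode the body as JSON (response schema not assumed)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch(build_url())` would issue the request shown in the curl command. Whether a larger `limit` or a paging parameter is accepted is not stated here, so that would need to be checked against the API itself.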

#	Tool	Score	Tier	Stars	Language
1	google/langfun OO for LLMs	72	Verified	900	Python
2	tanaos/artifex Small Language Model Inference, Fine-Tuning and Observability. No GPU, no...	58	Established	90	Python
3	preligens-lab/textnoisr Adding random noise to a text dataset, and controlling very accurately the...	56	Established	20	Python
4	vulnerability-lookup/VulnTrain A tool to generate datasets and models based on vulnerabilities descriptions...	55	Established	23	Python
5	masakhane-io/masakhane-mt Machine Translation for Africa	51	Established	312	Lua
6	DataScienceUIBK/HintEval HintEval💡: A Comprehensive Framework for Hint Generation and Evaluation for Questions	50	Established	36	Python
7	microsoft/LMChallenge A library & tools to evaluate predictive language models.	50	Established	65	Python
8	Maluuba/nlg-eval Evaluation code for various unsupervised automated metrics for Natural...	49	Emerging	1,391	Python
9	feralvam/easse Easier Automatic Sentence Simplification Evaluation	47	Emerging	166	Roff
10	EleanorJiang/BlonDe Official implementations for (1) BlonDe: An Automatic Evaluation Metric for...	47	Emerging	83	Python
11	wasiahmad/PLBART Official code of our work, Unified Pre-training for Program Understanding...	46	Emerging	186	Python
12	LIAAD/tieval An Evaluation Framework for Temporal Information Extraction Systems	46	Emerging	20	Python
13	abasirat/llm-adapter A plug-and-play adapter architecture that efficiently adapts large language...	44	Emerging	3	Python
14	dataset-sh/slambda We turn instruction and examples into plain python function powered by LLM.	44	Emerging	3	Python
15	olivettigroup/materials-synthesis-generative-models Public release of data and code for materials synthesis generation	44	Emerging	75	HTML
16	gcunhase/NLPMetrics Python code for various NLP metrics	44	Emerging	169	Jupyter Notebook
17	Lambda-3/DiscourseSimplification Extension of the SentenceSimplification project	42	Emerging	61	Java
18	disi-unibo-nlp/nlg-metricverse [COLING22] An End-to-End Library for Evaluating Natural Language Generation	42	Emerging	94	Python
19	Joinn99/RocketEval-ICLR 🚀 [ICLR '25] RocketEval: Efficient Automated LLM Evaluation via Grading Checklist	41	Emerging	15	Python
20	Kyle-Ross/glyphdeck The glyphdeck library is a comprehensive toolkit designed to streamline &...	39	Emerging	2	Python
21	microsoft/Litmus AI Assistant for Building Reliable, High-performing and Fair Multilingual NLP Systems	37	Emerging	48	Python
22	namwonss/Math-Solver Classifier for math word problems using deep learning	37	Emerging	11	Python
23	VityaVitalich/TaxoLLaMA [ACL 2024] TaxoLLaMA: WordNet-based Model for Solving Multiple Lexical Semantic Tasks	37	Emerging	19	Python
24	SapienzaNLP/guardians-mt-eval Official repository of the ACL 2024 paper "Guardians of the Machine...	36	Emerging	10	Python
25	rashad101/RoMe PyTorch code for ACL 2022 paper: RoMe: A Robust Metric for Evaluating...	36	Emerging	10	Python
26	4AI/langml A Keras-based and TensorFlow-backend NLP Models Toolkit.	36	Emerging	12	Python
27	IIIIQIIII/DramaBench A six-dimensional evaluation framework for drama script continuation with...	36	Emerging	84	HTML
28	zjunlp/MemBase A Comprehensive Benchmarking Framework for Long-Term Conversational Memory Layers	36	Emerging	11	Python
29	golsun/SpaceFusion NAACL'19: "Jointly Optimizing Diversity and Relevance in Neural Response Generation"	35	Emerging	73	Python
30	bassrehab/spark-llm-eval Spark-native LLM evaluation framework with confidence intervals,...	35	Emerging	3	Python
31	Sanqiang/text_simplification Text Simplification Model based on Encoder-Decoder (includes Transformer and...	34	Emerging	68	Python
32	Living-with-machines/lwm_ARTIDIGH_2020_OCR_impact_downstream_NLP_tasks Repository for code underlying the paper 'Assessing the Impact of OCR...	34	Emerging	9	Jupyter Notebook
33	ksanu1998/static_analysis_codegen_llms This repository contains code base for project titled Leveraging static...	33	Emerging	5	HTML
34	psunlpgroup/ReaLMistake This repository includes a benchmark and code for the paper "Evaluating LLMs...	32	Emerging	31	Python
35	USC-FORTIS/NLP-ADBench [EMNLP Findings 2025]. NLP-ADBench is a comprehensive benchmarking tool...	32	Emerging	21	Python
36	davidheineman/salsa Success and Failure Linguistic Simplification Annotation 💃	32	Emerging	5	Python
37	OSU-NLP-Group/SELM Symmetric Encryption with Language Models	32	Emerging	13	Python
38	language-brainscore/langbrainscore [Marked for Deprecation. please visit...	32	Emerging	5	Python
39	explosion/prodigy-openai-recipes ✨ Bootstrap annotation with zero- & few-shot learning via OpenAI GPT-3	32	Emerging	322	Python
40	feralvam/metaeval-simplification Meta-evaluation of automatic metrics in Text Simplification	32	Emerging	4	Jupyter Notebook
41	lmvasque/ts-explore Source code for Text Simplification Evaluation papers at ACL findings and...	31	Emerging	3	Python
42	subramanya1997/Novel-T5 We propose to use a mode that favors sentiment understanding and empathetic...	31	Emerging	3	Jupyter Notebook
43	JonnoB/scrambledtext A python library for creating synthetic corrupted OCR text using a markov process	31	Emerging	9	Python
44	greg2451/aggregating-text-similarity-metrics This repository consists of a benchmark of various text similarity measures...	30	Emerging	2	Jupyter Notebook
45	sileod/Discovery Mining Discourse Markers for Unsupervised Sentence Representation Learning	30	Emerging	61	Jupyter Notebook
46	liamcripwell/control_simp Code and resources for controllable simplification via operation classification.	30	Emerging	2	Jupyter Notebook
47	saarus72/text_normalization T5-based (russian) text normalization	30	Emerging	26	Jupyter Notebook
48	Lambda-3/SentenceSimplification Tool to simplify english sentences into their core and context sentences	30	Emerging	6	Java
49	BM-K/KoMiniLM Korean Light Weight Language Model	29	Experimental	31	Python
50	doheejin/HiPAMA This repository is the implementation of the HiPAMA architecture, introduced...	29	Experimental	38	Python
51	2030NLP/SpaCE2021 Chinese Spatial Semantic Understanding Evaluation	29	Experimental	39	Python
52	zy-liu/POSSCORE This repo is for POSSCORE, an automatic evaluation metric for the...	29	Experimental	5	Python
53	roboalchemist/dynamic-baml Python library for dynamic BAML schema generation and LLM structured data...	28	Experimental	2	Python
54	sileod/DiscSense Automated Semantic Analysis of Discourse Markers	28	Experimental	11	—
55	megagonlabs/holobench 🫧 Code for Holistic Reasoning with Long-Context LMs: A Benchmark for...	27	Experimental	12	Python
56	civillibertarian-stressincontinence617/llm-autoeval 🛠️ Simplify LLM evaluation with our Colab notebook; just name your model,...	27	Experimental	1	Python
57	11NOel11/ChaosBench-Logic Benchmark dataset and tooling for evaluating LLM logical reasoning and...	26	Experimental	3	Python
58	doheejin/SB_loss_PA This repository is the implementation of the paper, "Score-balanced Loss for...	26	Experimental	22	Python
59	lancopku/meSimp Codes for "Training Simplification and Model Simplification for Deep...	26	Experimental	18	C#
60	doc-analysis/ReadingBank ReadingBank: A Benchmark Dataset for Reading Order Detection	25	Experimental	117	—
61	chrischenhub/OnlySportsLM SOTA Sports-domain Language Model under Billion Parameters	25	Experimental	7	Python
62	miserytale/Little_Language_Model LittleLM: A tiny character-level n-gram language model for local corpus...	25	Experimental	4	Python
63	JINO-ROHIT/tachyon An LLM inference engine to run on consumer hardware	25	Experimental	3	Python
64	princeton-nlp/blindfold-textgame [NAACL 2021] Reading and Acting while Blindfolded: The Need for Semantics in...	25	Experimental	11	Python
65	rafaelsandroni/gpt3-data-labeling Data labeling using few shot learning GPT-3.	25	Experimental	25	Jupyter Notebook
66	idramalab/quantify-llm-explanations Evaluating Large Language Models for Detecting Antisemitism	24	Experimental	4	Python
67	yancong222/ClinicalNLP2024 Python code for LLMs surprisals and linear machine learning models	23	Experimental	1	Python
68	yancong222/LMs-discourse-connectives-Surprisals On the Influence of Discourse Connectives on the Predictions of Humans and...	23	Experimental	1	R
69	dsdanielpark/all-about-llm dsdanielpark's curation and categorization of resources on large language...	23	Experimental	14	Python
70	balajeekalyan/figureout FigureOut is a Python package that allows developers to easily integrate LLM into...	22	Experimental	—	Python
71	devxiongmao/llm-scorecaster LLM-Scorecaster is a Python-based system designed to evaluate and analyze...	22	Experimental	—	Python
72	licphel/LLMe LLM trainer for personal computers.	22	Experimental	1	Python
73	ylkhayat/cocolex [ACL 2025] Codebase for CoCoLex	22	Experimental	6	Python
74	Omg1221/search_evals 🔍 Evaluate web search APIs with our framework, testing accuracy and...	22	Experimental	—	Python
75	BramVanroy/mai-simplification-nl-2023 Sentence-Level Text Simplification for Dutch	22	Experimental	6	Python
76	Kaito1999-script/ULMEvalKit 🛠️ Evaluate unified models effortlessly with ULMEvalKit, your open-source...	22	Experimental	—	Python
77	somsubhra04/LLM_Legal_Prompt_Generation Data and codes for the EMNLP 2023 paper 'LLMs – the Good, the Bad or the...	21	Experimental	7	Python
78	audreycs/ImpScore A repository for paper ImpScore: A Learnable Metric For Quantifying The...	21	Experimental	7	Python
79	soldni/tokreate A minimal library to create tokens using LLMs.	21	Experimental	6	Python
80	alphadl/EasyBLEU An effective and simple tool to calculate SacreBLEU, Token-BLEU, BLEU w/...	21	Experimental	7	Shell
81	kaganhitit11/mergeval mergeval is a unified tool that lets you merge and evaluate large language...	21	Experimental	2	Python
82	baojunshan/nlg-metrics Natural language generation evaluation metrics	20	Experimental	6	Python
83	YecanLee/2BeOETG [ACL 2025 Workshop] Official PyTorch Implementation of "Towards Better...	20	Experimental	5	R
84	ehs9nino/traffic-ocr-llm-benchmark Benchmark dataset for OCR + LLM document understanding in traffic and...	20	Experimental	1	—
85	BetterAndBetterII/effimemo A Python package for managing large language model (LLM) context windows,...	20	Experimental	3	Python
86	alexfdez1010/ner-llm A system for doing NER using LLMs and LRMs	20	Experimental	6	Python
87	bionlplab/isimp A sentence simplification system	20	Experimental	8	Java
88	orionw/LM-expansions When do Generative Query and Document Expansions Fail? A Comprehensive Study...	20	Experimental	5	Python
89	harvey-fin/absence-bench Code implementation for paper AbsenceBench: Language Models Can't Tell What's Missing	19	Experimental	18	Python
90	cx0/llm-typos Impact of typos and common misspellings on LLM task performance.	19	Experimental	19	Python
91	DFKI-NLP/LLMCheckup Code for the NAACL 2024 HCI+NLP Workshop paper "LLMCheckup: Conversational...	19	Experimental	13	Python
92	JonnoB/scrambledtext_analysis Can synthetic corrupted data be used to train LLMs to correct OCR text?	19	Experimental	1	Python
93	gsbm/minilm A lightweight toolkit for experimenting with compact language models	19	Experimental	—	Python
94	sileod/pragmeval Discourse Based Evaluation of Language Understanding	18	Experimental	21	Jupyter Notebook
95	codingClaire/Structural-Code-Understanding A Survey of Deep Learning Models for Structural Code Understanding	18	Experimental	21	Python
96	Kseymur/eltex-sheets-addon Google Sheets add-on for domain-driven synthetic data generation using LLMs.	18	Experimental	1	HTML
97	erayyap/lats-for-ollama A primitive and an inefficient implementation of LATS for usage alongside...	17	Experimental	1	Jupyter Notebook
98	inteldict/CatEval tool for constituency parsing evaluation	17	Experimental	1	Python
99	daskol/lsp-lm Language Model as a Language Server	17	Experimental	1	Python
100	zircote/oolong-pairs Benchmark harness for A/B testing Claude Code plugins against OOLONG...	16	Experimental	3	Python
101	cluebbers/Reproducibility-METEOR-NLP Investigates the reproducibility of METEOR scores in scientific papers....	16	Experimental	2	Jupyter Notebook
102	OasisSimpDataset/OasisSimpDataset.github.io OasisSimp: An Open-source Asian-English Sentence Simplification Dataset	14	Experimental	—	HTML
103	soualahmohammedzakaria/Fuzzy-LM Minimal implementation of a language model with fuzzy word matching.	14	Experimental	1	Python
104	rahmanidashti/SyntheticTestCollections [Official Codes] Synthetic Test Collections for Retrieval Evaluation (SIGIR 2024)	13	Experimental	11	—
105	imohammad12/GRS An unsupervised approach to sentence simplification that combines text...	12	Experimental	6	Roff
106	adianliusie/comparative-assessment Framework for using LLMs to grade texts by using pairwise comparisons.	12	Experimental	8	Python
107	ppsmk388/MoraBench MoraBench (Model Ranking Benchmark)	12	Experimental	5	Python
108	icecola12/AgenticPOIBench-A-Realistic-Benchmark-for-Agentic-Spatiotemporal-Constrained-POI-Search AgenticPOIBench: A Realistic Benchmark for Agentic...	11	Experimental	—	—
109	pthompson8594/SemanticUTF8 UTF-8 language model compression achieving ~66% token reduction while...	11	Experimental	—	C#
110	Haiku-Legal/legaleval LegalEval, high level framework for evaluation of legal LLMs and reasoning...	11	Experimental	—	—
111	SangeethaVenkatesan/DecodeTheDiscourse Conversational Analyzer empowered with Cohere Language Models to enhance the...	10	Experimental	2	Elixir
112	avnCode/Topics_in_AI We propose a novel evaluation technique for LLMs which surpasses BeRT based...	10	Experimental	2	Jupyter Notebook
113	hipe-eval/HIPE-2022-eval HIPE-2022 Evaluation Toolkit: All you need to reproduce the HIPE-2022...	10	Experimental	2	Python
114	u-hyszk/japanese-speculative-decoding Verification of the effect of speculative decoding in Japanese.	10	Experimental	2	Python
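
The tier labels in the table appear to follow score bands. The sketch below maps a score to a tier using cutoffs inferred only from the rows listed here (72 is Verified, 50 to 58 is Established, 30 to 49 is Emerging, 10 to 29 is Experimental). The function name `tier_for` is hypothetical, and the exact boundaries, in particular how scores between 59 and 70 are treated, are assumptions and not documented rules.

```python
def tier_for(score):
    """Map a 0-100 score to a tier label using cutoffs inferred from the table."""
    # Assumption: "above 70" is read literally, so exactly 70 is not Verified.
    if score > 70:
        return "Verified"
    # Assumption: 50 is the lowest Established score seen in the table.
    if score >= 50:
        return "Established"
    # Assumption: 30 is the lowest Emerging score seen in the table.
    if score >= 30:
        return "Emerging"
    return "Experimental"


# Spot-check against rows from the table above.
for score in (72, 58, 50, 49, 30, 29, 10):
    print(score, tier_for(score))
```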