LLM Evaluation Benchmarking NLP Tools

Tools and frameworks for evaluating, benchmarking, and scoring large language model outputs across various dimensions (accuracy, reasoning, semantic understanding, consistency). Includes automated metrics, evaluation harnesses, and comparative testing frameworks. Does NOT include model training, fine-tuning, adaptation, or general NLP task evaluation unrelated to LLM assessment.

There are 114 LLM evaluation benchmarking tools tracked. One scores above 70 (Verified tier). The highest-rated is google/langfun at 72/100 with 900 stars. One of the top 10 is actively maintained.

Get all 114 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=llm-evaluation-benchmarking&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
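For scripted access, the same request can be issued from Python. The following is a minimal sketch using only the standard library: the endpoint and query parameters are copied from the curl example, while the function names `build_url` and `fetch_projects` are hypothetical helpers introduced here. The shape of the JSON response is not documented on this page, so the sketch returns the decoded body unchanged. The example URL uses limit=20; whether a larger limit returns all 114 projects in one call is an assumption that would need checking against the API.

```python
import json
import urllib.parse
import urllib.request

# Endpoint copied from the curl example; the helpers below are illustrative only.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="nlp", subcategory="llm-evaluation-benchmarking", limit=20):
    """Return the request URL for one domain/subcategory listing."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=10):
    """GET the URL and return the decoded JSON body unchanged.

    Assumption: the body is JSON; its exact structure is not described here.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Building the URL needs no network access.
print(build_url())
```

Calling `fetch_projects(build_url())` performs the actual request and counts against the daily request quota mentioned above.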

# Tool Score Tier
1 google/langfun

OO for LLMs

72
Verified
2 tanaos/artifex

Small Language Model Inference, Fine-Tuning and Observability. No GPU, no...

58
Established
3 preligens-lab/textnoisr

Adding random noise to a text dataset, and controlling very accurately the...

56
Established
4 vulnerability-lookup/VulnTrain

A tool to generate datasets and models based on vulnerabilities descriptions...

55
Established
5 masakhane-io/masakhane-mt

Machine Translation for Africa

51
Established
6 DataScienceUIBK/HintEval

HintEval💡: A Comprehensive Framework for Hint Generation and Evaluation for Questions

50
Established
7 microsoft/LMChallenge

A library & tools to evaluate predictive language models.

50
Established
8 Maluuba/nlg-eval

Evaluation code for various unsupervised automated metrics for Natural...

49
Emerging
9 feralvam/easse

Easier Automatic Sentence Simplification Evaluation

47
Emerging
10 EleanorJiang/BlonDe

Official implementations for (1) BlonDe: An Automatic Evaluation Metric for...

47
Emerging
11 wasiahmad/PLBART

Official code of our work, Unified Pre-training for Program Understanding...

46
Emerging
12 LIAAD/tieval

An Evaluation Framework for Temporal Information Extraction Systems

46
Emerging
13 abasirat/llm-adapter

A plug-and-play adapter architecture that efficiently adapts large language...

44
Emerging
14 dataset-sh/slambda

We turn instruction and examples into plain python function powered by LLM.

44
Emerging
15 olivettigroup/materials-synthesis-generative-models

Public release of data and code for materials synthesis generation

44
Emerging
16 gcunhase/NLPMetrics

Python code for various NLP metrics

44
Emerging
17 Lambda-3/DiscourseSimplification

Extension of the SentenceSimplification project

42
Emerging
18 disi-unibo-nlp/nlg-metricverse

[COLING22] An End-to-End Library for Evaluating Natural Language Generation

42
Emerging
19 Joinn99/RocketEval-ICLR

🚀 [ICLR '25] RocketEval: Efficient Automated LLM Evaluation via Grading Checklist

41
Emerging
20 Kyle-Ross/glyphdeck

The glyphdeck library is a comprehensive toolkit designed to streamline &...

39
Emerging
21 microsoft/Litmus

AI Assistant for Building Reliable, High-performing and Fair Multilingual NLP Systems

37
Emerging
22 namwonss/Math-Solver

Classifier for math word problems using deep learning

37
Emerging
23 VityaVitalich/TaxoLLaMA

[ACL 2024] TaxoLLaMA: WordNet-based Model for Solving Multiple Lexical Semantic Tasks

37
Emerging
24 SapienzaNLP/guardians-mt-eval

Official repository of the ACL 2024 paper "Guardians of the Machine...

36
Emerging
25 rashad101/RoMe

PyTorch code for ACL 2022 paper: RoMe: A Robust Metric for Evaluating...

36
Emerging
26 4AI/langml

A Keras-based and TensorFlow-backend NLP Models Toolkit.

36
Emerging
27 IIIIQIIII/DramaBench

A six-dimensional evaluation framework for drama script continuation with...

36
Emerging
28 zjunlp/MemBase

A Comprehensive Benchmarking Framework for Long-Term Conversational Memory Layers

36
Emerging
29 golsun/SpaceFusion

NAACL'19: "Jointly Optimizing Diversity and Relevance in Neural Response Generation"

35
Emerging
30 bassrehab/spark-llm-eval

Spark-native LLM evaluation framework with confidence intervals,...

35
Emerging
31 Sanqiang/text_simplification

Text Simplification Model based on Encoder-Decoder (includes Transformer and...

34
Emerging
32 Living-with-machines/lwm_ARTIDIGH_2020_OCR_impact_downstream_NLP_tasks

Repository for code underlying the paper 'Assessing the Impact of OCR...

34
Emerging
33 ksanu1998/static_analysis_codegen_llms

This repository contains code base for project titled Leveraging static...

33
Emerging
34 psunlpgroup/ReaLMistake

This repository includes a benchmark and code for the paper "Evaluating LLMs...

32
Emerging
35 USC-FORTIS/NLP-ADBench

[EMNLP Findings 2025]. NLP-ADBench is a comprehensive benchmarking tool...

32
Emerging
36 davidheineman/salsa

Success and Failure Linguistic Simplification Annotation 💃

32
Emerging
37 OSU-NLP-Group/SELM

Symmetric Encryption with Language Models

32
Emerging
38 language-brainscore/langbrainscore

[Marked for Deprecation. please visit...

32
Emerging
39 explosion/prodigy-openai-recipes

✨ Bootstrap annotation with zero- & few-shot learning via OpenAI GPT-3

32
Emerging
40 feralvam/metaeval-simplification

Meta-evaluation of automatic metrics in Text Simplification

32
Emerging
41 lmvasque/ts-explore

Source code for Text Simplification Evaluation papers at ACL findings and...

31
Emerging
42 subramanya1997/Novel-T5

We propose to use a mode that favors sentiment understanding and empathetic...

31
Emerging
43 JonnoB/scrambledtext

A Python library for creating synthetic corrupted OCR text using a Markov process

31
Emerging
44 greg2451/aggregating-text-similarity-metrics

This repository consists of a benchmark of various text similarity measures...

30
Emerging
45 sileod/Discovery

Mining Discourse Markers for Unsupervised Sentence Representation Learning

30
Emerging
46 liamcripwell/control_simp

Code and resources for controllable simplification via operation classification.

30
Emerging
47 saarus72/text_normalization

T5-based (Russian) text normalization

30
Emerging
48 Lambda-3/SentenceSimplification

Tool to simplify English sentences into their core and context sentences

30
Emerging
49 BM-K/KoMiniLM

Korean Light Weight Language Model

29
Experimental
50 doheejin/HiPAMA

This repository is the implementation of the HiPAMA architecture, introduced...

29
Experimental
51 2030NLP/SpaCE2021

Chinese Spatial Semantic Understanding Evaluation (中文空间语义理解评测)

29
Experimental
52 zy-liu/POSSCORE

This repo is for POSSCORE, an automatic evaluation metric for the...

29
Experimental
53 roboalchemist/dynamic-baml

Python library for dynamic BAML schema generation and LLM structured data...

28
Experimental
54 sileod/DiscSense

Automated Semantic Analysis of Discourse Markers

28
Experimental
55 megagonlabs/holobench

🫧 Code for Holistic Reasoning with Long-Context LMs: A Benchmark for...

27
Experimental
56 civillibertarian-stressincontinence617/llm-autoeval

๐Ÿ› ๏ธ Simplify LLM evaluation with our Colab notebook; just name your model,...

27
Experimental
57 11NOel11/ChaosBench-Logic

Benchmark dataset and tooling for evaluating LLM logical reasoning and...

26
Experimental
58 doheejin/SB_loss_PA

This repository is the implementation of the paper, "Score-balanced Loss for...

26
Experimental
59 lancopku/meSimp

Codes for "Training Simplification and Model Simplification for Deep...

26
Experimental
60 doc-analysis/ReadingBank

ReadingBank: A Benchmark Dataset for Reading Order Detection

25
Experimental
61 chrischenhub/OnlySportsLM

SOTA Sports-domain Language Model under Billion Parameters

25
Experimental
62 miserytale/Little_Language_Model

LittleLM: A tiny character-level n-gram language model for local corpus...

25
Experimental
63 JINO-ROHIT/tachyon

An LLM inference engine to run on consumer hardware

25
Experimental
64 princeton-nlp/blindfold-textgame

[NAACL 2021] Reading and Acting while Blindfolded: The Need for Semantics in...

25
Experimental
65 rafaelsandroni/gpt3-data-labeling

Data labeling using few shot learning GPT-3.

25
Experimental
66 idramalab/quantify-llm-explanations

Evaluating Large Language Models for Detecting Antisemitism

24
Experimental
67 yancong222/ClinicalNLP2024

Python code for LLMs surprisals and linear machine learning models

23
Experimental
68 yancong222/LMs-discourse-connectives-Surprisals

On the Influence of Discourse Connectives on the Predictions of Humans and...

23
Experimental
69 dsdanielpark/all-about-llm

dsdanielpark's curation and categorization of resources on large language...

23
Experimental
70 balajeekalyan/figureout

FigureOut is a Python package that allows developers to easily integrate LLM into...

22
Experimental
71 devxiongmao/llm-scorecaster

LLM-Scorecaster is a Python-based system designed to evaluate and analyze...

22
Experimental
72 licphel/LLMe

LLM trainer for personal computers.

22
Experimental
73 ylkhayat/cocolex

[ACL 2025] Codebase for CoCoLex

22
Experimental
74 Omg1221/search_evals

๐Ÿ” Evaluate web search APIs with our framework, testing accuracy and...

22
Experimental
75 BramVanroy/mai-simplification-nl-2023

Sentence-Level Text Simplification for Dutch

22
Experimental
76 Kaito1999-script/ULMEvalKit

๐Ÿ› ๏ธ Evaluate unified models effortlessly with ULMEvalKit, your open-source...

22
Experimental
77 somsubhra04/LLM_Legal_Prompt_Generation

Data and codes for the EMNLP 2023 paper 'LLMs – the Good, the Bad or the...

21
Experimental
78 audreycs/ImpScore

A repository for paper ImpScore: A Learnable Metric For Quantifying The...

21
Experimental
79 soldni/tokreate

A minimal library to create tokens using LLMs.

21
Experimental
80 alphadl/EasyBLEU

An effective and simple tool to calculate SacreBLEU, Token-BLEU, BLEU w/...

21
Experimental
81 kaganhitit11/mergeval

mergeval is a unified tool that lets you merge and evaluate large language...

21
Experimental
82 baojunshan/nlg-metrics

Natural language generation evaluation metrics

20
Experimental
83 YecanLee/2BeOETG

[ACL 2025 Workshop] Official PyTorch Implementation of "Towards Better...

20
Experimental
84 ehs9nino/traffic-ocr-llm-benchmark

Benchmark dataset for OCR + LLM document understanding in traffic and...

20
Experimental
85 BetterAndBetterII/effimemo

A Python package for managing large language model (LLM) context windows,...

20
Experimental
86 alexfdez1010/ner-llm

A system for doing NER using LLMs and LRMs

20
Experimental
87 bionlplab/isimp

A sentence simplification system

20
Experimental
88 orionw/LM-expansions

When do Generative Query and Document Expansions Fail? A Comprehensive Study...

20
Experimental
89 harvey-fin/absence-bench

Code implementation for paper AbsenceBench: Language Models Can't Tell What's Missing

19
Experimental
90 cx0/llm-typos

Impact of typos and common misspellings on LLM task performance.

19
Experimental
91 DFKI-NLP/LLMCheckup

Code for the NAACL 2024 HCI+NLP Workshop paper "LLMCheckup: Conversational...

19
Experimental
92 JonnoB/scrambledtext_analysis

Can synthetic corrupted data be used to train LLMs to correct OCR text?

19
Experimental
93 gsbm/minilm

A lightweight toolkit for experimenting with compact language models

19
Experimental
94 sileod/pragmeval

Discourse Based Evaluation of Language Understanding

18
Experimental
95 codingClaire/Structural-Code-Understanding

A Survey of Deep Learning Models for Structural Code Understanding

18
Experimental
96 Kseymur/eltex-sheets-addon

Google Sheets add-on for domain-driven synthetic data generation using LLMs.

18
Experimental
97 erayyap/lats-for-ollama

A primitive and an inefficient implementation of LATS for usage alongside...

17
Experimental
98 inteldict/CatEval

tool for constituency parsing evaluation

17
Experimental
99 daskol/lsp-lm

Language Model as a Language Server

17
Experimental
100 zircote/oolong-pairs

Benchmark harness for A/B testing Claude Code plugins against OOLONG...

16
Experimental
101 cluebbers/Reproducibility-METEOR-NLP

Investigates the reproducibility of METEOR scores in scientific papers....

16
Experimental
102 OasisSimpDataset/OasisSimpDataset.github.io

OasisSimp: An Open-source Asian-English Sentence Simplification Dataset

14
Experimental
103 soualahmohammedzakaria/Fuzzy-LM

Minimal implementation of a language model with fuzzy word matching.

14
Experimental
104 rahmanidashti/SyntheticTestCollections

[Official Codes] Synthetic Test Collections for Retrieval Evaluation (SIGIR 2024)

13
Experimental
105 imohammad12/GRS

An unsupervised approach to sentence simplification that combines text...

12
Experimental
106 adianliusie/comparative-assessment

Framework for using LLMs to grade texts by using pairwise comparisons.

12
Experimental
107 ppsmk388/MoraBench

MoraBench (Model Ranking Benchmark)

12
Experimental
108 icecola12/AgenticPOIBench-A-Realistic-Benchmark-for-Agentic-Spatiotemporal-Constrained-POI-Search

AgenticPOIBench: A Realistic Benchmark for Agentic...

11
Experimental
109 pthompson8594/SemanticUTF8

UTF-8 language model compression achieving ~66% token reduction while...

11
Experimental
110 Haiku-Legal/legaleval

LegalEval, high level framework for evaluation of legal LLMs and reasoning...

11
Experimental
111 SangeethaVenkatesan/DecodeTheDiscourse

Conversational Analyzer empowered with Cohere Language Models to enhance the...

10
Experimental
112 avnCode/Topics_in_AI

We propose a novel evaluation technique for LLMs which surpasses BeRT based...

10
Experimental
113 hipe-eval/HIPE-2022-eval

HIPE-2022 Evaluation Toolkit: All you need to reproduce the HIPE-2022...

10
Experimental
114 u-hyszk/japanese-speculative-decoding

Verification of the effect of speculative decoding in Japanese.

10
Experimental
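
The tier labels in the listing appear to follow score bands. The sketch below encodes cutoffs inferred only from the rows shown here: the summary states that Verified means a score above 70, the lowest Established row is 50 while the highest Emerging row is 49, and the lowest Emerging row is 30 while the highest Experimental row is 29. These cutoffs are assumptions, not published rules. In particular, no listed project scores between 59 and 70, so where Established ends is not observable from this page. The function name `tier_for` is hypothetical.

```python
def tier_for(score):
    """Map a 0-100 quality score to a tier label.

    All cutoffs are inferred from the listing and may not match the
    site's actual rules.
    """
    if score > 70:       # summary text: Verified scores are "above 70"
        return "Verified"
    if score >= 50:      # assumed: 50 is the lowest Established score listed
        return "Established"
    if score >= 30:      # assumed: 30 is the lowest Emerging score listed
        return "Emerging"
    return "Experimental"


# Spot-check against rows from the listing.
for name, score in [("google/langfun", 72), ("Maluuba/nlg-eval", 49), ("BM-K/KoMiniLM", 29)]:
    print(name, score, tier_for(score))
```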