LLM Data Labeling Tools

Tools and platforms for annotating, labeling, and cleaning datasets using LLMs, including data quality management and weak supervision frameworks. Does NOT include general data processing pipelines, embeddings-only tools, or non-annotation data transformation.

There are 23 LLM data labeling tools tracked. One scores above 70 (Verified tier). The highest-rated is NVIDIA-NeMo/Curator at 71/100 with 1,443 stars. Three of the top 10 are actively maintained.

Get all 23 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-data-labeling&limit=20"

Open to everyone at 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
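
The same request can be made from a script. The following is a minimal sketch using only the Python standard library; the helper names `build_url` and `fetch_quality` are hypothetical, and the shape of the JSON response is not documented here, so the payload is returned unparsed beyond `json.load`. Note that the example query uses `limit=20` while 23 projects are tracked; whether the API accepts a larger limit is not stated on this page.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="llm-tools", subcategory="llm-data-labeling", limit=20):
    """Build the query URL using the same parameters as the curl example."""
    query = urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_quality(url, timeout=10.0):
    """Perform the GET request and decode the JSON body.

    The response schema is an unknown here, so the decoded object is
    returned as-is for the caller to inspect.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_quality(build_url())` issues one request, which counts against the daily quota described above.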

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | NVIDIA-NeMo/Curator | Scalable data pre processing and curation toolkit for LLMs | 71 | Verified |
| 2 | MigoXLab/dingo | Dingo: A Comprehensive AI Data, Model and Application Quality Evaluation Tool | 64 | Established |
| 3 | data-prep-kit/data-prep-kit | Open source project for data preparation for GenAI applications | 64 | Established |
| 4 | TheDataStation/pneuma | LLM-Powered Data Discovery System for Tabular Data | 50 | Established |
| 5 | cleanlab/cleanlab-studio | Client interface to Cleanlab Studio | 49 | Emerging |
| 6 | jpmorganchase/CodeQuest | CodeQUEST is a generalizable framework which leverages LLMs to iteratively... | 47 | Emerging |
| 7 | GUNDAM-Labet/GUNDAM | GUNDAM is a data management system that prioritizes data using language models. | 46 | Emerging |
| 8 | nxank4/loclean | ⚡️ The All-in-One Local AI Data Cleaning Library | 43 | Emerging |
| 9 | AI4Bharat/Anudesh | An open source platform to annotate data for Large language models - at scale | 41 | Emerging |
| 10 | BatsResearch/alfred | A system for prompted weak supervision. Alfred is a powerful tool that... | 39 | Emerging |
| 11 | codepawl/loclean | An AI Data Cleaning Library | 38 | Emerging |
| 12 | worldbank/llm4data | LLM4Data is a Python library designed to facilitate the application of large... | 33 | Emerging |
| 13 | saran9991/llm-data-annotation | Use Large Language Models like OpenAI's GPT-3.5 for data annotation and... | 33 | Emerging |
| 14 | hikariming/pindata | PinData is a modern, open-source dataset management platform designed... | 30 | Emerging |
| 15 | PennShenLab/FREEFORM | FREEFORM \| Knowledge-Driven Feature Selection and Engineering with Large... | 30 | Emerging |
| 16 | data-prompt-query/dpq | dpq is an open-source python library that makes prompt-based data... | 27 | Experimental |
| 17 | lechmazur/writing_styles | Documents the style side of the short-story Creative Writing LLM benchmark:... | 27 | Experimental |
| 18 | tayyab-nlp/AnnotaLoop | AI-assisted document annotation with human-in-the-loop workflows | 23 | Experimental |
| 19 | MehrdadJalali-AI/LLM-ELN | Integrating LLMs with ELNs to transform materials science research at KIT,... | 21 | Experimental |
| 20 | jd-coderepos/awases-ald | A repository outlining the use of LLMs to extract structured process... | 20 | Experimental |
| 21 | dab3oon/writing_styles | 📚 Analyze stylistic differences in AI-generated flash fiction to understand... | 20 | Experimental |
| 22 | CodeguruEdison/llm-tagger-api | AI-powered auto-tagging API for repair order notes - uses LLM + rules engine... | 14 | Experimental |
| 23 | Garrafao/durel_tool | Source code for DURel Annotation Tool | 10 | Experimental |
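
The relationship between score and tier in the listing above can be sketched as a simple threshold function. This is an illustration only: the page states that scores above 70 fall in the Verified tier, but the other cutoffs (50 and 30) are assumptions inferred from the listed rows (50 is Established while 49 is Emerging; 30 is Emerging while 27 is Experimental). The function name `tier_for` is hypothetical.

```python
from collections import Counter

# Scores copied from the ranking above, in rank order (1 through 23).
SCORES = [71, 64, 64, 50, 49, 47, 46, 43, 41, 39, 38, 33,
          33, 30, 30, 27, 27, 23, 21, 20, 20, 14, 10]


def tier_for(score: int) -> str:
    """Map a 0-100 score to a tier name.

    Only the 'above 70 is Verified' rule comes from the page itself;
    the 50 and 30 cutoffs are inferred from the listed rows.
    """
    if score > 70:
        return "Verified"
    if score >= 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"


# Tally how many listed tools land in each tier under these assumptions.
tier_counts = Counter(tier_for(score) for score in SCORES)
```

Under these assumed cutoffs, the tally reproduces the tiers shown in the listing, including the single Verified entry mentioned in the summary.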

Comparisons in this category