LLM Data Labeling Tools
Tools and platforms for annotating, labeling, and cleaning datasets using LLMs, including data quality management and weak supervision frameworks. Does NOT include general data processing pipelines, embeddings-only tools, or non-annotation data transformation.
23 LLM data labeling tools are tracked. One scores above 70 (Verified tier). The highest-rated is NVIDIA-NeMo/Curator at 71/100 with 1,443 stars. 3 of the top 10 are actively maintained.
Get the projects as JSON (note that the example request sets limit=20, while 23 projects are tracked):
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-data-labeling&limit=20"
Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
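For scripted access, the same request can be sketched in Python using only the standard library. The endpoint and the query parameters (domain, subcategory, limit) are taken from the curl command above; the helper names `build_url` and `fetch_projects` are hypothetical, and the shape of the returned JSON is not documented here, so the sketch returns the parsed body without assuming any field names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint copied from the curl example; everything else is illustrative.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="llm-tools", subcategory="llm-data-labeling", limit=20):
    """Build the request URL with the same query parameters as the curl call."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=10):
    """Fetch the URL and return the parsed JSON body.

    The response structure is an unknown here, so it is returned untouched.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Prints the URL only; call fetch_projects(build_url()) to hit the API.
    print(build_url())
```

Calling `fetch_projects(build_url())` performs the network request and counts against the daily request quota. Whether a larger `limit` value (for example 23) is accepted by the API is an assumption that would need to be checked against the service.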
| # | Tool | Description | Score | Tier |
|---|---|---|---|---|
| 1 | NVIDIA-NeMo/Curator | Scalable data pre processing and curation toolkit for LLMs | 71 | Verified |
| 2 | MigoXLab/dingo | Dingo: A Comprehensive AI Data, Model and Application Quality Evaluation Tool | | Established |
| 3 | data-prep-kit/data-prep-kit | Open source project for data preparation for GenAI applications | | Established |
| 4 | TheDataStation/pneuma | LLM-Powered Data Discovery System for Tabular Data | | Established |
| 5 | cleanlab/cleanlab-studio | Client interface to Cleanlab Studio | | Emerging |
| 6 | jpmorganchase/CodeQuest | CodeQUEST is a generalizable framework which leverages LLMs to iteratively... | | Emerging |
| 7 | GUNDAM-Labet/GUNDAM | GUNDAM is a data management system that prioritizes data using language models. | | Emerging |
| 8 | nxank4/loclean | ⚡️ The All-in-One Local AI Data Cleaning Library | | Emerging |
| 9 | AI4Bharat/Anudesh | An open source platform to annotate data for Large language models - at scale | | Emerging |
| 10 | BatsResearch/alfred | A system for prompted weak supervision. Alfred is a powerful tool that... | | Emerging |
| 11 | codepawl/loclean | An AI Data Cleaning Library | | Emerging |
| 12 | worldbank/llm4data | LLM4Data is a Python library designed to facilitate the application of large... | | Emerging |
| 13 | saran9991/llm-data-annotation | Use Large Language Models like OpenAI's GPT-3.5 for data annotation and... | | Emerging |
| 14 | hikariming/pindata | PinData is a modern, open-source dataset management platform designed... | | Emerging |
| 15 | PennShenLab/FREEFORM | FREEFORM \| Knowledge-Driven Feature Selection and Engineering with Large... | | Emerging |
| 16 | data-prompt-query/dpq | dpq is an open-source python library that makes prompt-based data... | | Experimental |
| 17 | lechmazur/writing_styles | Documents the style side of the short-story Creative Writing LLM benchmark:... | | Experimental |
| 18 | tayyab-nlp/AnnotaLoop | AI-assisted document annotation with human-in-the-loop workflows | | Experimental |
| 19 | MehrdadJalali-AI/LLM-ELN | Integrating LLMs with ELNs to transform materials science research at KIT,... | | Experimental |
| 20 | jd-coderepos/awases-ald | A repository outlining the use of LLMs to extract structured process... | | Experimental |
| 21 | dab3oon/writing_styles | 📚 Analyze stylistic differences in AI-generated flash fiction to understand... | | Experimental |
| 22 | CodeguruEdison/llm-tagger-api | AI-powered auto-tagging API for repair order notes - uses LLM + rules engine... | | Experimental |
| 23 | Garrafao/durel_tool | Source code for DURel Annotation Tool | | Experimental |