Synthetic Data Generation LLM Tools
Tools for generating synthetic datasets and training data for LLMs through various methods (QA pairs, tabular data, code, structured extraction). Does NOT include general data processing, data augmentation for images, or dataset annotation/curation platforms.
There are 29 synthetic data generation tools tracked. Two score above 50 (Established tier). The highest-rated is InternScience/GraphGen at 56/100 with 978 stars. Only 1 of the top 10 is actively maintained.
Get all 29 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=synthetic-data-generation&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
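For scripted access, the same request can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the helper names `build_url` and `fetch_projects` are hypothetical, the query parameters are copied from the curl command above, and the shape of the JSON response is not documented here, so the code returns the parsed payload as-is. Whether raising `limit` above 20 returns all 29 projects is also an assumption to verify against the API.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(limit=20):
    """Build the request URL with the same query parameters as the curl example."""
    params = {
        "domain": "llm-tools",
        "subcategory": "synthetic-data-generation",
        "limit": limit,  # 20 mirrors the curl example; other values are untested
    }
    return f"{BASE_URL}?{urllib.parse.urlencode(params)}"


def fetch_projects(limit=20, timeout=30):
    """Fetch and parse the JSON payload (hypothetical helper).

    The response structure is an assumption: it is returned unmodified
    so the caller can inspect it before relying on any particular field.
    """
    with urllib.request.urlopen(build_url(limit), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_projects()` performs one request against the keyless quota of 100 requests/day, so caching the result locally is advisable when iterating on a script.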
| # | Tool | Description | Score | Tier |
|---|---|---|---|---|
| 1 | InternScience/GraphGen | GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven... | 56 | Established |
| 2 | timothepearce/synda | A CLI for generating synthetic data | | Established |
| 3 | rasinmuhammed/misata | High-performance open-source synthetic data engine. Uses LLMs for schema... | | Emerging |
| 4 | ziegler-ingo/CRAFT | [TACL, EMNLP 2025 Oral] Code, datasets, and checkpoints for the paper "CRAFT... | | Emerging |
| 5 | ZhuLinsen/FastDatasets | A powerful tool for creating high-quality training datasets for Large... | | Emerging |
| 6 | BatsResearch/bonito | A lightweight library for generating synthetic instruction tuning datasets... | | Emerging |
| 7 | asaparov/prontoqa | Synthetic question-answering dataset to formally analyze the... | | Emerging |
| 8 | Oqura-ai/deepresearch-datagen-cli | Using deep research workflow to generate datasets for finetuning LLMs. | | Emerging |
| 9 | nalinrajendran/synthetic-LLM-QA-dataset-generator | Create synthetic datasets for training and testing Language Learning Models... | | Emerging |
| 10 | Itachi-Uchiha581/Auto-Data | Auto Data is a library designed for quick and effortless creation of... | | Emerging |
| 11 | Alannikos/edg4llm | A unified tool to generate fine-tuning datasets for LLMs, including... | | Emerging |
| 12 | GURPREETKAURJETHRA/Synthetic-Data-Generation-using-LLM | Synthetic Data Generation using LLM via Argilla, Distilabel, ChatGPT, etc. | | Emerging |
| 13 | kevinscaria/TarGEN | Targeted Data Generation with Large Language Models | | Emerging |
| 14 | Glavin001/Data2AITextbook | 🚀 Automatically convert unstructured data into a high-quality 'textbook'... | | Emerging |
| 15 | copyleftdev/faux-foundry | FauxFoundry - Synthetic data generation powered by local LLMs | | Emerging |
| 16 | jehumtine/synthetic_data_generator | This script is designed to convert bodies of text into a question and answer... | | Experimental |
| 17 | jqwangai/SynPT | An Improved Data Synthesis Method Driven by Large Language Models for... | | Experimental |
| 18 | dmeldrum6/synthetic-dataset | Web based tool for generating Q&A datasets from an LLM | | Experimental |
| 19 | Pro-GenAI/DataClassifier | An AI-driven approach to Label LLM Training Data | | Experimental |
| 20 | yzhan238/TELEClass | The source code used for paper "TELEClass: Taxonomy Enrichment and... | | Experimental |
| 21 | MelNajkar/llm-data-augmentation-sentiment | LLM-based synthetic data generation for improving sentiment classification... | | Experimental |
| 22 | danmurf/datakeg | Brew synthetic training data from your documentation using LLMs | | Experimental |
| 23 | Red1998/faux-foundry | 🤖 Generate unique synthetic datasets effortlessly with FauxFoundry, using... | | Experimental |
| 24 | tiddly-gittly/TiddlyWiki-LLM-dataset | WikiText syntax dataset generation pipeline and open dataset for auto UI... | | Experimental |
| 25 | CartographerLabs/Lights-Camera-Extremism | A Social Network Synthetic Dataset Generation Framework | | Experimental |
| 26 | ScottishCoder/AuldLangSynth | AuldLangSynth is an open-source data-centric language synthesis platform... | | Experimental |
| 27 | Ki-Seki/autotab | Automatically fill in missing values in tabular data using in-context... | | Experimental |
| 28 | pezzos/jsonl_dataset_generator | Generate rich JSONL datasets from topics to fine-tune Large Language Models.... | | Experimental |
| 29 | Chessperson/multiomics-synth | R synthpop for proteomics/metabolomics cohorts (your 30 cohorts, 6k+ cols).... | | Experimental |