Synthetic Data Generation LLM Tools

Tools for generating synthetic datasets and training data for LLMs through various methods (QA pairs, tabular data, code, structured extraction). This category does NOT include general data processing, data augmentation for images, or dataset annotation/curation platforms.

29 synthetic data generation tools are tracked. Two score above 50 (Established tier). The highest-rated is InternScience/GraphGen at 56/100 with 978 stars. 1 of the top 10 is actively maintained.

Get all 29 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=synthetic-data-generation&limit=20"

Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000 requests/day.
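The same request can be assembled programmatically. Below is a minimal Python sketch that assumes only the endpoint and the three query parameters visible in the curl command. The helper names build_url and fetch_projects are hypothetical, and the shape of the JSON response is not documented here, so the parsed payload is returned unchanged. The command uses limit=20; whether a higher limit returns all 29 projects is an assumption to verify against the API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint copied from the curl command in this listing.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="llm-tools", subcategory="synthetic-data-generation", limit=20):
    """Build the query URL; defaults mirror the curl command."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=10):
    """Fetch and parse the JSON payload (response schema is an assumption: unknown)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Print the URL only; call fetch_projects(build_url()) to hit the network.
    print(build_url())
```

With 100 keyless requests per day, it may be worth caching the parsed result locally rather than refetching on every run.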

#   Tool                                                    Score  Tier          Description
1   InternScience/GraphGen                                  56     Established   GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven...
2   timothepearce/synda                                     52     Established   A CLI for generating synthetic data
3   rasinmuhammed/misata                                    48     Emerging      High-performance open-source synthetic data engine. Uses LLMs for schema...
4   ziegler-ingo/CRAFT                                      46     Emerging      [TACL, EMNLP 2025 Oral] Code, datasets, and checkpoints for the paper "CRAFT...
5   ZhuLinsen/FastDatasets                                  45     Emerging      A powerful tool for creating high-quality training datasets for Large...
6   BatsResearch/bonito                                     43     Emerging      A lightweight library for generating synthetic instruction tuning datasets...
7   asaparov/prontoqa                                       41     Emerging      Synthetic question-answering dataset to formally analyze the...
8   Oqura-ai/deepresearch-datagen-cli                       38     Emerging      Using deep research workflow to generate datasets for finetuning LLMs.
9   nalinrajendran/synthetic-LLM-QA-dataset-generator       37     Emerging      Create synthetic datasets for training and testing Language Learning Models...
10  Itachi-Uchiha581/Auto-Data                              36     Emerging      Auto Data is a library designed for quick and effortless creation of...
11  Alannikos/edg4llm                                       36     Emerging      A unified tool to generate fine-tuning datasets for LLMs, including...
12  GURPREETKAURJETHRA/Synthetic-Data-Generation-using-LLM  34     Emerging      Synthetic Data Generation using LLM via Argilla, Distilabel, ChatGPT, etc.
13  kevinscaria/TarGEN                                      34     Emerging      Targeted Data Generation with Large Language Models
14  Glavin001/Data2AITextbook                               33     Emerging      🚀 Automatically convert unstructured data into a high-quality 'textbook'...
15  copyleftdev/faux-foundry                                31     Emerging      FauxFoundry - Synthetic data generation powered by local LLMs
16  jehumtine/synthetic_data_generator                      27     Experimental  This script is designed to convert bodies of text into a question and answer...
17  jqwangai/SynPT                                          24     Experimental  An Improved Data Synthesis Method Driven by Large Language Models for...
18  dmeldrum6/synthetic-dataset                             22     Experimental  Web based tool for generating Q&A datasets from an LLM
19  Pro-GenAI/DataClassifier                                22     Experimental  An AI-driven approach to Label LLM Training Data
20  yzhan238/TELEClass                                      22     Experimental  The source code used for paper "TELEClass: Taxonomy Enrichment and...
21  MelNajkar/llm-data-augmentation-sentiment               21     Experimental  LLM-based synthetic data generation for improving sentiment classification...
22  danmurf/datakeg                                         21     Experimental  Brew synthetic training data from your documentation using LLMs
23  Red1998/faux-foundry                                    21     Experimental  🤖 Generate unique synthetic datasets effortlessly with FauxFoundry, using...
24  tiddly-gittly/TiddlyWiki-LLM-dataset                    19     Experimental  WikiText syntax dataset generation pipeline and open dataset for auto UI...
25  CartographerLabs/Lights-Camera-Extremism                19     Experimental  A Social Network Synthetic Dataset Generation Framework
26  ScottishCoder/AuldLangSynth                             17     Experimental  AuldLangSynth is an open-source data-centric language synthesis platform...
27  Ki-Seki/autotab                                         17     Experimental  Automatically fill in missing values in tabular data using in-context...
28  pezzos/jsonl_dataset_generator                          15     Experimental  Generate rich JSONL datasets from topics to fine-tune Large Language Models....
29  Chessperson/multiomics-synth                            14     Experimental  R synthpop for proteomics/metabolomics cohorts (your 30 cohorts, 6k+ cols)....
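
The tier breakdown of the ranking can be tallied directly from the scores listed. The sketch below hard-codes the 29 (score, tier) pairs from this listing and reports the count and observed score range per tier. The function name summarize is hypothetical, and the ranges are only what this listing shows: apart from "above 50" for Established, the site's actual tier thresholds are not stated, so the ranges should not be read as cutoffs.

```python
from collections import defaultdict

# (score, tier) pairs transcribed from the ranking, in rank order.
ROWS = (
    [(56, "Established"), (52, "Established")]
    + [(s, "Emerging") for s in (48, 46, 45, 43, 41, 38, 37, 36, 36, 34, 34, 33, 31)]
    + [(s, "Experimental") for s in (27, 24, 22, 22, 22, 21, 21, 21, 19, 19, 17, 17, 15, 14)]
)


def summarize(rows):
    """Return {tier: (count, lowest score, highest score)} for the given rows."""
    by_tier = defaultdict(list)
    for score, tier in rows:
        by_tier[tier].append(score)
    return {tier: (len(s), min(s), max(s)) for tier, s in by_tier.items()}


if __name__ == "__main__":
    for tier, (count, low, high) in summarize(ROWS).items():
        print(f"{tier}: {count} tools, scores {low}-{high}")
```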