jehumtine/synthetic_data_generator

This script converts bodies of text into question-and-answer pairs in JSON format using the GPT-4 language model. It extracts text from PDF files, tokenizes the text, generates questions and answers, and saves the results to a JSON file.
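The steps described above (tokenize, generate, save) can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: it assumes the PDF text has already been extracted to a string, approximates tokenization with whitespace splitting, and replaces the GPT-4 call with a hypothetical stand-in named `fake_generate_qa`. All function names and the `question`/`answer` keys are assumptions made for this example.

```python
import json
from typing import Callable, Dict, Iterable, List

QAPair = Dict[str, str]


def chunk_text(text: str, max_tokens: int = 200) -> List[str]:
    """Split text into chunks of at most max_tokens whitespace-separated tokens.

    Whitespace splitting is an assumption; a real pipeline would likely use
    the model's own tokenizer to respect its context limit.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


def build_qa_dataset(
    text: str,
    generate_qa: Callable[[str], Iterable[QAPair]],
    max_tokens: int = 200,
) -> List[QAPair]:
    """Run the supplied generator over each chunk and collect the pairs."""
    dataset: List[QAPair] = []
    for chunk in chunk_text(text, max_tokens):
        for pair in generate_qa(chunk):
            # Keep only the two expected keys so the output schema is stable.
            dataset.append(
                {"question": pair["question"], "answer": pair["answer"]}
            )
    return dataset


def save_dataset(dataset: List[QAPair], path: str) -> None:
    """Write the collected pairs to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(dataset, handle, ensure_ascii=False, indent=2)


def fake_generate_qa(chunk: str) -> List[QAPair]:
    """Hypothetical stand-in for the model call; a real run would query GPT-4."""
    first_word = chunk.split()[0]
    return [
        {
            "question": f"What does the passage starting with '{first_word}' say?",
            "answer": chunk,
        }
    ]
```

Passing the generator in as a callable keeps the chunking and saving logic testable without network access; swapping `fake_generate_qa` for a function that calls a real model would leave the rest unchanged.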


Experimental

This tool helps you quickly turn large PDF documents, like manuals or research papers, into structured question-and-answer pairs. It takes your PDF files as input and automatically generates relevant questions and their answers using an AI model, outputting them into a standard JSON file. This is useful for educators, trainers, or content creators who need to build knowledge bases or practice materials from existing textual content.

No commits in the last 6 months.

Use this if you need to rapidly create question-and-answer datasets from your PDF documents without manually drafting each question and answer.

Not ideal if you require highly nuanced or subjective Q&A pairs that need deep human understanding or specific domain expertise not easily captured by an AI.

content-creation education training-materials knowledge-management document-processing

No License, Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 6 / 25

Maturity 8 / 25

Community 13 / 25


Language: Python

License: none

Higher-rated alternatives

InternScience/GraphGen

GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

timothepearce/synda

A CLI for generating synthetic data

rasinmuhammed/misata

High-performance open-source synthetic data engine. Uses LLMs for schema design and vectorized...

ziegler-ingo/CRAFT

[TACL, EMNLP 2025 Oral] Code, datasets, and checkpoints for the paper "CRAFT Your Dataset:...

ZhuLinsen/FastDatasets

A powerful tool for creating high-quality training datasets for Large Language Models...
