BhabhaAI/dataformer

Solving data for LLMs - Create quality synthetic datasets!

/ 100

Emerging

This project helps AI engineers efficiently create large, high-quality synthetic datasets to train their AI models. It takes a small set of instructions or examples and generates diverse, production-ready data, helping to reduce compute costs and improve model performance. It is designed for AI developers and machine learning engineers who need to quickly generate data without relying on extensive real-world datasets.

151 stars. No commits in the last 6 months.

Use this if you are an AI engineer who needs to rapidly produce high-quality synthetic data to train and fine-tune your large language models.

Not ideal if you are looking for a tool to process and clean existing real-world datasets rather than generate new ones.

AI development, Machine learning engineering, LLM training, Data generation, Synthetic data

Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 12 / 25

Stars: 151

Forks

Language: Python

License: Apache-2.0

Higher-rated alternatives

VikParuchuri/textbook_quality

Generate textbook-quality synthetic LLM pretraining data

dmanuel64/codablellm

A framework for creating and curating high-quality code datasets tailored for large language models

BothBosu/Synthetic-Data-for-Scam-Detection-Leveraging-LLMs-to-Train-Deep-Learning-Models

This repository contains the source code and synthetic datasets used in the research on scam...

iiis-ai/TemplateMath

[ICLR 2025 DATA-FM] Training and Evaluating Language Models with Template-based Data Generation...

MichiganNLP/depression_synthetic_data

Can LMs generate useful synthetic data for the mental health domain?
