BhabhaAI/dataformer
Solving data for LLMs - Create quality synthetic datasets!
This project helps AI engineers create large, high-quality synthetic datasets for training their models. Given a small set of instructions or examples, it generates diverse, production-ready data, with the aim of reducing compute costs and improving model performance. It targets AI developers and machine learning engineers who need data quickly without relying on extensive real-world datasets.
151 stars. No commits in the last 6 months.
Use this if you are an AI engineer who needs to rapidly produce high-quality synthetic data to train and fine-tune your large language models.
Not ideal if you are looking for a tool to process and clean existing real-world datasets rather than generate new ones.
Stars: 151
Forks: 12
Language: Python
License: Apache-2.0
Category:
Last pushed: Jan 20, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/BhabhaAI/dataformer"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
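The same request can be made from Python. The following is a minimal sketch using only the standard library, not an official client. It assumes the endpoint returns a JSON body (the response schema is not documented here) and that the path after `transformers/` is simply the repository's `owner/name` slug; the helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request

# Base path copied from the curl command above. Treating "transformers" as a
# fixed path segment is an assumption based on that single example URL.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/transformers"


def quality_url(repo: str) -> str:
    """Build the endpoint URL for an "owner/name" repository slug."""
    owner, name = repo.split("/", 1)  # raises ValueError if there is no "/"
    return f"{BASE_URL}/{owner}/{name}"


def fetch_quality(repo: str, timeout: float = 10.0):
    """Send a GET request and decode the body as JSON (assumed format)."""
    with urllib.request.urlopen(quality_url(repo), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("BhabhaAI/dataformer")` would issue the same request as the curl command. Each call counts against the 100 requests/day keyless limit, so caching the result locally is worth considering.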
Higher-rated alternatives
VikParuchuri/textbook_quality
Generate textbook-quality synthetic LLM pretraining data
dmanuel64/codablellm
A framework for creating and curating high-quality code datasets tailored for large language models
BothBosu/Synthetic-Data-for-Scam-Detection-Leveraging-LLMs-to-Train-Deep-Learning-Models
This repository contains the source code and synthetic datasets used in the research on scam...
iiis-ai/TemplateMath
[ICLR 2025 DATA-FM] Training and Evaluating Language Models with Template-based Data Generation...
MichiganNLP/depression_synthetic_data
Can LMs generate useful synthetic data for the mental health domain?