seedatnabeel/CLLM
Curated LLM (ICML 2024)
This project is for data scientists and machine learning engineers who need to train models but have very little real-world data. It takes a small original dataset and generates additional high-quality synthetic tabular data; the expanded dataset can then be used for more robust model training in scenarios where collecting more real data is difficult or costly.
No commits in the last 6 months.
Use this if you are a data scientist or ML engineer working with tabular data and struggling with model performance due to an insufficient amount of training examples.
Not ideal if your primary goal is to generate text or image data, or if you already have abundant real data for your tabular machine learning tasks.
Stars: 14
Forks: 4
Language: Jupyter Notebook
License: Apache-2.0
Category:
Last pushed: Oct 23, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/seedatnabeel/CLLM"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000 requests/day.
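The curl call above can also be wrapped in a few lines of Python. This is a minimal sketch using only the standard library; it assumes the endpoint returns JSON (the response schema is not documented on this page), and the `category`/`repo` path-segment names are labels chosen here for illustration:

```python
import json
import urllib.request

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, repo: str) -> str:
    # Build the endpoint URL for a repo's quality data,
    # mirroring the path shape of the curl example above.
    return f"{BASE_URL}/{category}/{repo}"

def fetch_quality(category: str, repo: str, timeout: float = 10.0) -> dict:
    # Fetch the quality record. The JSON schema is not shown on this
    # page, so the response is returned as a plain dict.
    with urllib.request.urlopen(quality_url(category, repo), timeout=timeout) as resp:
        return json.load(resp)

# Example call (performs a network request):
# data = fetch_quality("transformers", "seedatnabeel/CLLM")
```

Within the free tier, a key (once obtained) would presumably be passed as a header or query parameter; the page does not say which, so that part is left out here.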
Higher-rated alternatives
mlabonne/llm-datasets: Curated list of datasets and tools for post-training.
malteos/llm-datasets: A collection of datasets for language model pretraining, including scripts for downloading,...
magpie-align/magpie: [ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your...
jd-coderepos/llms4subjects: The official SemEval 2025 Task 5 (LLMs4Subjects) shared-task dataset repository.
willxxy/ECG-Bench: A Unified Framework for Benchmarking Generative Electrocardiogram-Language Models (ELMs).