seedatnabeel/CLLM
Curated LLM (ICML 2024)
This project is for data scientists and machine learning engineers who need to train models but have very little real-world data. It takes a small original dataset and generates additional high-quality synthetic tabular data; the expanded dataset can then be used for more robust model training in scenarios where collecting more real data is difficult or costly.
No commits in the last 6 months.
Use this if you are a data scientist or ML engineer working with tabular data and struggling with model performance due to an insufficient amount of training examples.
Not ideal if your primary goal is to generate text or image data, or if you already have abundant real data for your tabular machine learning tasks.
Stars: 14
Forks: 4
Language: Jupyter Notebook
License: Apache-2.0
Category:
Last pushed: Oct 23, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/seedatnabeel/CLLM"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000 requests/day.
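The curl call above can also be wrapped in a few lines of Python. This is a minimal sketch using only the standard library; it assumes the endpoint returns JSON (the response schema is not documented on this page), and the `category`/`repo` path-segment names are labels chosen here for illustration:

```python
import json
import urllib.request

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, repo: str) -> str:
    # Build the endpoint URL for a repo's quality data,
    # mirroring the path shape of the curl example above.
    return f"{BASE_URL}/{category}/{repo}"

def fetch_quality(category: str, repo: str, timeout: float = 10.0) -> dict:
    # Fetch the quality record. The JSON schema is not shown on this
    # page, so the response is returned as a plain dict.
    with urllib.request.urlopen(quality_url(category, repo), timeout=timeout) as resp:
        return json.load(resp)

# Example call (performs a network request):
# data = fetch_quality("transformers", "seedatnabeel/CLLM")
```

Within the free tier, a key (once obtained) would presumably be passed as a header or query parameter; the page does not say which, so that part is left out here.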
Higher-rated alternatives
mlabonne/llm-datasets: Curated list of datasets and tools for post-training.
malteos/llm-datasets: A collection of datasets for language model pretraining, including scripts for downloading,...
magpie-align/magpie: [ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your...
jd-coderepos/llms4subjects: The official SemEval 2025 Task 5 (LLMs4Subjects) shared-task dataset repository.
willxxy/ECG-Bench: A Unified Framework for Benchmarking Generative Electrocardiogram-Language Models (ELMs).