malteos/llm-datasets

A collection of datasets for language model pretraining, including scripts for downloading, preprocessing, and sampling.


Emerging

This tool helps researchers and engineers curate high-quality text data for training large language models. It takes raw text from various sources, extracts plain text, deduplicates it, and writes the result to a structured format such as JSONL or Parquet. It is aimed at anyone building or fine-tuning custom language models for specific applications.
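The steps described above (deduplicating text and writing it out as JSONL) can be sketched in a few lines of standard-library Python. This is only an illustration of the general technique, not the llm-datasets API: the function names `normalize`, `deduplicate`, and `to_jsonl`, the `text` field name, and the choice of exact-match hashing are all assumptions made for this example.

```python
import hashlib
import json
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(texts):
    """Yield each text whose normalized form has not been seen before.

    Hypothetical helper: exact-match deduplication via SHA-256 of the
    normalized text. Real pipelines often use fuzzy methods instead.
    """
    seen = set()
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield text


def to_jsonl(texts) -> str:
    """Serialize texts as JSONL: one JSON object per line."""
    return "\n".join(json.dumps({"text": t}, ensure_ascii=False) for t in texts)


# Illustrative input: the first two documents differ only in case and spacing.
raw = ["Hello  world.", "hello world.", "Another document."]
unique = list(deduplicate(raw))
jsonl = to_jsonl(unique)
print(jsonl)
```

Exact hashing only catches copies that match after normalization. Near-duplicate detection (for example MinHash) would be needed for documents that differ by more than case and whitespace.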

Used by 1 other package. No commits in the last 6 months. Available on PyPI.

Use this if you need to build a comprehensive, clean, and appropriately formatted text dataset for pre-training or fine-tuning a language model.

Not ideal if you are looking for an off-the-shelf, pre-trained language model, or if your primary need is data analysis rather than dataset preparation for model training.

natural-language-processing machine-learning-engineering data-curation text-mining language-model-training

Stale 6m

Maintenance 0 / 25

Adoption 9 / 25

Maturity 25 / 25

Community 11 / 25


Language

Python

License

Apache-2.0

Higher-rated alternatives

mlabonne/llm-datasets

Curated list of datasets and tools for post-training.

magpie-align/magpie

[ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your...

jd-coderepos/llms4subjects

The official SemEval 2025 Task 5 - LLMs4Subjects - Shared Task Dataset repository

willxxy/ECG-Bench

A Unified Framework for Benchmarking Generative Electrocardiogram-Language Models (ELMs)

geobrain-ai/geogalactica

Code and datasets for paper "GeoGalactica: A Scientific Large Language Model in Geoscience"
