malteos/llm-datasets
A collection of datasets for language model pretraining, including scripts for downloading, preprocessing, and sampling.
This tool helps researchers and engineers curate high-quality text data for training large language models. It takes raw text from various sources, extracts the plain text, deduplicates it, and writes the result to a structured format such as JSONL or Parquet. It is aimed at anyone building or fine-tuning custom language models for specific applications.
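The extract, deduplicate, and export steps described above can be illustrated with a minimal Python sketch. It uses only the standard library and is not the llm-datasets API: the function names (`extract_plain_text`, `deduplicate`, `to_jsonl_lines`, `write_jsonl`) and the `id`/`text` record fields are hypothetical, the tag stripping is a crude regex, and deduplication is by exact match on lowercased text, which is far simpler than what a production pipeline would likely use.

```python
import hashlib
import html
import json
import re

TAG_RE = re.compile(r"<[^>]+>")   # crude HTML-like tag matcher (assumption)
WS_RE = re.compile(r"\s+")


def extract_plain_text(raw):
    """Strip HTML-like tags, unescape entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", raw)
    return WS_RE.sub(" ", html.unescape(no_tags)).strip()


def deduplicate(texts):
    """Yield each non-empty text once, comparing SHA-256 hashes of the lowercased text."""
    seen = set()
    for text in texts:
        if not text:
            continue  # drop documents that are empty after extraction
        key = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield text


def to_jsonl_lines(texts):
    """Serialize each text as one JSON object per line (hypothetical schema)."""
    for i, text in enumerate(texts):
        yield json.dumps({"id": i, "text": text}, ensure_ascii=False)


def write_jsonl(texts, path):
    """Write the texts to a JSONL file and return the number of records written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for line in to_jsonl_lines(texts):
            f.write(line + "\n")
            count += 1
    return count
```

A call such as `write_jsonl(deduplicate(extract_plain_text(r) for r in raw_docs), "corpus.jsonl")` chains the three steps, where `raw_docs` and the output file name are placeholders. Hashing keeps memory use proportional to the number of unique documents, not their total size, at the cost of missing near-duplicates.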
Used by 1 other package. No commits in the last 6 months. Available on PyPI.
Use this if you need to build a comprehensive, clean, and appropriately formatted text dataset for pre-training or fine-tuning a language model.
Not ideal if you are looking for an off-the-shelf, pre-trained language model, or if your primary need is data analysis rather than dataset preparation for model training.
Stars: 64
Forks: 6
Language: Python
License: Apache-2.0
Category:
Last pushed: Jul 29, 2024
Commits (30d): 0
Dependencies: 9
Reverse dependents: 1
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/malteos/llm-datasets"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
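The same request can be sketched in Python with the standard library. This is an illustration under assumptions: the page does not document the URL scheme or the response format, so treating the `transformers` path segment as a category is a guess, the helper names `build_quality_url` and `fetch_quality` are hypothetical, and the response body is returned as raw text instead of being parsed.

```python
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category, repo):
    """Build the endpoint URL; `category` as a path segment is an assumption."""
    return f"{BASE_URL}/{quote(category, safe='')}/{quote(repo, safe='/')}"


def fetch_quality(category, repo, timeout=10.0):
    """GET the endpoint and return the raw response body as text (format undocumented)."""
    url = build_quality_url(category, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `fetch_quality("transformers", "malteos/llm-datasets")` issues the same request as the curl command and counts against the daily request limit.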
Higher-rated alternatives
mlabonne/llm-datasets
Curated list of datasets and tools for post-training.
magpie-align/magpie
[ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your...
jd-coderepos/llms4subjects
The official SemEval 2025 Task 5 - LLMs4Subjects - Shared Task Dataset repository
willxxy/ECG-Bench
A Unified Framework for Benchmarking Generative Electrocardiogram-Language Models (ELMs)
geobrain-ai/geogalactica
Code and datasets for paper "GeoGalactica: A Scientific Large Language Model in Geoscience"