allenai/dolma
Data and tools for generating and inspecting OLMo pre-training data.
This project provides both a massive, open-source dataset and a powerful toolkit for preparing data to train large language models. It ingests diverse raw sources (web content, academic papers, code, books, and encyclopedic material) and outputs a curated, cleaned text dataset. Its primary users are machine learning researchers and engineers developing large language models.
1,447 stars. Used by 1 other package. Available on PyPI.
Use this if you are developing large language models and need a vast, pre-curated text dataset or robust tools to create your own high-quality training corpora efficiently.
Not ideal if you are looking for a pre-trained language model or a tool for general text analysis tasks, as its focus is specifically on data preparation for LLM training.
Stars
1,447
Forks
178
Language
Python
License
Apache-2.0
Last pushed
Nov 05, 2025
Commits (30d)
0
Dependencies
24
Reverse dependents
1
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/allenai/dolma"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
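The curl command above can also be issued from Python. A minimal sketch using only the standard library, assuming the endpoint returns JSON (the response schema is not documented here, and the `quality_url`/`fetch_quality` helper names are illustrative, not part of any official client):

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(ecosystem: str, owner: str, repo: str) -> str:
    """Build the quality-endpoint URL shown in the curl example above."""
    return f"{BASE}/{ecosystem}/{owner}/{repo}"


def fetch_quality(ecosystem: str, owner: str, repo: str) -> dict:
    """Fetch one package's quality record; assumes a JSON response body."""
    with urllib.request.urlopen(quality_url(ecosystem, owner, repo)) as resp:
        return json.load(resp)


# The same request as the curl example above:
url = quality_url("transformers", "allenai", "dolma")
```

Note the anonymous tier is rate-limited to 100 requests/day, so a client that polls many packages should batch or cache results rather than re-fetching on every call.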
Related projects
waikato-llm/llm-dataset-converter
For converting LLM datasets from one format into another.
refuel-ai/autolabel
Label, clean and enrich text datasets with LLMs.
niclasgriesshaber/llm_patent_pipeline
LLMs for Historical Dataset Construction from Archival Image Scans
cgxjdzz/FeatureForge-LLM
FeatureForge LLM is a Python package that leverages large language models (LLMs) to automate and...
codeastra2/llm-feat
Automated feature engineering using Large Language Models (LLMs) for tabular data