cxcscmu/Craw4LLM

Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"

43 / 100 (Emerging)

This tool helps AI researchers efficiently gather high-quality web data for training large language models (LLMs). You input the ClueWeb22 dataset and seed documents, and it outputs a refined collection of document IDs, which can then be converted into full text for pretraining. It's designed for machine learning researchers and engineers working on foundational LLM development.

650 stars. No commits in the last 6 months.

Use this if you need to build a massive, curated web corpus from the ClueWeb22 dataset for pretraining your next large language model.

Not ideal if you're looking for a general-purpose web scraper for small-scale data collection or personal projects.

LLM Pretraining, Web Corpus Creation, Dataset Curation, AI Research, Large Language Models
Stale 6m, No Package, No Dependents
Maintenance 0 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 17 / 25


Stars: 650
Forks: 60
Language: Python
License: MIT
Last pushed: Feb 24, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/cxcscmu/Craw4LLM"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
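The same request can be made from Python. The sketch below uses only the standard library and mirrors the curl command above. The helper names `quality_url` and `fetch_quality` are hypothetical, and the assumption that the endpoint returns JSON is not confirmed by this page, so treat the parsing step as illustrative rather than as the API's documented behavior.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example; everything else here is illustrative.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def quality_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for one repository (hypothetical helper)."""
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Fetch the quality record for a repository.

    Assumption: the response body is JSON. The page does not document
    the response format, so adjust the parsing if it differs.
    """
    request = urllib.request.Request(
        quality_url(owner, repo),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Printing the URL makes no network call and uses no quota.
    print(quality_url("cxcscmu", "Craw4LLM"))
    # Uncomment to send a real request (counts against the daily limit):
    # print(json.dumps(fetch_quality("cxcscmu", "Craw4LLM"), indent=2))
```

Keeping URL construction separate from the network call makes it easy to check the request target without spending any of the 100 keyless requests per day.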