cxcscmu/Craw4LLM
Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"
This tool helps AI researchers efficiently gather high-quality web data for training large language models (LLMs). You input the ClueWeb22 dataset and seed documents, and it outputs a refined collection of document IDs, which can then be converted into full text for pretraining. It's designed for machine learning researchers and engineers working on foundational LLM development.
650 stars. No commits in the last 6 months.
Use this if you need to build a massive, curated web corpus from the ClueWeb22 dataset for pretraining your next large language model.
Not ideal if you're looking for a general-purpose web scraper for small-scale data collection or personal projects.
Stars: 650
Forks: 60
Language: Python
License: MIT
Category:
Last pushed: Feb 24, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/cxcscmu/Craw4LLM"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
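The same request can be made from Python. The following is a minimal sketch using only the standard library. It assumes the URL path follows the pattern category/owner/repository, which is inferred from the single curl example above and is not a documented rule. The helper names `build_quality_url` and `fetch_quality` are hypothetical, and because the response format is not described here, the body is returned as raw text rather than parsed.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base path copied from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(owner, repo, category="llm-tools"):
    """Build the endpoint URL for one repository.

    Assumption: the path layout is <category>/<owner>/<repo>, inferred
    from the one example URL shown on this page.
    """
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(owner, repo, category="llm-tools", timeout=10.0):
    """Fetch the record and return the response body as text.

    The response schema is not documented here, so nothing is parsed.
    """
    url = build_quality_url(owner, repo, category)
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)
```

Calling `fetch_quality("cxcscmu", "Craw4LLM")` issues the same GET request as the curl command. Without a key, requests count against the 100 per day limit, so callers that poll many repositories may want to cache results.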
Higher-rated alternatives
AI-Planning/l2p
Library for LLM-driven action model acquisition via natural language
datawhalechina/self-llm
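"A User's Guide to Open-Source Large Models": a tutorial tailored for beginners in China on quickly fine-tuning (full-parameter/LoRA) and deploying Chinese and international open-source large language models (LLMs) and multimodal large models (MLLMs) in a Linux environment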
microsoft/LMOps
General technology for enabling AI capabilities w/ LLMs and MLLMs
theaniketgiri/create-llm
The fastest way to build and start training your own LLM. CLI tool that scaffolds...
liguodongiot/llm-action
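This project shares the technical principles behind large models along with hands-on experience (large-model engineering and putting large-model applications into production)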