cxcscmu/Craw4LLM

Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"

Emerging

This tool helps AI researchers efficiently gather high-quality web data for training large language models (LLMs). You input the ClueWeb22 dataset and seed documents, and it outputs a refined collection of document IDs, which can then be converted into full text for pretraining. It's designed for machine learning researchers and engineers working on foundational LLM development.
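The input-to-output flow described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the repository's implementation: it assumes a score-prioritized traversal of a link graph, and the names `crawl_doc_ids`, `outlinks`, `scores`, and `budget`, along with the toy document IDs, are all hypothetical. How the real tool scores documents and reads ClueWeb22 is not described here.

```python
import heapq


def crawl_doc_ids(outlinks, scores, seeds, budget):
    """Return up to `budget` document IDs, visiting higher-scored documents first.

    outlinks: dict mapping a document ID to the IDs it links to (hypothetical toy graph)
    scores:   dict mapping a document ID to a quality score (hypothetical values)
    seeds:    iterable of starting document IDs
    """
    seen = set(seeds)
    # heapq is a min-heap, so scores are negated to pop the best document first.
    frontier = [(-scores.get(doc_id, 0.0), doc_id) for doc_id in seen]
    heapq.heapify(frontier)
    selected = []
    while frontier and len(selected) < budget:
        _, doc_id = heapq.heappop(frontier)
        selected.append(doc_id)
        # Queue every not-yet-seen outlink of the document just selected.
        for linked in outlinks.get(doc_id, []):
            if linked not in seen:
                seen.add(linked)
                heapq.heappush(frontier, (-scores.get(linked, 0.0), linked))
    return selected


# Toy data standing in for a link graph and per-document scores (assumed values).
outlinks = {
    "doc-a": ["doc-b", "doc-c"],
    "doc-b": ["doc-d"],
    "doc-c": ["doc-d", "doc-e"],
}
scores = {"doc-a": 0.9, "doc-b": 0.2, "doc-c": 0.8, "doc-d": 0.5, "doc-e": 0.7}

selected = crawl_doc_ids(outlinks, scores, seeds=["doc-a"], budget=3)
print(selected)
```

The result is a list of document IDs only. Fetching the full text for those IDs would be a separate step, as the description notes.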

650 stars. No commits in the last 6 months.

Use this if you need to build a massive, curated web corpus from the ClueWeb22 dataset for pretraining your next large language model.

Not ideal if you're looking for a general-purpose web scraper for small-scale data collection or personal projects.

LLM Pretraining, Web Corpus Creation, Dataset Curation, AI Research, Large Language Models

Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 17 / 25

Stars 650

Language Python

License MIT

Higher-rated alternatives

AI-Planning/l2p

Library for LLM-driven action model acquisition via natural language

datawhalechina/self-llm

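"A Guide to Using Open-Source LLMs": a tutorial tailored for Chinese beginners on quickly fine-tuning (full-parameter/LoRA) and deploying domestic and international open-source large language models (LLMs) and multimodal large language models (MLLMs) in a Linux environment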

microsoft/LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs

theaniketgiri/create-llm

The fastest way to build and start training your own LLM. CLI tool that scaffolds...

liguodongiot/llm-action

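This project aims to share the technical principles behind large models and hands-on experience with them (LLM engineering and putting LLM applications into production)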
