EastTower16/LLMDataDistill
Distill large-scale web page text
This tool helps researchers and data scientists who work with large text datasets, especially web page content like news articles or blog posts. It takes a massive collection of raw web text as input and processes it to remove duplicate entries and filter out low-quality, marketing-heavy content. The output is a cleaner, more focused dataset suitable for training large language models or for in-depth analysis.
No commits in the last 6 months.
Use this if you need to clean and pre-process an extremely large, raw corpus of web page text, like the WuDao dataset, to prepare it for machine learning applications or detailed content analysis.
Not ideal if your data is not web page text, if you need a solution for smaller datasets, or if you don't have access to a CUDA-enabled GPU.
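The repository itself is written in C++ and its exact pipeline is not documented on this page; as a rough illustration only, the dedup-then-filter flow described above can be sketched in Python. The `SPAM_MARKERS` list and the whitespace normalization are hypothetical heuristics, not the repo's actual rules:

```python
import hashlib
import re

# Hypothetical marketing phrases used as a quality heuristic;
# the repo's real filter criteria are not documented here.
SPAM_MARKERS = ("click here", "limited time offer", "buy now", "subscribe")

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so near-identical pages hash alike.
    return re.sub(r"\s+", " ", text).strip().lower()

def distill(pages):
    seen = set()
    for page in pages:
        norm = normalize(page)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        if any(marker in norm for marker in SPAM_MARKERS):
            continue  # drop marketing-heavy content
        yield page

docs = [
    "Breaking news:  markets rally.",
    "Breaking news: markets rally.",
    "Buy now! Limited time offer!",
]
print(list(distill(docs)))  # keeps only the first article
```

A production tool would use near-duplicate detection (e.g. MinHash) and a learned quality classifier rather than exact hashing and keyword lists, but the two-stage shape (dedup, then filter) is the same.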
Stars
12
Forks
—
Language
C++
License
Apache-2.0
Category
Last pushed
Jul 29, 2023
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/EastTower16/LLMDataDistill"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
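The same endpoint can be called from Python with the standard library. Assuming the endpoint returns JSON (the response schema is not shown on this page), a minimal client might look like this; `quality_url` and `fetch_quality` are illustrative helper names, not part of any published SDK:

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"

def quality_url(repo: str) -> str:
    # Build the endpoint URL for an "owner/name" repo slug.
    return f"{BASE}/{repo}"

def fetch_quality(repo: str) -> dict:
    # Assumption: the endpoint returns a JSON object, as the curl
    # example above suggests but does not spell out.
    with urllib.request.urlopen(quality_url(repo)) as resp:
        return json.load(resp)

# Example usage (performs a live HTTP request):
# data = fetch_quality("EastTower16/LLMDataDistill")
# print(data)
```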
Higher-rated alternatives
jncraton/languagemodels
Explore large language models in 512MB of RAM
microsoft/unilm
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
haizelabs/verdict
Inference-time scaling for LLMs-as-a-judge.
albertan017/LLM4Decompile
Reverse Engineering: Decompiling Binary Code with Large Language Models
bytedance/Sa2VA
Official Repo For Pixel-LLM Codebase