EastTower16/LLMDataDistill

Distill large-scale web page text.

21 / 100 (Experimental)

This tool helps researchers and data scientists who work with large text datasets, especially web page content like news articles or blog posts. It takes a massive collection of raw web text as input and processes it to remove duplicate entries and filter out low-quality, marketing-heavy content. The output is a cleaner, more focused dataset suitable for training large language models or for in-depth analysis.
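As an illustration of the kind of pipeline this describes, here is a minimal sketch combining exact hash-based deduplication with a keyword-ratio heuristic for marketing-heavy pages. The function name `distill`, the keyword list, and the threshold are all hypothetical; LLMDataDistill's actual algorithms (and its CUDA implementation) may differ entirely.

```python
# Illustrative sketch only: hypothetical exact dedup plus a keyword
# heuristic for marketing-heavy text. Not LLMDataDistill's actual code.
import hashlib

# Hypothetical list of phrases treated as marketing signals.
MARKETING_TERMS = {"buy now", "limited offer", "subscribe", "discount"}

def distill(pages, max_marketing_ratio=0.02):
    """Drop exact duplicates, then pages dense with marketing phrases."""
    seen = set()
    kept = []
    for text in pages:
        # Exact dedup: hash the normalized text and skip repeats.
        digest = hashlib.sha256(text.strip().lower().encode()).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Quality filter: ratio of marketing-phrase hits to word count.
        words = max(len(text.split()), 1)
        hits = sum(text.lower().count(term) for term in MARKETING_TERMS)
        if hits / words > max_marketing_ratio:
            continue
        kept.append(text)
    return kept
```

A real large-scale pipeline would typically use near-duplicate detection (e.g. MinHash) and a learned quality classifier rather than exact hashes and a keyword list, but the two-stage shape (dedup, then quality filter) is the same.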

No commits in the last 6 months.

Use this if you need to clean and pre-process an extremely large, raw corpus of web page text, like the WuDao dataset, to prepare it for machine learning applications or detailed content analysis.

Not ideal if your data is not web page text, if you need a solution for smaller datasets, or if you don't have access to a CUDA-enabled GPU.

Tags: text-corpus-preparation, web-content-curation, large-language-model-training, data-quality-management, natural-language-processing

Stale (6m) · No package · No dependents
Maintenance: 0 / 25
Adoption: 5 / 25
Maturity: 16 / 25
Community: 0 / 25


Stars: 12
Forks:
Language: C++
License: Apache-2.0
Last pushed: Jul 29, 2023
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/EastTower16/LLMDataDistill"

Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000 requests/day.