EastTower16/LLMDataDistill
Distill large-scale web page text
This tool helps researchers and data scientists who work with large text datasets, especially web page content like news articles or blog posts. It takes a massive collection of raw web text as input and processes it to remove duplicate entries and filter out low-quality, marketing-heavy content. The output is a cleaner, more focused dataset suitable for training large language models or for in-depth analysis.
No commits in the last 6 months.
Use this if you need to clean and pre-process an extremely large, raw corpus of web page text, like the WuDao dataset, to prepare it for machine learning applications or detailed content analysis.
Not ideal if your data is not web page text, if you need a solution for smaller datasets, or if you don't have access to a CUDA-enabled GPU.
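The repository itself is written in C++ and its exact pipeline is not documented on this page; as a rough illustration only, the dedup-then-filter flow described above can be sketched in Python. The `SPAM_MARKERS` list and the whitespace normalization are hypothetical heuristics, not the repo's actual rules:

```python
import hashlib
import re

# Hypothetical marketing phrases used as a quality heuristic;
# the repo's real filter criteria are not documented here.
SPAM_MARKERS = ("click here", "limited time offer", "buy now", "subscribe")

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so near-identical pages hash alike.
    return re.sub(r"\s+", " ", text).strip().lower()

def distill(pages):
    seen = set()
    for page in pages:
        norm = normalize(page)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        if any(marker in norm for marker in SPAM_MARKERS):
            continue  # drop marketing-heavy content
        yield page

docs = [
    "Breaking news:  markets rally.",
    "Breaking news: markets rally.",
    "Buy now! Limited time offer!",
]
print(list(distill(docs)))  # keeps only the first article
```

A production tool would use near-duplicate detection (e.g. MinHash) and a learned quality classifier rather than exact hashing and keyword lists, but the two-stage shape (dedup, then filter) is the same.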
Stars
12
Forks
—
Language
C++
License
Apache-2.0
Category
Last pushed
Jul 29, 2023
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/EastTower16/LLMDataDistill"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
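The same endpoint can be called from Python with the standard library. Assuming the endpoint returns JSON (the response schema is not shown on this page), a minimal client might look like this; `quality_url` and `fetch_quality` are illustrative helper names, not part of any published SDK:

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"

def quality_url(repo: str) -> str:
    # Build the endpoint URL for an "owner/name" repo slug.
    return f"{BASE}/{repo}"

def fetch_quality(repo: str) -> dict:
    # Assumption: the endpoint returns a JSON object, as the curl
    # example above suggests but does not spell out.
    with urllib.request.urlopen(quality_url(repo)) as resp:
        return json.load(resp)

# Example usage (performs a live HTTP request):
# data = fetch_quality("EastTower16/LLMDataDistill")
# print(data)
```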
Higher-rated alternatives
jncraton/languagemodels
Explore large language models in 512MB of RAM
microsoft/unilm
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
haizelabs/verdict
Inference-time scaling for LLMs-as-a-judge.
albertan017/LLM4Decompile
Reverse Engineering: Decompiling Binary Code with Large Language Models
bytedance/Sa2VA
Official Repo For Pixel-LLM Codebase