opendatalab/MinerU-HTML

MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean HTML bodies. Perfect for Deep Research Agents, RAG applications, and training data generation.

Score: 45 / 100 (Emerging)

This tool helps researchers, data scientists, and content analysts gather clean, relevant text from complex web pages. It takes raw HTML from websites and intelligently removes distracting elements like ads and navigation, leaving only the core article or main content. The output is a clean HTML body, ready for further analysis or use in applications like training AI models.

Use this if you need to reliably extract only the main textual content from many diverse web pages, discarding all the surrounding clutter.

Not ideal if you need to extract specific structured data from web pages (like product prices or addresses) rather than general main content, or if you require the full, unmodified HTML.

Tags: web-scraping, data-collection, content-analysis, research-automation, AI-training-data
No package. No dependents.
Maintenance 6 / 25
Adoption 10 / 25
Maturity 13 / 25
Community 16 / 25

Stars: 217
Forks: 24
Language: HTML
License: Apache-2.0
Last pushed: Dec 25, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/rag/opendatalab/MinerU-HTML"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
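
The same request can be made from a script. The following is a minimal Python sketch using only the standard library; it assumes the endpoint returns JSON, which the page does not state, and it makes no assumptions about the fields in the response. The helper names `build_quality_url` and `fetch_quality` are hypothetical and exist only for this illustration.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/rag"


def build_quality_url(owner: str, repo: str) -> str:
    """Build the per-repository URL (hypothetical helper, not part of any official client)."""
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Fetch the quality data for one repository.

    Assumption: the response body is JSON. If it is not, json.loads raises
    a ValueError and the raw body should be inspected instead.
    """
    request = urllib.request.Request(
        build_quality_url(owner, repo),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Keyless access is limited to 100 requests/day, so avoid calling this in a tight loop.
    data = fetch_quality("opendatalab", "MinerU-HTML")
    print(json.dumps(data, indent=2))
```

Printing the whole payload first is a reasonable way to discover the response structure before relying on any particular field.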