opendatalab/MinerU-HTML

MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean HTML bodies. Perfect for Deep Research Agents, RAG applications, and training data generation.

/ 100

Emerging

This tool helps researchers, data scientists, and content analysts gather clean, relevant text from complex web pages. It takes raw HTML from websites and uses a small language model (SLM) to remove distracting elements such as ads and navigation, leaving only the core article or main content. The output is a clean HTML body, ready for further analysis or for use in applications such as training AI models.
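To make the input and output shape concrete, here is a minimal sketch of main-content extraction in Python. It is not MinerU-HTML's method or API: the project is described as SLM-powered, while this stand-in uses a fixed list of tag names as a crude rule. The names `CLUTTER_TAGS`, `ClutterStripper`, and `strip_clutter` are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser

# Tags treated as clutter in this sketch (an assumption, not MinerU-HTML's rules).
CLUTTER_TAGS = {"nav", "header", "footer", "aside", "script", "style", "form"}


class ClutterStripper(HTMLParser):
    """Re-emit HTML while skipping everything inside clutter elements."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a clutter subtree

    def handle_starttag(self, tag, attrs):
        if tag in CLUTTER_TAGS:
            self.skip_depth += 1
        elif not self.skip_depth:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, with no extra end tag.
        if not self.skip_depth and tag not in CLUTTER_TAGS:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag in CLUTTER_TAGS:
                self.skip_depth -= 1
        else:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.parts.append(f"&#{name};")


def strip_clutter(raw_html: str) -> str:
    """Return raw_html with all clutter subtrees removed."""
    parser = ClutterStripper()
    parser.feed(raw_html)
    parser.close()
    return "".join(parser.parts)
```

A tag list like this breaks down on pages that build their navigation and ads out of generic `div` elements, which is the kind of case a learned extractor is meant to handle.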

217 stars.

Use this if you need to reliably extract only the main textual content from many diverse web pages, discarding all the surrounding clutter.

Not ideal if you need to extract specific structured data from web pages (like product prices or addresses) rather than general main content, or if you require the full, unmodified HTML.

web-scraping data-collection content-analysis research-automation AI-training-data

No Package No Dependents

Maintenance 6 / 25

Adoption 10 / 25

Maturity 13 / 25

Community 16 / 25

Stars

217

Forks

Language

HTML

License

Apache-2.0

Higher-rated alternatives

any4ai/AnyCrawl

AnyCrawl 🚀: A Node.js/TypeScript crawler that turns websites into LLM-ready data and extracts...

kreuzberg-dev/html-to-markdown

High performance and CommonMark compliant HTML to Markdown converter. Maintained by the...

ScrapeGraphAI/Scrapegraph-ai

Python scraper based on AI

adbar/trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping,...

paulpierre/markdown-crawler

A multithreaded 🕸️ web crawler that recursively crawls a website and creates a 🔽 markdown file...
