opendatalab/MinerU-HTML
MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean HTML bodies. Perfect for Deep Research Agents, RAG applications, and training data generation.
This tool helps researchers, data scientists, and content analysts gather clean, relevant text from complex web pages. It takes raw HTML from websites and intelligently removes distracting elements like ads and navigation, leaving only the core article or main content. The output is a clean HTML body, ready for further analysis or use in applications like training AI models.
Use this if you need to reliably extract only the main textual content from many diverse web pages, discarding all the surrounding clutter.
Not ideal if you need to extract specific structured data from web pages (like product prices or addresses) rather than general main content, or if you require the full, unmodified HTML.
Stars: 217
Forks: 24
Language: HTML
License: Apache-2.0
Category:
Last pushed: Dec 25, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/opendatalab/MinerU-HTML"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
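The curl call above can also be made from a script. The following is a minimal sketch using only the Python standard library. It assumes the endpoint returns a JSON body, which the page does not state, and the helper names `build_url` and `fetch_quality` are hypothetical. It sends no API key, so it would fall under the 100 requests/day limit.

```python
import json
import urllib.parse
import urllib.request

# Base path taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/rag"


def build_url(owner: str, repo: str) -> str:
    """Build the per-repository URL, percent-encoding each path segment."""
    owner_part = urllib.parse.quote(owner, safe="")
    repo_part = urllib.parse.quote(repo, safe="")
    return f"{BASE_URL}/{owner_part}/{repo_part}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record for one repository.

    Assumption: the response body is JSON. If it is not,
    json.loads raises a ValueError.
    """
    request = urllib.request.Request(
        build_url(owner, repo),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only prints the URL; call fetch_quality() to hit the network.
    print(build_url("opendatalab", "MinerU-HTML"))
```

Keeping URL construction separate from the network call makes it easy to check the request target without spending any of the daily quota.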
Higher-rated alternatives
any4ai/AnyCrawl
AnyCrawl: A Node.js/TypeScript crawler that turns websites into LLM-ready data and extracts...
kreuzberg-dev/html-to-markdown
High performance and CommonMark compliant HTML to Markdown converter. Maintained by the...
ScrapeGraphAI/Scrapegraph-ai
Python scraper based on AI
adbar/trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping,...
paulpierre/markdown-crawler
A multithreaded web crawler that recursively crawls a website and creates a markdown file...