carlosplanchon/betterhtmlchunking
BetterHTMLChunking is a Python library for intelligent HTML segmentation. It builds a DOM tree from raw HTML and extracts content-rich regions of interest for downstream analysis, including LLM-based processing.
This tool helps content professionals, researchers, and data analysts break long HTML pages into smaller, meaningful sections while preserving their original structure. You provide a raw HTML document, and it returns a series of HTML chunks or plain-text chunks, so you can analyze specific parts of a document without losing context. It suits anyone who needs to process web content for structured analysis or prepare it for tools like large language models.
Available on PyPI.
Use this if you need to systematically split large HTML documents into smaller, coherent pieces for detailed analysis or further automated processing, while maintaining the hierarchical relationships of the original content.
Not ideal if you only need basic plain-text chunking that ignores the HTML structure entirely, or if your primary goal is simple token-based splitting for language models with no need to preserve document hierarchy.
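To make the idea of structure-preserving chunking concrete, here is a minimal sketch that uses only the Python standard library. It is not BetterHTMLChunking's API or its algorithm: the names `Node`, `TreeBuilder`, `chunk_html`, and `max_len` are hypothetical, and the greedy strategy (keep a subtree whole if it fits the size budget, otherwise descend into its children) is an assumption chosen for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Tiny DOM node (hypothetical helper); tag=None marks a text node."""

    def __init__(self, tag=None, attrs=(), text=""):
        self.tag = tag
        self.attrs = list(attrs)
        self.text = text
        self.children = []

    def to_html(self):
        if self.tag is None:
            return escape(self.text, quote=False)
        attrs = "".join(
            f" {k}" if v is None else f' {k}="{escape(v)}"'
            for k, v in self.attrs
        )
        if self.tag in VOID_TAGS:
            return f"<{self.tag}{attrs}>"
        inner = "".join(child.to_html() for child in self.children)
        return f"<{self.tag}{attrs}>{inner}</{self.tag}>"


class TreeBuilder(HTMLParser):
    """Builds a Node tree from raw HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():  # drop whitespace-only text between tags
            self.stack[-1].children.append(Node(text=data))


def chunk_html(html, max_len):
    """Split HTML into chunks of at most max_len characters where possible.

    Sibling subtrees are packed together while they fit. A subtree that is
    too large is split along its children. A single leaf longer than
    max_len is emitted whole rather than cut mid-element.
    """
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    chunks = []

    def visit(node):
        buffer = ""
        for child in node.children:
            piece = child.to_html()
            if len(piece) > max_len and child.children:
                # Too big to keep whole: flush, then descend into the subtree.
                if buffer:
                    chunks.append(buffer)
                    buffer = ""
                visit(child)
            elif buffer and len(buffer) + len(piece) > max_len:
                chunks.append(buffer)
                buffer = piece
            else:
                buffer += piece
        if buffer:
            chunks.append(buffer)

    visit(builder.root)
    return chunks


sample = (
    "<html><body>"
    "<section><h1>Title</h1><p>First paragraph.</p></section>"
    "<section><h2>More</h2><p>Second paragraph.</p></section>"
    "</body></html>"
)
for index, chunk in enumerate(chunk_html(sample, max_len=60)):
    print(index, chunk)
```

Each chunk is itself well-formed HTML, which is the property that separates this approach from plain character or token splitting. Note that the wrapper elements of a split subtree (here `html` and `body`) are dropped in this sketch; a fuller implementation would likely record each chunk's position in the tree, for example as an XPath.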
Stars: 54
Forks: 7
Language: Python
License: MIT
Category:
Last pushed: Mar 07, 2026
Commits (30d): 0
Dependencies: 7
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/carlosplanchon/betterhtmlchunking"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
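The same request can be made from Python with the standard library. This is a sketch under stated assumptions: the helper names `quality_url` and `fetch_quality` are hypothetical, the URL pattern is generalized from the single example above (including its `nlp` path segment), and the response format is not documented here, so the body is returned as raw text rather than parsed.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(owner, repo, segment="nlp"):
    """Build the endpoint URL; the segment/owner/repo layout is assumed
    from the one example URL shown on this page."""
    parts = (quote(part, safe="") for part in (segment, owner, repo))
    return f"{BASE_URL}/" + "/".join(parts)


def fetch_quality(owner, repo, timeout=10.0):
    """Return the raw response body as text (format not assumed)."""
    request = Request(quality_url(owner, repo))
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_quality("carlosplanchon", "betterhtmlchunking")` issues the same GET request as the curl command and counts against the daily request limit.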