carlosplanchon/betterhtmlchunking

BetterHTMLChunking is a Python library for intelligent HTML segmentation. It builds a DOM tree from raw HTML and extracts content-rich regions of interest, which simplifies downstream content analysis. It is well suited to LLM-based processing.


Established

This tool helps content professionals, researchers, and data analysts break long HTML pages into smaller, meaningful sections while preserving their original structure. You provide a raw HTML document, and it returns a series of HTML chunks or plain-text chunks, so you can analyze specific parts of a document without losing context. It suits anyone who needs to process web content for structured analysis or prepare it for tools such as large language models.
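The general idea of structure-preserving chunking can be sketched with the Python standard library alone. This is not BetterHTMLChunking's API or its algorithm: the names `Node`, `TreeBuilder`, `chunk_html`, and `max_len` are hypothetical, and serialized character length stands in for whatever size measure the library actually uses. The sketch parses HTML into a small tree, keeps any subtree that fits the budget intact, and descends only into subtrees that are too large.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """Hypothetical minimal DOM node; tag is None for text nodes."""

    def __init__(self, tag, attrs=(), parent=None):
        self.tag = tag
        self.attrs = list(attrs)
        self.parent = parent
        self.children = []
        self.text = ""

    def to_html(self):
        if self.tag is None:
            return escape(self.text, quote=False)
        attrs = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in self.attrs
        )
        if self.tag in VOID_TAGS:
            return f"<{self.tag}{attrs}>"
        inner = "".join(child.to_html() for child in self.children)
        return f"<{self.tag}{attrs}>{inner}</{self.tag}>"


class TreeBuilder(HTMLParser):
    """Builds a Node tree under a synthetic root element."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore strays.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            text = Node(None, parent=self.current)
            text.text = data
            self.current.children.append(text)


def chunk_html(html_text, max_len=200):
    """Split HTML into chunks of at most max_len characters where possible.

    Subtrees that fit are kept whole; oversized subtrees are split along
    their children. A leaf larger than max_len is emitted as-is.
    """
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    chunks = []

    def visit(node):
        pending, size = [], 0  # consecutive siblings that still fit together
        for child in node.children:
            piece = child.to_html()
            if len(piece) > max_len and child.children:
                # Too large: flush siblings gathered so far, then descend.
                if pending:
                    chunks.append("".join(pending))
                    pending, size = [], 0
                visit(child)
            elif pending and size + len(piece) > max_len:
                chunks.append("".join(pending))
                pending, size = [piece], len(piece)
            else:
                pending.append(piece)
                size += len(piece)
        if pending:
            chunks.append("".join(pending))

    visit(builder.root)
    return chunks


if __name__ == "__main__":
    page = "<div><section><h2>A</h2><p>First.</p></section><section><h2>B</h2><p>Second.</p></section></div>"
    for chunk in chunk_html(page, max_len=50):
        print(chunk)
```

Because the split points follow element boundaries, each chunk is itself a well-formed HTML fragment, which is the property that distinguishes this approach from fixed-width text splitting. When a parent is split, its own opening and closing tags are dropped from the output in this sketch; a fuller implementation might record the ancestor path alongside each chunk.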

Available on PyPI.

Use this if you need to systematically split large HTML documents into smaller, coherent pieces for detailed analysis or further automated processing, while maintaining the hierarchical relationships of the original content.

Not ideal if you only need basic plain-text chunking that ignores the HTML structure entirely, or if your primary goal is simple token-based splitting for language models with no need to preserve document hierarchy.

web-content-analysis document-segmentation information-extraction data-preparation content-curation

Maintenance 10 / 25

Adoption 8 / 25

Maturity 25 / 25

Community 12 / 25


Language: Python

License: MIT

Related tools

gustavoespindola/chunkerizer

Split and analyze text files using langchain and streamlit
