Unstructured-IO/unstructured
Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.
This tool helps anyone working with large language models to quickly convert complex documents like PDFs, HTML, or Word files into a clean, structured format. You feed it a variety of unstructured documents, and it gives you organized, usable data that your language models can easily understand and process. It's designed for data scientists, machine learning engineers, and researchers who need to prepare diverse document types for AI applications.
14,211 stars. Used by 36 other packages. Actively maintained with 23 commits in the last 30 days. Available on PyPI.
Use this if you need to transform a wide range of messy documents into clean, structured data for training or interacting with large language models.
Not ideal if your primary goal is basic text extraction without the need for sophisticated pre-processing or structuring for AI applications.
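As a rough illustration of the documents-in, structured-elements-out flow described above, the sketch below uses the library's `partition` function from `unstructured.partition.auto` and then tallies the returned elements by type. The helper names `partition_file` and `count_by_category` are hypothetical, introduced only for this example, and the import is deferred so the snippet still loads when `unstructured` is not installed.

```python
from collections import Counter
from typing import Iterable


def count_by_category(elements: Iterable) -> Counter:
    """Tally elements by their `category` attribute (for example "Title").

    Objects without a `category` attribute are counted under "Unknown".
    """
    return Counter(getattr(el, "category", "Unknown") for el in elements)


def partition_file(path: str) -> list:
    """Partition one document into a list of elements (hypothetical wrapper).

    The import is done lazily; it requires `pip install unstructured`,
    plus format-specific extras for some file types.
    """
    from unstructured.partition.auto import partition

    return partition(filename=path)


# Example call (the path is a placeholder, not a file from the project):
#   elements = partition_file("example.pdf")
#   print(count_by_category(elements))
```

Counting by category is just one simple way to inspect the output; which element types appear depends on the input document and the installed extras.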
Stars: 14,211
Forks: 1,194
Language: HTML
License: Apache-2.0
Category: (not shown)
Last pushed: Mar 04, 2026
Commits (30d): 23
Dependencies: 23
Reverse dependents: 36
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/data-engineering/Unstructured-IO/unstructured"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
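The same request can be made from Python with only the standard library. This is a minimal sketch: the path layout (category, then owner, then repository) is inferred from the single URL above, the function names are hypothetical, and treating the response body as JSON is an assumption, since the page does not document the response format.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL; the category/owner/repo layout is inferred."""
    parts = (category, owner, repo)
    return BASE_URL + "/" + "/".join(quote(p, safe="") for p in parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record and decode it.

    Assumption: the endpoint returns a JSON body. Anonymous access is
    limited to 100 requests/day according to the page.
    """
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example call (performs a network request):
#   data = fetch_quality("data-engineering", "Unstructured-IO", "unstructured")
```

How a key is passed for the higher rate limit is not stated on the page, so the sketch covers anonymous access only.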
Related tools
ThePagePage/docschema
Document schema extraction framework for regulated industries. Parse complex documents into...
amikrsin/StatementSync-Lite
StatementSync is a lightweight, high-performance Progressive Web App (PWA) designed to solve the...
obieg-zero/plugin-wibor-docs
OCR, data extraction from contracts, Q&A about the contract