Unstructured-IO/unstructured

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

79 / 100 (Verified)

This tool helps anyone working with large language models to quickly convert complex documents like PDFs, HTML, or Word files into a clean, structured format. You feed it a variety of unstructured documents, and it gives you organized, usable data that your language models can easily understand and process. It's designed for data scientists, machine learning engineers, and researchers who need to prepare diverse document types for AI applications.

14,211 stars. Used by 36 other packages. Actively maintained with 23 commits in the last 30 days. Available on PyPI.

Use this if you need to transform a wide range of messy documents into clean, structured data for training or interacting with large language models.

Not ideal if your primary goal is basic text extraction without the need for sophisticated pre-processing or structuring for AI applications.

document-processing data-preparation natural-language-processing machine-learning-engineering AI-data-pipelines
Maintenance 20 / 25
Adoption 15 / 25
Maturity 25 / 25
Community 19 / 25

How are scores calculated?
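
The method is not described in this listing, but the four category scores above add up to the overall score (20 + 15 + 25 + 19 = 79). Here is a minimal Python sketch of that arithmetic. It assumes the overall score is a plain sum of the categories, which this page does not confirm. The `overall` helper is hypothetical and exists only for illustration.

```python
# Category scores as listed above: (points earned, points available).
scores = {
    "Maintenance": (20, 25),
    "Adoption": (15, 25),
    "Maturity": (25, 25),
    "Community": (19, 25),
}


def overall(category_scores):
    """Sum earned and available points across categories.

    Assumption: the overall score is an unweighted sum; the real
    scoring method may differ.
    """
    earned = sum(e for e, _ in category_scores.values())
    available = sum(a for _, a in category_scores.values())
    return earned, available


earned, available = overall(scores)
print(f"{earned} / {available}")  # prints "79 / 100"
```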

Stars: 14,211
Forks: 1,194
Language: HTML
License: Apache-2.0
Last pushed: Mar 04, 2026
Commits (30d): 23
Dependencies: 23
Reverse dependents: 36

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/data-engineering/Unstructured-IO/unstructured"

Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
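
The same request can be made from Python with only the standard library. The sketch below rests on two assumptions that this page does not document: that the URL path follows a category/owner/repository pattern (inferred from the single example above), and that the response body is JSON. The names `quality_url` and `fetch_quality` are hypothetical helpers, not part of any published client.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL for one repository.

    Assumption: the path is <category>/<owner>/<repo>, inferred from
    the one example URL shown on this page.
    """
    parts = (category, owner, repo)
    return BASE_URL + "/" + "/".join(quote(p, safe="") for p in parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the data for one repository.

    Assumption: the response body is JSON; the schema is not
    documented here, so the decoded value is returned as-is.
    """
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Unauthenticated call; counts against the 100 requests/day limit.
    print(fetch_quality("data-engineering", "Unstructured-IO", "unstructured"))
```

Keeping URL construction separate from the network call makes the path logic easy to check without spending any of the daily request quota.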