explosion/spacy-layout
📚 Process PDFs, Word documents and more with spaCy
This project converts unstructured PDFs, Word documents, and similar files into clean, structured data. Given a document, it outputs organized text with sections, headings, and tables identified. It is aimed at anyone who needs to extract specific information from complex documents for further analysis or for use in AI-powered systems.
869 stars. No commits in the last 6 months. Available on PyPI.
Use this if you need to reliably extract text, identify document structure such as headings, and pull data from tables in your PDFs and Word documents, for example for data analysis or for building AI applications.
Not ideal if you only need simple text extraction and have no use for the document's layout or its tabular data.
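A minimal sketch of what that extraction might look like in Python. It assumes the package is installed with `pip install spacy-layout` and that it exposes the `spaCyLayout` class, the `doc.spans["layout"]` span group, and the `doc._.tables` / `span._.data` extensions as described in the project's README; check the README for the current API before relying on it. The helper names `group_spans_by_label` and `extract_layout` are hypothetical, introduced only for this illustration.

```python
from collections import defaultdict


def group_spans_by_label(labelled_spans):
    """Group (label, text) pairs into {label: [text, ...]}, keeping document order."""
    grouped = defaultdict(list)
    for label, text in labelled_spans:
        grouped[label].append(text)
    return dict(grouped)


def extract_layout(path):
    """Parse one document and return (sections grouped by label, table data).

    Assumes spacy-layout's README API. The imports are done lazily so the
    pure helper above can be used without the library installed.
    """
    import spacy
    from spacy_layout import spaCyLayout

    nlp = spacy.blank("en")      # a blank pipeline is enough for layout parsing
    layout = spaCyLayout(nlp)
    doc = layout(path)           # path to a PDF, Word document, or similar file

    # Each layout span carries a label such as a heading or body-text type.
    spans = [(span.label_, span.text) for span in doc.spans["layout"]]
    # Assumption: each table span exposes its contents via `span._.data`.
    tables = [table._.data for table in doc._.tables]
    return group_spans_by_label(spans), tables
```

Calling `extract_layout("report.pdf")` (with a file path of your own) would return the document's text grouped by layout label, plus one entry per detected table.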
Stars: 869
Forks: 61
Language: Python
License: MIT
Category:
Last pushed: Mar 08, 2025
Commits (30d): 0
Dependencies: 4
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/explosion/spacy-layout"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
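The same request can be sketched in Python using only the standard library. The endpoint path is taken from the curl command above; that the response body is JSON is an assumption, as are the helper names `build_url` and `fetch_quality`. How an API key is passed is not stated here, so the sketch covers only the keyless tier.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Endpoint prefix copied from the curl example.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/rag"


def build_url(owner, repo):
    """Build the endpoint URL for one repository, escaping both path parts."""
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner, repo, timeout=10):
    """Fetch the data for owner/repo and decode it.

    Assumption: the endpoint returns a JSON document; its fields are not
    documented on this page, so the result is returned unmodified.
    """
    with urlopen(build_url(owner, repo), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    print(fetch_quality("explosion", "spacy-layout"))
```

With the keyless limit of 100 requests/day, it is worth caching the result locally rather than calling `fetch_quality` on every run.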
Related tools
kreuzberg-dev/kreuzberg
A polyglot document intelligence framework with a Rust core. Extract text, metadata, and...
PaddlePaddle/PaddleOCR
Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR...
yfedoseev/pdf_oxide
The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown...
opendataloader-project/opendataloader-pdf
PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.
AKSarav/pdfstract
PDFStract - The Extraction and Chunking Layer in Your RAG Pipeline - Available as CLI - WEBUI - API