kreuzberg-dev/kreuzberg
A polyglot document intelligence framework with a Rust core. Extract text, metadata, and structured information from PDFs, Office documents, images, and 76+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use it via CLI, REST API, or MCP server.
This tool helps you quickly and accurately extract information from a wide range of documents and code files. It takes in various file types like PDFs, Office documents, images, and programming files, and outputs structured text, metadata, and even detailed code elements like functions and classes. This is ideal for developers who need to process large volumes of diverse documents or code for tasks like building search engines, RAG pipelines, or document analysis systems.
6,689 stars. Used by 6 other packages. Actively maintained with 731 commits in the last 30 days. Available on PyPI.
Use this if you need to reliably extract content and structure from nearly any document or programming file for automated processing or analysis.
Not ideal if you only need basic text extraction from a single, consistent document type and don't require advanced metadata or code intelligence.
Stars: 6,689
Forks: 316
Language: Rust
License: MIT
Category
Last pushed: Mar 12, 2026
Commits (30d): 731
Reverse dependents: 6
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/kreuzberg-dev/kreuzberg"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
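The same request can be made from a script. The following is a minimal Python sketch using only the standard library, not an official client. The helper names build_quality_url and fetch_quality are hypothetical, as is the reading of the path as quality/<category>/<owner>/<repo> (inferred from the curl URL above). The response is assumed to be JSON; its fields are not documented here, so the sketch does not rely on any of them.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example; everything after it is assumed
# to follow the pattern <category>/<owner>/<repo>.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(owner: str, repo: str, category: str = "rag") -> str:
    """Build the endpoint URL, percent-encoding each path segment."""
    parts = [quote(p, safe="") for p in (category, owner, repo)]
    return f"{BASE_URL}/{'/'.join(parts)}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0) -> object:
    """Fetch the record and decode it as JSON (response format is assumed)."""
    url = build_quality_url(owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Prints the URL only; call fetch_quality() to perform the request.
    print(build_quality_url("kreuzberg-dev", "kreuzberg"))
```

Keeping URL construction separate from the network call makes it easy to check the request path without spending any of the 100 keyless requests per day.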
Related tools
PaddlePaddle/PaddleOCR
Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR...
yfedoseev/pdf_oxide
The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown...
opendataloader-project/opendataloader-pdf
PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.
AKSarav/pdfstract
PDFStract - The Extraction and Chunking Layer in Your RAG Pipeline - Available as CLI - WEBUI - API
NanoNets/docext
An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking...