kreuzberg and pdf_oxide
These tools compete on document extraction: both pull text and metadata out of PDFs. pdf_oxide focuses primarily on PDFs and targets performance-critical workloads, while kreuzberg emphasizes broad format coverage (76+ formats).
About kreuzberg
kreuzberg-dev/kreuzberg
A polyglot document intelligence framework with a Rust core. Extract text, metadata, and structured information from PDFs, Office documents, images, and more, across 76+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use it via the CLI, REST API, or MCP server.
This tool helps you quickly and accurately extract information from a wide range of documents and code files. It takes in various file types like PDFs, Office documents, images, and programming files, and outputs structured text, metadata, and even detailed code elements like functions and classes. This is ideal for developers who need to process large volumes of diverse documents or code for tasks like building search engines, RAG pipelines, or document analysis systems.
About pdf_oxide
yfedoseev/pdf_oxide
The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.
This tool helps you quickly get information out of PDF documents, convert them to other formats, or even fill out forms. You can feed it individual PDF files or entire batches, and it will give you back the raw text, images, structured data like tables, or converted Markdown/HTML files. It's designed for anyone who needs to process many PDFs efficiently, such as data analysts, researchers, or operations managers.
Scores updated daily from GitHub, PyPI, and npm data.