File Content Extraction RAG Tools
Tools for extracting text, metadata, and structured data from various file formats (PDF, Office docs, images, web pages, audio). Does NOT include chunking strategies, vector storage, or post-extraction processing pipelines.
63 file content extraction tools are tracked. Two score above 70 (the Verified tier). The highest-rated is kreuzberg-dev/kreuzberg at 79/100 with 6,689 stars. Three of the top 10 are actively maintained.
Get all 63 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=rag&subcategory=file-content-extraction&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
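The same request can be made from a script. The following is a minimal Python sketch using only the standard library; it assumes the endpoint returns a JSON body (the response schema is not documented here, so the code does not rely on any particular field names). The helper names `build_url` and `fetch_projects` are hypothetical, and whether a larger `limit` returns all 63 projects in one call is an assumption to verify against the API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="rag", subcategory="file-content-extraction", limit=20):
    """Build the dataset URL with the same query parameters as the curl call."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(limit=20, timeout=30):
    """Fetch and decode the response. Assumes the body is JSON; its shape is not assumed."""
    with urlopen(build_url(limit=limit), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_projects()` issues one request against the keyless quota of 100 requests/day, so it is worth caching the result locally rather than re-fetching on every run.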
| # | Tool | Score | Tier |
|---|---|---|---|
| 1 | kreuzberg-dev/kreuzberg: A polyglot document intelligence framework with a Rust core. Extract text,... | | Verified |
| 2 | PaddlePaddle/PaddleOCR: Turn any PDF or image document into structured data for your AI. A powerful,... | | Verified |
| 3 | yfedoseev/pdf_oxide: The fastest PDF library for Python and Rust. Text extraction, image... | | Established |
| 4 | opendataloader-project/opendataloader-pdf: PDF Parser for AI-ready data. Automate PDF accessibility. Open-source. | | Established |
| 5 | AKSarav/pdfstract: PDFStract - The Extraction and Chunking Layer in Your RAG Pipeline -... | | Established |
| 6 | NanoNets/docext: An on-premises, OCR-free unstructured data extraction, markdown conversion... | | Established |
| 7 | explosion/spacy-layout: Process PDFs, Word documents and more with spaCy | | Established |
| 8 | docling-project/docling-java: A Java API for Docling | | Established |
| 9 | velocitybolt/open-extract: Structured Data Extractor for AI Agents. Search your documents or the web... | | Emerging |
| 10 | quarkiverse/quarkus-docling: Docling simplifies document processing, parsing diverse formats - including... | | Emerging |
| 11 | drmingler/smart-llm-loader: smart-llm-loader is a lightweight yet powerful Python package that... | | Emerging |
| 12 | y3ex/ragtable-extract: Extract tables precisely from PDFs and convert them to clean HTML for RAG... | | Emerging |
| 13 | lazyFrogLOL/llmdocparser: A package for parsing PDFs and analyzing their content using LLMs. | | Emerging |
| 14 | anyparser/anyparser_core: Anyparser Python SDK for RAG/ETL Pipelines - File Content Extraction.... | | Emerging |
| 15 | loryanstrant/unifi-documenter: Auto generation of UniFi network documentation | | Emerging |
| 16 | beenguelllayounes/ragtable-extract: Extract tables precisely from PDFs and convert them to clean HTML for RAG... | | Emerging |
| 17 | novatechflow/docai: Local-first OCR → Markdown → RAG toolkit with optional Hugging Face/custom... | | Emerging |
| 18 | zhangyu1818/apple-docs-for-rag: Apple Documentation Markdown For RAG | | Emerging |
| 19 | ParthaPRay/Docling_Colab: This repo contains google colab notebook for handing Docling for data... | | Emerging |
| 20 | Blacksuan19/structx: Type-safe structured data extraction from text using LLMs. | | Emerging |
| 21 | Huang-lab/figure-extractor: Flask-based service using PDFFigures 2.0 to extract figures and tables from... | | Emerging |
| 22 | mylxsw/extractor: extractor is an HTTP service used to convert PDF, Markdown, HTML, Docx,... | | Emerging |
| 23 | DS4SD/quackling: Build document-native LLM applications | | Experimental |
| 24 | Anecha9610/document-parser-ai: Simplify data extraction from PDFs and documents using AI APIs for... | | Experimental |
| 25 | anyparser/anyparser_langchain: Integrate Anyparser's powerful content extraction capabilities with... | | Experimental |
| 26 | risshe92/docprobe: Universal documentation extraction tool | | Experimental |
| 27 | anyparser/anyparser_crewai: Supercharge your AI workflows by combining Anyparser's advanced content... | | Experimental |
| 28 | msbayindir/rag-chunker: PDF → Mistral OCR → deterministic AST chunker with Anthropic contextual... | | Experimental |
| 29 | R0mb0/DocScraper_GUI: Automate your OSINT and document research. This desktop app searches the web... | | Experimental |
| 30 | thomassuedbroecker/docling_preprocessor_factory_public: Docling Preprocessor Factory is an open-source project that provides a... | | Experimental |
| 31 | amirkiarafiei/docling-processor: A Docling extension for superior PDF/DOCX to Markdown conversion, featuring... | | Experimental |
| 32 | KoDiit/llm-cerebroscope: Analyze forensic data with LLM-CerebroScope, a powerful AI-driven engine... | | Experimental |
| 33 | syw2014/langparse: LangParse is a universal document parsing and text chunking engine for LLM... | | Experimental |
| 34 | ZhuJiaxin2/ragtable-extract: PDF table extraction for RAG - convert to clean HTML. Fast, local, no GPU. | | Experimental |
| 35 | tarrantwrong366/OCR-Document-parser: Streamline document analysis by extracting key fields from PAN cards,... | | Experimental |
| 36 | rlozanointel/Vromlix-AI-Engine: Cognitive ETL Engine & Architecture for Personal Knowledge Graphs.... | | Experimental |
| 37 | tbast24/docling_preprocessor_factory_public: Provide a local preprocessing pipeline to extract and standardize... | | Experimental |
| 38 | jtgsystems/OCR-TOOL-REALTIME: Real-time OCR tool - Extract text from images and videos with live processing | | Experimental |
| 39 | sitemap-ai/backend: SitemapRAG is an open-source tool designed to leverage your website's... | | Experimental |
| 40 | qbxlvnf11/ocr-document-parser-for-rag: OCR Document Markdown/HTML Parser for RAG | | Experimental |
| 41 | anyparser/anyparserjs: Anyparser Typescript SDK for RAG/ETL Pipelines - File Content Extraction.... | | Experimental |
| 42 | segunalabi383/Data-Extractor: Structured Data Extractor for AI Agents. Search your documents or the web... | | Experimental |
| 43 | AlwaysSany/doc-extract-parse-index: The project is designed to streamline the workflow of extracting, parsing,... | | Experimental |
| 44 | elchemista/doc_dig: DocDig is an Elixir wrapper around the Rust-based extractous library,... | | Experimental |
| 45 | xaman27x/Adobe-PDF-CTD: A high-performance, multi-stage document processor with two interconnected... | | Experimental |
| 46 | rangga276/ocr-llm-agent: Extract and process text from images with an OCR AI agent, featuring... | | Experimental |
| 47 | muradali4442/thesis_extractor: Use text + tables from PDFs for RAG (BM25 + LLM). | | Experimental |
| 48 | the-ai-entrepreneur-ai-hub/pdf-parser-api: PDF Parser API - Extract text, metadata & page data from PDF files via HTTP API | | Experimental |
| 49 | MuntahaShams/Document_AI_for_Custom_Data_Extraction: Automated extraction of structured information from semi-structured... | | Experimental |
| 50 | mkai80/DocMeld: Transform documents into structured, agent-ready knowledge efficiently with... | | Experimental |
| 51 | qlfv/Docling-Testing: Repository for testing and demonstrating the capabilities of Docling for... | | Experimental |
| 52 | kreuzberg-dev/.github: Kreuzberg is a fast, polyglot document intelligence engine with a Rust core.... | | Experimental |
| 53 | r00ters/tika-plus-docker: Docker image to build Apache Tika Full + JPEG2000 + JBIG2 | | Experimental |
| 54 | kevv1m/tikara: The metadata and text content extractor for almost every file type. | | Experimental |
| 55 | johnzfitch/human-interface-markdown: Apple Human Interface Guidelines archive (1980-2014) - 35 documents... | | Experimental |
| 56 | AhmedZeyadTareq/Llama-Parse-Content-Extraction: extract and analyze content from various file formats including PDFs, text... | | Experimental |
| 57 | DGloi/utillity-files-to-text: Creates an endpoint to extract text content, images and document from... | | Experimental |
| 58 | jeehoonyu/PDF_Seperator: A lightweight tool for splitting PDF documents into chapters, optimized for... | | Experimental |
| 59 | shijincai/fast360: The industry's first "Open Source OCR Arena," a free, no-login utility for... | | Experimental |
| 60 | anyparser/anyparser_llamaindex: Instantly access Anyparser's robust document processing and data extraction... | | Experimental |
| 61 | rajsinghparihar/data-detective: An app that leverages LLMs to process documents, extract relevant... | | Experimental |
| 62 | tirandagan/PDF_unstructured: This project consists of three main applications that work together to... | | Experimental |
| 63 | priyangshu-datta/jcdl2024: exData: Tool for extracting Datasets from research articles. | | Experimental |