File Content Extraction RAG Tools

Tools for extracting text, metadata, and structured data from various file formats (PDF, Office docs, images, web pages, audio). This category does NOT include chunking strategies, vector storage, or post-extraction processing pipelines.

63 file content extraction tools are tracked. Two score above 70 (Verified tier): kreuzberg-dev/kreuzberg, listed first at 79/100 with 6,689 stars, and PaddlePaddle/PaddleOCR, also at 79/100 with 72,167 stars. 3 of the top 10 are actively maintained.

Get all 63 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=rag&subcategory=file-content-extraction&limit=20"

Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
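
The same request can be made from a script. The sketch below is a minimal Python version that uses only the standard library. It reuses the URL and the three query parameters from the curl command; the helper names `build_url` and `fetch_projects` are made up for illustration. The shape of the JSON response is not described on this page, so the decoded value is returned untouched. Whether a `limit` above 20 returns all 63 projects is also not stated here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="rag", subcategory="file-content-extraction", limit=20):
    """Build the request URL with the same query parameters as the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(limit=20, timeout=10):
    """Fetch the dataset and decode the JSON body.

    The response schema is an unknown here, so no fields are assumed:
    the caller gets whatever json.load() produces.
    """
    with urlopen(build_url(limit=limit), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_projects()` performs one request, which counts against the daily quota described above.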

#	Tool	Score	Tier	Stars	Language
1	kreuzberg-dev/kreuzberg A polyglot document intelligence framework with a Rust core. Extract text,...	79	Verified	6,689	Rust
2	PaddlePaddle/PaddleOCR Turn any PDF or image document into structured data for your AI. A powerful,...	79	Verified	72,167	Python
3	yfedoseev/pdf_oxide The fastest PDF library for Python and Rust. Text extraction, image...	67	Established	421	Rust
4	opendataloader-project/opendataloader-pdf PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.	65	Established	1,958	Java
5	AKSarav/pdfstract PDFStract - The Extraction and Chunking Layer in Your RAG Pipeline -...	58	Established	128	Python
6	NanoNets/docext An on-premises, OCR-free unstructured data extraction, markdown conversion...	56	Established	1,871	Python
7	explosion/spacy-layout 📚 Process PDFs, Word documents and more with spaCy	51	Established	869	Python
8	docling-project/docling-java A Java API for Docling	51	Established	87	Java
9	velocitybolt/open-extract Structured Data Extractor for AI Agents. Search your documents or the web...	47	Emerging	185	Python
10	quarkiverse/quarkus-docling Docling simplifies document processing, parsing diverse formats — including...	47	Emerging	17	Java
11	drmingler/smart-llm-loader smart-llm-loader is a lightweight yet powerful Python package that...	46	Emerging	75	Python
12	y3ex/ragtable-extract Extract tables precisely from PDFs and convert them to clean HTML for RAG...	44	Emerging	1	HTML
13	lazyFrogLOL/llmdocparser A package for parsing PDFs and analyzing their content using LLMs.	43	Emerging	269	Python
14	anyparser/anyparser_core Anyparser Python SDK for RAG/ETL Pipelines - File Content Extraction....	42	Emerging	2	Python
15	loryanstrant/unifi-documenter Auto generation of UniFi network documentation	41	Emerging	2	Python
16	beenguelllayounes/ragtable-extract Extract tables precisely from PDFs and convert them to clean HTML for RAG...	36	Emerging	1	HTML
17	novatechflow/docai Local-first OCR → Markdown → RAG toolkit with optional Hugging Face/custom...	35	Emerging	1	Python
18	zhangyu1818/apple-docs-for-rag Apple Documentation Markdown For RAG	32	Emerging	41	CoffeeScript
19	ParthaPRay/Docling_Colab This repo contains google colab notebook for handing Docling for data...	31	Emerging	4	Jupyter Notebook
20	Blacksuan19/structx Type-safe structured data extraction from text using LLMs.	31	Emerging	10	Python
21	Huang-lab/figure-extractor Flask-based service using PDFFigures 2.0 to extract figures and tables from...	30	Emerging	15	Python
22	mylxsw/extractor extractor is an HTTP service used to convert PDF, Markdown, HTML, Docx,...	30	Emerging	6	Python
23	DS4SD/quackling Build document-native LLM applications	29	Experimental	56	Python
24	Anecha9610/document-parser-ai 📄 Simplify data extraction from PDFs and documents using AI APIs for...	28	Experimental	3	Python
25	anyparser/anyparser_langchain Integrate Anyparser's powerful content extraction capabilities with...	28	Experimental	3	Python
26	risshe92/docprobe Universal documentation extraction tool	27	Experimental	6	Python
27	anyparser/anyparser_crewai Supercharge your AI workflows by combining Anyparser’s advanced content...	27	Experimental	2	Python
28	msbayindir/rag-chunker PDF → Mistral OCR → deterministic AST chunker with Anthropic contextual...	25	Experimental	1	TypeScript
29	R0mb0/DocScraper_GUI Automate your OSINT and document research. This desktop app searches the web...	23	Experimental	1	Python
30	thomassuedbroecker/docling_preprocessor_factory_public Docling Preprocessor Factory is an open-source project that provides a...	23	Experimental	2	Python
31	amirkiarafiei/docling-processor A Docling extension for superior PDF/DOCX to Markdown conversion, featuring...	23	Experimental	2	Python
32	KoDiit/llm-cerebroscope 🕵️ Analyze forensic data with LLM-CerebroScope, a powerful AI-driven engine...	22	Experimental	—	Python
33	syw2014/langparse LangParse is a universal document parsing and text chunking engine for LLM...	22	Experimental	4	Python
34	ZhuJiaxin2/ragtable-extract PDF table extraction for RAG — convert to clean HTML. Fast, local, no GPU.	22	Experimental	1	HTML
35	tarrantwrong366/OCR-Document-parser 📝 Streamline document analysis by extracting key fields from PAN cards,...	21	Experimental	—	Python
36	rlozanointel/Vromlix-AI-Engine Cognitive ETL Engine & Architecture for Personal Knowledge Graphs....	21	Experimental	—	Python
37	tbast24/docling_preprocessor_factory_public Provide a local preprocessing pipeline to extract and standardize...	21	Experimental	—	Python
38	jtgsystems/OCR-TOOL-REALTIME 📝 Real-time OCR tool - Extract text from images and videos with live processing	21	Experimental	—	Python
39	sitemap-ai/backend SitemapRAG is an open-source tool designed to leverage your website's...	20	Experimental	8	Python
40	qbxlvnf11/ocr-document-parser-for-rag OCR Document Markdown/HTML Parser for RAG	19	Experimental	—	Python
41	anyparser/anyparserjs Anyparser Typescript SDK for RAG/ETL Pipelines - File Content Extraction....	19	Experimental	3	TypeScript
42	segunalabi383/Data-Extractor Structured Data Extractor for AI Agents. Search your documents or the web...	19	Experimental	—	Python
43	AlwaysSany/doc-extract-parse-index The project is designed to streamline the workflow of extracting, parsing,...	18	Experimental	1	JavaScript
44	elchemista/doc_dig DocDig is an Elixir wrapper around the Rust-based extractous library,...	18	Experimental	1	Elixir
45	xaman27x/Adobe-PDF-CTD A high-performance, multi-stage document processor with two interconnected...	16	Experimental	1	Python
46	rangga276/ocr-llm-agent 🖼️ Extract and process text from images with an OCR AI agent, featuring...	16	Experimental	1	Python
47	muradali4442/thesis_extractor Use text + tables from PDFs for RAG (BM25 + LLM).	16	Experimental	1	Python
48	the-ai-entrepreneur-ai-hub/pdf-parser-api PDF Parser API - Extract text, metadata & page data from PDF files via HTTP API	14	Experimental	—	JavaScript
49	MuntahaShams/Document_AI_for_Custom_Data_Extraction Automated extraction of structured information from semi-structured...	14	Experimental	—	Jupyter Notebook
50	mkai80/DocMeld Transform documents into structured, agent-ready knowledge efficiently with...	14	Experimental	—	Python
51	qlfv/Docling-Testing Repository for testing and demonstrating the capabilities of Docling for...	14	Experimental	—	HTML
52	kreuzberg-dev/.github Kreuzberg is a fast, polyglot document intelligence engine with a Rust core....	14	Experimental	1	—
53	r00ters/tika-plus-docker Docker image to build Apache Tika Full + JPEG2000 + JBIG2	13	Experimental	—	Dockerfile
54	kevv1m/tikara The metadata and text content extractor for almost every file type.	13	Experimental	—	—
55	johnzfitch/human-interface-markdown Apple Human Interface Guidelines archive (1980-2014) - 35 documents...	13	Experimental	—	—
56	AhmedZeyadTareq/Llama-Parse-Content-Extraction extract and analyze content from various file formats including PDFs, text...	13	Experimental	—	Python
57	DGloi/utillity-files-to-text Creates an endpoint to extract text content, images and document from...	13	Experimental	—	Python
58	jeehoonyu/PDF_Seperator A lightweight tool for splitting PDF documents into chapters, optimized for...	13	Experimental	—	Python
59	shijincai/fast360 The industry's first "Open Source OCR Arena," a free, no-login utility for...	12	Experimental	3	—
60	anyparser/anyparser_llamaindex Instantly access Anyparser's robust document processing and data extraction...	12	Experimental	1	Python
61	rajsinghparihar/data-detective An app that leverages LLMs to process documents, extract relevant...	11	Experimental	—	Python
62	tirandagan/PDF_unstructured This project consists of three main applications that work together to...	11	Experimental	—	Python
63	priyangshu-datta/jcdl2024 exData: Tool for extracting Datasets from research articles.	10	Experimental	2	Python
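
The Tier column appears to follow the Score column. As a rough illustration, the sketch below maps a score to a tier name. Only the "above 70" rule for Verified is stated on this page. The 50 and 30 cutoffs are assumptions inferred from the rows of the table (51 is Established and 47 is Emerging; 30 is Emerging and 29 is Experimental), and the function name `tier_for` is hypothetical.

```python
def tier_for(score: int) -> str:
    """Map a 0-100 quality score to a tier name.

    Verified = above 70 is stated on the page; the other two cutoffs
    are guesses that happen to fit every row in the table.
    """
    if score > 70:
        return "Verified"
    if score >= 50:  # assumed cutoff
        return "Established"
    if score >= 30:  # assumed cutoff
        return "Emerging"
    return "Experimental"


# A few (score, tier) pairs copied from the table.
SAMPLE_ROWS = [
    (79, "Verified"),
    (67, "Established"),
    (51, "Established"),
    (47, "Emerging"),
    (30, "Emerging"),
    (29, "Experimental"),
    (10, "Experimental"),
]

# Rows where the assumed cutoffs disagree with the table.
mismatches = [(score, tier) for score, tier in SAMPLE_ROWS if tier_for(score) != tier]
```

Scores that do not appear in the table, such as 48 to 50 or exactly 70, may fall on either side of the real boundaries.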

Comparisons in this category

PaddleOCR and opendataloader-pdf (79 vs 65)
kreuzberg and pdf_oxide (79 vs 67)