File Content Extraction RAG Tools

Tools for extracting text, metadata, and structured data from various file formats (PDF, Office docs, images, web pages, audio). Does NOT include chunking strategies, vector storage, or post-extraction processing pipelines.

There are 63 file content extraction tools tracked. 2 score above 70 (verified tier). The highest-rated is kreuzberg-dev/kreuzberg at 79/100 with 6,689 stars. 3 of the top 10 are actively maintained.

Get all 63 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=rag&subcategory=file-content-extraction&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
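The same request can be made from a script. The following is a minimal sketch in Python using only the standard library; the endpoint and the `domain`, `subcategory`, and `limit` query parameters are copied from the curl command above, while the function names `build_url` and `fetch_projects` are hypothetical helpers introduced here for illustration. The shape of the JSON response is not documented on this page, so the sketch returns the decoded payload without assuming any field names. Note that the example URL carries `limit=20`; whether a larger limit is accepted in order to return all 63 projects in one call is not stated here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken verbatim from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="rag", subcategory="file-content-extraction", limit=20):
    """Build the request URL (hypothetical helper; parameters mirror the curl example)."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(limit=20, timeout=30):
    """Fetch and decode the JSON payload.

    The response schema is an unknown here, so the decoded object is
    returned as-is. Keyless access is limited to 100 requests/day.
    """
    with urlopen(build_url(limit=limit), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Print the URL only; call fetch_projects() to perform the actual request.
    print(build_url())
```

Calling `fetch_projects()` performs one request against the daily quota, so caching the result locally is a reasonable choice when experimenting.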

# Tool Score Tier
1 kreuzberg-dev/kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text,...

79
Verified
2 PaddlePaddle/PaddleOCR

Turn any PDF or image document into structured data for your AI. A powerful,...

79
Verified
3 yfedoseev/pdf_oxide

The fastest PDF library for Python and Rust. Text extraction, image...

67
Established
4 opendataloader-project/opendataloader-pdf

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

65
Established
5 AKSarav/pdfstract

PDFStract - The Extraction and Chunking Layer in Your RAG Pipeline -...

58
Established
6 NanoNets/docext

An on-premises, OCR-free unstructured data extraction, markdown conversion...

56
Established
7 explosion/spacy-layout

📚 Process PDFs, Word documents and more with spaCy

51
Established
8 docling-project/docling-java

A Java API for Docling

51
Established
9 velocitybolt/open-extract

Structured Data Extractor for AI Agents. Search your documents or the web...

47
Emerging
10 quarkiverse/quarkus-docling

Docling simplifies document processing, parsing diverse formats - including...

47
Emerging
11 drmingler/smart-llm-loader

smart-llm-loader is a lightweight yet powerful Python package that...

46
Emerging
12 y3ex/ragtable-extract

Extract tables precisely from PDFs and convert them to clean HTML for RAG...

44
Emerging
13 lazyFrogLOL/llmdocparser

A package for parsing PDFs and analyzing their content using LLMs.

43
Emerging
14 anyparser/anyparser_core

Anyparser Python SDK for RAG/ETL Pipelines - File Content Extraction....

42
Emerging
15 loryanstrant/unifi-documenter

Auto generation of UniFi network documentation

41
Emerging
16 beenguelllayounes/ragtable-extract

Extract tables precisely from PDFs and convert them to clean HTML for RAG...

36
Emerging
17 novatechflow/docai

Local-first OCR → Markdown → RAG toolkit with optional Hugging Face/custom...

35
Emerging
18 zhangyu1818/apple-docs-for-rag

Apple Documentation Markdown For RAG

32
Emerging
19 ParthaPRay/Docling_Colab

This repo contains a Google Colab notebook for handling Docling for data...

31
Emerging
20 Blacksuan19/structx

Type-safe structured data extraction from text using LLMs.

31
Emerging
21 Huang-lab/figure-extractor

Flask-based service using PDFFigures 2.0 to extract figures and tables from...

30
Emerging
22 mylxsw/extractor

extractor is an HTTP service used to convert PDF, Markdown, HTML, Docx,...

30
Emerging
23 DS4SD/quackling

Build document-native LLM applications

29
Experimental
24 Anecha9610/document-parser-ai

📄 Simplify data extraction from PDFs and documents using AI APIs for...

28
Experimental
25 anyparser/anyparser_langchain

Integrate Anyparser's powerful content extraction capabilities with...

28
Experimental
26 risshe92/docprobe

Universal documentation extraction tool

27
Experimental
27 anyparser/anyparser_crewai

Supercharge your AI workflows by combining Anyparser's advanced content...

27
Experimental
28 msbayindir/rag-chunker

PDF → Mistral OCR → deterministic AST chunker with Anthropic contextual...

25
Experimental
29 R0mb0/DocScraper_GUI

Automate your OSINT and document research. This desktop app searches the web...

23
Experimental
30 thomassuedbroecker/docling_preprocessor_factory_public

Docling Preprocessor Factory is an open-source project that provides a...

23
Experimental
31 amirkiarafiei/docling-processor

A Docling extension for superior PDF/DOCX to Markdown conversion, featuring...

23
Experimental
32 KoDiit/llm-cerebroscope

πŸ•΅οΈ Analyze forensic data with LLM-CerebroScope, a powerful AI-driven engine...

22
Experimental
33 syw2014/langparse

LangParse is a universal document parsing and text chunking engine for LLM...

22
Experimental
34 ZhuJiaxin2/ragtable-extract

PDF table extraction for RAG - convert to clean HTML. Fast, local, no GPU.

22
Experimental
35 tarrantwrong366/OCR-Document-parser

πŸ“ Streamline document analysis by extracting key fields from PAN cards,...

21
Experimental
36 rlozanointel/Vromlix-AI-Engine

Cognitive ETL Engine & Architecture for Personal Knowledge Graphs....

21
Experimental
37 tbast24/docling_preprocessor_factory_public

Provide a local preprocessing pipeline to extract and standardize...

21
Experimental
38 jtgsystems/OCR-TOOL-REALTIME

πŸ“ Real-time OCR tool - Extract text from images and videos with live processing

21
Experimental
39 sitemap-ai/backend

SitemapRAG is an open-source tool designed to leverage your website's...

20
Experimental
40 qbxlvnf11/ocr-document-parser-for-rag

OCR Document Markdown/HTML Parser for RAG

19
Experimental
41 anyparser/anyparserjs

Anyparser Typescript SDK for RAG/ETL Pipelines - File Content Extraction....

19
Experimental
42 segunalabi383/Data-Extractor

Structured Data Extractor for AI Agents. Search your documents or the web...

19
Experimental
43 AlwaysSany/doc-extract-parse-index

The project is designed to streamline the workflow of extracting, parsing,...

18
Experimental
44 elchemista/doc_dig

DocDig is an Elixir wrapper around the Rust-based extractous library,...

18
Experimental
45 xaman27x/Adobe-PDF-CTD

A high-performance, multi-stage document processor with two interconnected...

16
Experimental
46 rangga276/ocr-llm-agent

πŸ–ΌοΈ Extract and process text from images with an OCR AI agent, featuring...

16
Experimental
47 muradali4442/thesis_extractor

Use text + tables from PDFs for RAG (BM25 + LLM).

16
Experimental
48 the-ai-entrepreneur-ai-hub/pdf-parser-api

PDF Parser API - Extract text, metadata & page data from PDF files via HTTP API

14
Experimental
49 MuntahaShams/Document_AI_for_Custom_Data_Extraction

Automated extraction of structured information from semi-structured...

14
Experimental
50 mkai80/DocMeld

Transform documents into structured, agent-ready knowledge efficiently with...

14
Experimental
51 qlfv/Docling-Testing

Repository for testing and demonstrating the capabilities of Docling for...

14
Experimental
52 kreuzberg-dev/.github

Kreuzberg is a fast, polyglot document intelligence engine with a Rust core....

14
Experimental
53 r00ters/tika-plus-docker

Docker image to build Apache Tika Full + JPEG2000 + JBIG2

13
Experimental
54 kevv1m/tikara

The metadata and text content extractor for almost every file type.

13
Experimental
55 johnzfitch/human-interface-markdown

Apple Human Interface Guidelines archive (1980-2014) - 35 documents...

13
Experimental
56 AhmedZeyadTareq/Llama-Parse-Content-Extraction

extract and analyze content from various file formats including PDFs, text...

13
Experimental
57 DGloi/utillity-files-to-text

Creates an endpoint to extract text content, images and document from...

13
Experimental
58 jeehoonyu/PDF_Seperator

A lightweight tool for splitting PDF documents into chapters, optimized for...

13
Experimental
59 shijincai/fast360

The industry's first "Open Source OCR Arena," a free, no-login utility for...

12
Experimental
60 anyparser/anyparser_llamaindex

Instantly access Anyparser's robust document processing and data extraction...

12
Experimental
61 rajsinghparihar/data-detective

An app that leverages LLMs to process documents, extract relevant...

11
Experimental
62 tirandagan/PDF_unstructured

This project consists of three main applications that work together to...

11
Experimental
63 priyangshu-datta/jcdl2024

exData: Tool for extracting Datasets from research articles.

10
Experimental