PDF Document Processing RAG Tools

Tools and systems for extracting, parsing, and retrieving information from PDF documents through OCR, layout analysis, and structured data conversion. Does NOT include general chatbots, multi-source document handling beyond PDFs, or chat interfaces built on top of processed PDFs.

65 PDF document processing tools are tracked. Two score above 50 (Established tier). The highest-rated is thiswillbeyourgithub/wdoc at 60/100 with 510 stars.

Get all 65 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=rag&subcategory=pdf-document-processing&limit=20"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
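The same request can be made from a script. The sketch below is a minimal example using only the Python standard library; it rebuilds the URL from the curl command above. The function names `build_url` and `fetch_projects` are hypothetical, the response schema is not documented on this page (so the decoded JSON is returned unchanged), and whether the API accepts a `limit` larger than 20 is an assumption.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="rag", subcategory="pdf-document-processing", limit=20):
    """Build the dataset URL with the same query parameters as the curl command."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(limit=20, timeout=10):
    """Fetch the dataset and decode it as JSON.

    The response structure is not described on this page, so the decoded
    object is returned as-is rather than being unpacked into fields.
    """
    with urlopen(build_url(limit=limit), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_projects()` performs one request against the keyless quota of 100 requests/day, so caching the result locally is a sensible choice for repeated analysis.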

# Tool Score Tier
1 thiswillbeyourgithub/wdoc

Summarize and query from a lot of heterogeneous documents. Any LLM provider,...

60
Established
2 Arterning/DeepParseX

DeepParseX is a powerful multimodal document parsing and knowledge management platform supporting PDF, Word, Excel, PPT, images, video, audio...

51
Established
3 NoEdgeAI/pdfdeal

A python wrapper for the Doc2X API and comes with native texts processing...

48
Emerging
4 laxmimerit/RAGWire

Production-grade RAG toolkit — ingest PDFs, DOCX, XLSX into Qdrant with LLM...

44
Emerging
5 David-Lolly/ViewRAG

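A PDF RAG system for text and images: layout-aware chunking, in-depth understanding of figures and tables, and precise visual source tracing. Multimodal PDF RAG: Features...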

42
Emerging
6 atpuxiner/docsloader

This is a document loader. (Document parsing loader, RAG document parsing, RAG knowledge base construction)

41
Emerging
7 3DCF-Labs/doc2dataset

3DCF / doc2dataset: token-efficient document layer with NumGuard numeric...

40
Emerging
8 preprocess-co/rag-document-viewer

RAG Document Viewer is an open-source library that generates high-fidelity...

37
Emerging
9 zzstoatzz/raggy

scraping and querying documents for LLMs

33
Emerging
10 ManiAm/RAG-Mail

RAG-Mail is a thread-aware email processing system that semantically indexes...

33
Emerging
11 e-kotov/rdocdump

rdocdump: Dump ‘R’ Package Source, Documentation, and Vignettes into One File

32
Emerging
12 salameaz/pdf-process-rag

A Python-based application that extracts and processes PDF content using a...

31
Emerging
13 antoninomariarizzo/rag

A Python library for Retrieval-Augmented Generation (RAG) that extracts text...

30
Emerging
14 MalayAgr/bookacle

bookacle is a RAPTOR-based RAG application to aid in understanding complex...

30
Emerging
15 MohammedNasserAhmed/RAGPost

RAGPost is an intelligent blog post generator that leverages...

30
Emerging
16 AKSHAYINDIA05/Document_Comparison_System

Implement a Retrieval Augmented Generation (RAG) with a user interface for...

29
Experimental
17 natanhp/PythoRAG

PythoRAG is a simple, open-source project designed to facilitate...

28
Experimental
18 iamarunbrahma/rag-ingest

RAG-Ingest: A tool for converting PDFs to markdown and indexing them for...

27
Experimental
19 Besthope-Official/predoc

Document preprocessing service for RAG (Retrieval-Augmented Generation)

27
Experimental
20 ParthSareen/simple-rag

Too many docs? Quickly search over any PDF or Markdown documents

27
Experimental
21 SStephanJX/Snowflake-RAG-System

Production-ready Snowflake RAG system with type-specific chunking

26
Experimental
22 liunian-Jay/MU-GOT

PDF Parsing Tool: GOT's vLLM acceleration implementation, MinerU for layout...

25
Experimental
23 juhaodong/large-file-translator

Extract the content while preserving the layout, images, and tables. Perform...

25
Experimental
24 este6an13/checks-ocr

Software that applies OCR + RAG to extract bank checks information

23
Experimental
25 lolbigtime/Folio

Zero-config Swifty RAG toolkit for iOS & macOS — PDF/text loaders, universal...

22
Experimental
26 salim-lakhal/rag-document-pipeline

Production RAG pipeline: multi-format document extraction → intelligent...

22
Experimental
27 Nexialism-Friday/hwpx-toolkit

HWP/HWPX document processing toolkit — extraction, generation, vectorization...

22
Experimental
28 slvg01/90.10d_RAG_OnTheFly

An app that lets you upload files (ppt, doc, pdf, zip) and run RAG on their content

22
Experimental
29 Vibhuarvind/Content-Engine-RAG-for-PDF

Content Engine is a RAG system that analyzes and compares multiple PDF...

22
Experimental
30 FrostWillmott/FinDocBot

Modern RAG, designed for semantic search and question-answering over...

21
Experimental
31 yotaken/docuggez

Automatic project documentator

21
Experimental
32 JochiRaider/sievio

Sievio turns GitHub, local repos, and web PDFs into clean JSONL for LLM...

21
Experimental
33 JuliaGenAI/DocsScraper.jl

Efficient RAG knowledge pack creator from online Julia documentation

21
Experimental
34 Clearedge-AI/clearedge

Build a RAG preprocessing pipeline

21
Experimental
35 S0lkar/IntGathering-x-RAG--BlazingDocs

RAG-based tool for document batch querying.

19
Experimental
36 silas-rickards/PDF-LLM-RAG

A RAG pipeline specialized for local pdfs.

18
Experimental
37 sfkunal/librarian

Librarian is a RAG-assisted LLM application that allows any user to query...

17
Experimental
38 A-Najjar/rag-factory

Modular RAG system with Factory Pattern - Load PDF/Word docs, configure...

17
Experimental
39 husaynirfan1/PullData

RAG with response in what you need. Output directly with supported format...

17
Experimental
40 solomonjie/rag-processor

RAG indexing pipeline, from raw data cleaning to indexing. Each step communicates via...

14
Experimental
41 alrafiabdullah/doc_rag

Document RAG with HuggingFace Token

14
Experimental
42 yagmur-kurtbas/pdf-rag-pipeline

A RAG pipeline for PDF question answering using LangChain, ChromaDB and Groq...

14
Experimental
43 ahmad-albasha/DataForg

PDF to JSON pipeline with intelligent bilingual chunking (AR/EN) and a fully...

14
Experimental
44 ritheesh-dev/Local-PDF-RAG-System

Privacy-first local PDF RAG system using FAISS + Ollama — fully offline,...

14
Experimental
45 avocatt/ocr-rag-highlighted-viewer

OCR + RAG document viewer with highlighted search results

14
Experimental
46 2dogsandanerd/rag_pdf_audit

Tool to compare pdf extraction methods

14
Experimental
47 fllin1/mawa

RAG workflow (Mistral OCR + Gemini) for complex regulatory PDFs....

14
Experimental
48 julicq/PDF-RAG-Query

RAG model for PDF database

13
Experimental
49 shivkhurana/technical-docs-rag-pipeline

Enterprise-grade RAG (Retrieval Augmented Generation) pipeline using...

13
Experimental
50 will695672804/graphrag-engineering-pdfs

🔍 Extract entities and build knowledge graphs from large engineering PDFs,...

13
Experimental
51 andersborgabiro/RagQueryDocuments

RAG application that makes it easy to search in multiple documents

13
Experimental
52 ashwyan/local-llm-pdf-analyzer

A local AI tool using Ollama (Llama 3) to analyze PDF documents and generate...

13
Experimental
53 Qinnovation123/papers

PDF embedding workflow

13
Experimental
54 adrianizmi/Simple-RAG

Minimalist RAG system built from scratch using Python, local embeddings, and...

13
Experimental
55 mshojaei77/DataSpeakGPT

Read files and images and retrieve data for LLM

13
Experimental
56 zenmakhlouf/arabic-rag-pipeline

A single-file RAG pipeline for Arabic PDF lectures with two-stage retrieval,...

13
Experimental
57 nkarast/ask-my-pdf

A RAG application using local LLM to answer questions given a PDF.

12
Experimental
58 siddharth-nandagopal/billionaires-rag-query

Billionaires RAG Query uses LLMs and a RAG framework to analyze the world's...

11
Experimental
59 bazilicum/pdf-query

This project processes and retrieves information from PDF file or PDF...

11
Experimental
60 zhangshi0512/DevTools

A lightweight Python-based Software Package for daily use

11
Experimental
61 AlinaBaber/Document-Analysis-Identification-with-RAG-Vector-Database-and-Mistral-LLM

This Document Analysis pipeline is a comprehensive document analysis system,...

11
Experimental
62 pvmodayil/ragyphi

An entire RAG (Retrieval-Augmented Generation) pipeline library designed to...

11
Experimental
63 swax10/anaya

Anaya is a Content Engine that specializes in analyzing and comparing...

11
Experimental
64 SuchitG04/multi_doc_rag

RAG application to query multiple docs. Built to query 10K reports of companies.

10
Experimental
65 ITSAIDI/RAGify

RAGify is a Retrieval-Augmented Generation (RAG) application designed to...

10
Experimental