QuivrHQ/MegaParse

File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.

/ 100

Emerging

This tool converts complex documents such as PDFs, Word files, and PowerPoint decks into clean text that large language models (LLMs) can readily consume. It aims to preserve all critical content, including tables and images, so the output is ready for AI analysis or querying. It suits anyone building applications that use AI to read and interpret business documents, research papers, or reports.

7,347 stars. No commits in the last 6 months.

Use this if you need to reliably extract all content from diverse document types (PDFs, Word, PowerPoint) for use with large language models, without losing any critical information like tables or image context.

Not ideal if you only need simple text extraction and have no need to preserve complex formatting or tables, or to integrate with advanced AI models.

document-processing AI-application-development information-extraction knowledge-management

Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 18 / 25


Stars: 7,347

Forks: 416

Language: Python

License: Apache-2.0

Higher-rated alternatives

NanoNets/docstrange

Extract and convert data from any document, images, pdfs, word doc, ppt or URL into multiple...

th1nhhdk/local_ai_ocr

A local, offline (after initial setup), portable OCR software that can process images and PDF...

Dicklesworthstone/llm_aided_ocr

Enhances Tesseract OCR output using LLMs (local or API) for error correction, smart chunking,...

emcf/thepipe

Get clean data from tricky documents, powered by vision-language models ⚡

langstruct-ai/langstruct

Extract structured data from any content using LLMs.
