QuivrHQ/MegaParse
File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.
This tool helps you convert complex documents like PDFs, Word files, and PowerPoints into a clean, comprehensive text format that AI models (LLMs) can easily understand. It takes your existing documents and processes them, ensuring all critical information, including tables and images, is preserved, producing highly accurate text ready for AI analysis or querying. Anyone building applications that use AI to read and interpret business documents, research papers, or reports would find this useful.
7,347 stars. No commits in the last 6 months.
Use this if you need to reliably extract all content from diverse document types (PDFs, Word, PowerPoint) for use with large language models, without losing any critical information like tables or image context.
Not ideal if you only need simple text extraction without concern for preserving complex formatting, tables, or integrating with advanced AI models.
Stars: 7,347
Forks: 416
Language: Python
License: Apache-2.0
Category:
Last pushed: Feb 21, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/QuivrHQ/MegaParse"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
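The same request can be made from a script. The sketch below is a minimal illustration using only the Python standard library. It assumes the path follows the pattern `quality/<category>/<owner>/<repo>` (inferred from the single URL above) and that the endpoint returns JSON, which this page does not state. The function names `build_quality_url` and `fetch_quality` are hypothetical helpers, not part of any published client.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the request URL (hypothetical helper).

    Assumes the path layout <category>/<owner>/<repo>, inferred from
    the one example URL shown on this page.
    """
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the quality record for a repository (hypothetical helper).

    Assumes the response body is JSON; the page does not document
    the response format.
    """
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("llm-tools", "QuivrHQ", "MegaParse")` would issue the same GET request as the curl command. Without a key, calls count against the 100 requests/day limit.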
Higher-rated alternatives
NanoNets/docstrange
Extract and convert data from any document, images, pdfs, word doc, ppt or URL into multiple...
th1nhhdk/local_ai_ocr
A local, offline (after initial setup), portable OCR software that can process images and PDF...
Dicklesworthstone/llm_aided_ocr
Enhances Tesseract OCR output using LLMs (local or API) for error correction, smart chunking,...
emcf/thepipe
Get clean data from tricky documents, powered by vision-language models ⚡
langstruct-ai/langstruct
Extract structured data from any content using LLMs.