Web-to-Markdown RAG Tools
Tools that crawl websites, documentation, and other web content and convert it into clean Markdown optimized for RAG pipelines and offline use. Does NOT include PDF extraction, search indexing, or tools that don't produce Markdown output.
99 web-to-Markdown RAG tools are tracked. 11 score above 50 (Established tier). The highest-rated is any4ai/AnyCrawl at 66/100 with 2,763 stars. 3 of the top 10 are actively maintained.
Get all 99 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=rag&subcategory=web-to-markdown-rag&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
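The same request can be issued from a script. The following is a minimal Python sketch, assuming only what the curl command above shows: the endpoint URL and the `domain`, `subcategory`, and `limit` query parameters. The function names `build_url` and `fetch_projects` are illustrative, and the shape of the JSON response is not documented here, so the parsed payload is returned as-is.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example; everything else is illustrative.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="rag", subcategory="web-to-markdown-rag", limit=20):
    """Assemble the dataset URL with the query parameters from the curl example."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(limit=20, timeout=30):
    """Fetch and parse the JSON response.

    The payload structure is an assumption left to the caller to inspect.
    """
    with urllib.request.urlopen(build_url(limit=limit), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_projects()` performs the network request, so it counts against the daily request quota described above.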
| # | Tool | Score | Tier |
|---|---|---|---|
| 1 | any4ai/AnyCrawl: AnyCrawl: A Node.js/TypeScript crawler that turns websites into LLM-ready... | | Established |
| 2 | kreuzberg-dev/html-to-markdown: High performance and CommonMark compliant HTML to Markdown converter.... | | Established |
| 3 | ScrapeGraphAI/Scrapegraph-ai: Python scraper based on AI | | Established |
| 4 | adbar/trafilatura: Python & Command-line tool to gather text and metadata on the Web: Crawling,... | | Established |
| 5 | paulpierre/markdown-crawler: A multithreaded web crawler that recursively crawls a website and creates... | | Established |
| 6 | lightfeed/extractor: Using LLMs and AI browser automation to robustly extract web data | | Established |
| 7 | firecrawl/firecrawl-app-examples: This repository contains complete application examples, including websites... | | Established |
| 8 | AnkitNayak-eth/CrawlAI-RAG: CrawlAI RAG is an AI-powered website intelligence platform that allows users... | | Established |
| 9 | sigoden/rag-crawler: Crawl a website to generate knowledge file for RAG | | Established |
| 10 | rodricios/wxpath: wxpath - declarative web crawling with XPath; a Web Query Language (WQL) | | Established |
| 11 | luisleo526/doc2mark: AI-powered Python library that converts any document (PDF, Word, Excel,... | | Established |
| 12 | apify/rag-web-browser: RAG Web Browser is an Apify Actor to feed your LLM applications and RAG... | | Emerging |
| 13 | intergalacticalvariable/reader: This is an adapted version of Jina AI's Reader for local deployment using... | | Emerging |
| 14 | m92vyas/llm-reader: Turn Webpage to LLM friendly input text. Similar to Firecrawl and Jina... | | Emerging |
| 15 | vishwajeetdabholkar/eGet-Crawler-for-ai: Web scraping framework built for AI applications. Extract clean, structured... | | Emerging |
| 16 | dezoito/markitdown-api: Ultra lightweight API server to convert files (.pdf, .docx, .xlsx) into... | | Emerging |
| 17 | opendatalab/MinerU-HTML: MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean... | | Emerging |
| 18 | BjornMelin/ai-docs-vector-db-hybrid-scraper: Retrieval-augmented docs ingestion stack: Firecrawl + Crawl4AI + Qdrant... | | Emerging |
| 19 | raintree-technology/docpull: Crawl any website and convert it to clean, AI-ready Markdown - async Python... | | Emerging |
| 20 | supacrawler/supacrawler: Supacrawler's ultralight engine for scraping and crawling the web. Written... | | Emerging |
| 21 | mrmps/pdf2md: Browser based tool to convert PDFs to Markdown | | Emerging |
| 22 | Thordata/Thordata: Official Thordata developer portal repository. Curated overview of... | | Emerging |
| 23 | KylinMountain/markify: Convert files into markdown to help RAG or LLM understand, based on... | | Emerging |
| 24 | mensfeld/llm-docs-builder: Transform and optimize your markdown documentation for Large Language Models... | | Emerging |
| 25 | philschmid/clipper.js: HTML to Markdown converter and crawler. | | Emerging |
| 26 | iamarunbrahma/pdf-to-markdown: Conversion of PDF documents to structured Markdown, optimized for Retrieval... | | Emerging |
| 27 | jtgsystems/free-sitemap-generator: Free sitemap generator - Create XML sitemaps for SEO | | Emerging |
| 28 | yaniv-golan/ostruct: Schema-first AI analysis CLI that transforms messy data into structured... | | Emerging |
| 29 | pc8544/Website-Crawler: Extract data from websites in LLM ready JSON or CSV format. Crawl or Scrape... | | Emerging |
| 30 | KimSeogyu/undocx: Extract clean, structured Markdown from DOCX for LLM and RAG contexts. | | Emerging |
| 31 | Tendo33/arxiv-md: One-click conversion of arXiv papers to Markdown with perfect LaTeX formula... | | Emerging |
| 32 | agoodway/html2markdown: Convert HTML to Markdown with Elixir | | Emerging |
| 33 | buildwithfiroz/Web2-LLM.txt: Web2LLM.txt - A fast, open-source website-to-LLM context file generator.... | | Emerging |
| 34 | aqueeb/confluence2md: Convert Confluence MIME exports (.doc) to clean Markdown | | Emerging |
| 35 | BrowserCash/browser-serp: Real-time Google Search API for AI Agents & RAG pipelines. Get structured... | | Emerging |
| 36 | malvads/mojo: Non sucking cross-platform extremely fast C++ crawler to convert entire... | | Emerging |
| 37 | Thordata/thordata-firecrawl: Thordata Firecrawl - Firecrawl-compatible web crawling & scraping API built... | | Emerging |
| 38 | Karthick-840/Crawl4ai-RAG-with-Local-LLM: A tool for scraping web documentation using Crawl4AI, converting it to... | | Emerging |
| 39 | WebCrawlerAPI/webcrawlerapi-js-sdk: A WebcrawlerAPI SDK for Node JS | | Emerging |
| 40 | arkeodev/scraper: RAG-based Web Scraping | | Emerging |
| 41 | sethupavan12/Markdownify: Convert documents, images to high-quality Markdown using Vision LLMs. Built... | | Emerging |
| 42 | sgowdaks/nichirin: RAG and Webcrawler in a single package | | Emerging |
| 43 | wldevries/confluence-rag: Tool that fetches Confluence pages, converts them to markdown and chunks... | | Emerging |
| 44 | pgEdge/pgedge-docloader: A tool for converting HTML and RST docs into Markdown, and loading them into... | | Experimental |
| 45 | EasyDevv/project-to-markdown: Project To Markdown: Project files into structured markdown, optimizing... | | Experimental |
| 46 | ctokx/url-to-markdown: Convert webpages to clean Markdown for LLM and RAG workflows. Browser-based... | | Experimental |
| 47 | ngpepin/pdftomd-RAG: RAG workflow-friendly enhancement of Marker that converts PDFs into a... | | Experimental |
| 48 | Paparusi/crawlkit: Open-source web crawling toolkit - Video, OCR, NLP, Stealth, 10+ parsers | | Experimental |
| 49 | isSpicyCode/scrappe-tout: Scrappe-Tout is a web scraping tool designed to convert HTML documentation... | | Experimental |
| 50 | Thordata/thordata-cookbook: Real-world recipes and examples for building AI data pipelines with Thordata. | | Experimental |
| 51 | TylerMorrison21/paperflow: Open-source PDF-to-Markdown post-processor with footnotes, LaTeX... | | Experimental |
| 52 | Thordata/thordata-web-qa-agent: Web-native QA agent built on Thordata that delivers a Perplexity-style... | | Experimental |
| 53 | pedrokohler/github-repo-to-single-file: TypeScript CLI that pulls a GitHub repo and merges all text-like files into... | | Experimental |
| 54 | JamesN-dev/Scroll-Scribe: ScrollScribe is a Python CLI toolkit that grab docs or index website pages... | | Experimental |
| 55 | pengboyu-dev/Athanor-Epub-Converter: EPUB to RAG-ready Markdown with chunking, diagnostics, and clean structured output. | | Experimental |
| 56 | ilyashusterman/doc-to-readable: Universal document-to-markdown and section splitter for HTML, URLs, and PDFs. | | Experimental |
| 57 | marimo-marine23/xlmelt: Convert complex Excel files into AI-readable JSON/HTML | | Experimental |
| 58 | nadya1992024/llm-parse: Parse HTML and markdown offline with a lightweight, single-header C++... | | Experimental |
| 59 | AlphaDev007/AlphaCrawl: A high-performance, asynchronous Go web crawler built to extract LLM-ready... | | Experimental |
| 60 | auto-medica-labs/md-tree: Convert Markdown files into hierarchical JSON tree structures with optional... | | Experimental |
| 61 | danke-global/crawl2kb: Crawl a website and export embedding-ready chunks for RAG pipelines | | Experimental |
| 62 | pinion05/llm-page-context: Turn any web page into clean LLM-ready context strings and structured documents. | | Experimental |
| 63 | davidjsors/br-pdf-to-md-to-rag: Converter from PDFs to structured Markdown, optimized for ingestion into... | | Experimental |
| 64 | jackise69/pdf-sentinel: Convert PDF files to Markdown for LLM workflows with event-driven... | | Experimental |
| 65 | vinaes/md-succ-ai: URL to Markdown API - md.succ.ai | | Experimental |
| 66 | wmahfoudh/pdf-to-md: Automates the pipeline of converting PDF documents and images into clean... | | Experimental |
| 67 | sumit7235/Domfie: Simplify web scraping with Domfie, the self-healing scraper that adapts... | | Experimental |
| 68 | gsusI/llm-docs-sync: Fetch official LLM provider docs (OpenAI, Gemini) from llms.txt into... | | Experimental |
| 69 | Quippy22/web2llm: Fetch web pages and convert to clean Markdown for LLM pipelines | | Experimental |
| 70 | bill-work/md-pdf-md: Convert Markdown to visually appealing PDFs and extract Markdown from PDFs... | | Experimental |
| 71 | GTA509FX/scrappe-tout: Convert web pages to clean Markdown fast with Playwright, perfect for... | | Experimental |
| 72 | zcag/readdown: HTML to clean Markdown optimized for LLMs. Replaces readability + turndown.... | | Experimental |
| 73 | Horlicks-p/Moelog-LLMs.txt: This plugin implements the emerging llms.txt specification for WordPress,... | | Experimental |
| 74 | chris-c-thomas/LexBuild: Open-source toolchain that converts the U.S. Code from legislative XML... | | Experimental |
| 75 | Ai4GenXers/pdf-sentinel: Event-driven PDF to Markdown conversion for LLM workflows - 60x faster, zero... | | Experimental |
| 76 | itsmeyessir/Domfie: An autonomous web scraper that fixes its own broken selectors using a... | | Experimental |
| 77 | moria97/fastpdf4llm: Lightweight and fast library to convert PDF to markdown format. | | Experimental |
| 78 | SupervisedCo/HyperCrawlTurbo: HypercrawlTurbo is a turbocharged web scraper for extracting URLs from a webpage. | | Experimental |
| 79 | kwanLeeFrmVi/Crawler4AI-to-mardown-files: This project is designed to crawl documentation websites and convert them... | | Experimental |
| 80 | QuiddityAI/PDFerret: An all-in-one converter to make your files LLM-understandable | | Experimental |
| 81 | PetrAPConsulting/image2md: Convert batch of pictures with structured data like tables, formulas, charts... | | Experimental |
| 82 | AhmedZeyadTareq/Content_To_Markdown_OCR: convert any file to markdown format | | Experimental |
| 83 | QLangstaff/qrawl: Composable web crawling tools for Rust | | Experimental |
| 84 | aaronlifton/fastcrawl: an agentic, atomics-driven Rust web crawler optimized for low heap usage,... | | Experimental |
| 85 | Thordata/thordata-rag-pipeline: Production-grade RAG pipeline powered by Thordata Scrapers. Turn any... | | Experimental |
| 86 | the-ai-entrepreneur-ai-hub/ai-training-data-scraper: AI Training Data Scraper - Extract LLM & RAG-Ready Web Content for Machine... | | Experimental |
| 87 | OutofAi/manemark: Manemark allows users to capture and save the text content of webpages so it... | | Experimental |
| 88 | elementarpartikel/ultimate-web-crawler: Webbdammsugare Pro v3.0 is a GUI-based web crawler for AI and... | | Experimental |
| 89 | siddueswar/doc-crawler-rag: Ingest clean documentation into LLM pipelines effortlessly, filtering out... | | Experimental |
| 90 | amadou-6e/pymdt2json: pymdt2json is a Python CLI and library for converting markdown tables into... | | Experimental |
| 91 | bloomresearch/InSite: A lightning fast tool for crawling websites and compiling PDFs of their pages | | Experimental |
| 92 | m1r4g3-code/Distill: Distill - Turn any URL into clean, structured data for AI pipelines, RAG... | | Experimental |
| 93 | abcd2113004/url-reader: Extract content from any URL with smart platform detection and automatic... | | Experimental |
| 94 | JeremySmythDigital/sitevac: scrape any docs site into one AI-ready file TXT, Markdown, or pre-chunked... | | Experimental |
| 95 | Edgaras0x4E/web-loader-engine: High-performance web content extraction engine built in Rust. Primary... | | Experimental |
| 96 | im-shashanks/PdfToMarkdown: Lightweight PDF to Markdown converter. | | Experimental |
| 97 | andyfe76/Page-Layout-LLM-context: Convert PDF/Excel/HTML to text maintaining layout | | Experimental |
| 98 | rahulsamant37/AI-Scraper: Universal Web Scraping AI Processing Pipeline: A dynamic, AI-powered web... | | Experimental |
| 99 | timscodebase/docusaurus-plugin-llms-txt: Generate llms.txt context files automatically from your Docusaurus build. | | Experimental |