Web-to-Markdown RAG Tools

Tools that crawl websites, documentation, and other web content and convert it into clean Markdown optimized for RAG pipelines and offline use. This category does NOT include PDF extraction, search indexing, or tools that don't produce Markdown output.

There are 99 web-to-Markdown RAG tools tracked. 11 score 50 or above (Established tier). The highest-rated is any4ai/AnyCrawl at 66/100 with 2,763 stars. 3 of the top 10 are actively maintained.

Get all 99 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=rag&subcategory=web-to-markdown-rag&limit=20"

Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
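The same request can be made from a script. The following is a minimal Python sketch that rebuilds the curl URL above from its three query parameters and decodes the response as JSON. The function names `build_url` and `fetch_projects` are illustrative, not part of the API, and the shape of the returned JSON is not documented on this page, so the sketch returns it undecoded beyond `json.load`. Whether a larger `limit` or pagination is supported is also not stated here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="rag", subcategory="web-to-markdown-rag", limit=20):
    """Build the query URL with the same parameters the curl command uses."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(limit=20, timeout=30):
    """Fetch and decode the JSON response (schema is an unknown here)."""
    with urlopen(build_url(limit=limit), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Print the URL only; call fetch_projects() to make the actual request.
    print(build_url())
```

Keeping URL construction separate from the network call makes it easy to check the query string without spending any of the daily request quota.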

#	Tool	Score	Tier	Stars	Language
1	any4ai/AnyCrawl AnyCrawl 🚀: A Node.js/TypeScript crawler that turns websites into LLM-ready...	66	Established	2,763	TypeScript
2	kreuzberg-dev/html-to-markdown High performance and CommonMark compliant HTML to Markdown converter....	64	Established	565	HTML
3	ScrapeGraphAI/Scrapegraph-ai Python scraper based on AI	62	Established	22,929	Python
4	adbar/trafilatura Python & Command-line tool to gather text and metadata on the Web: Crawling,...	60	Established	5,481	Python
5	paulpierre/markdown-crawler A multithreaded 🕸️ web crawler that recursively crawls a website and creates...	53	Established	431	Python
6	lightfeed/extractor Using LLMs and AI browser automation to robustly extract web data	53	Established	60	TypeScript
7	firecrawl/firecrawl-app-examples 🔥 This repository contains complete application examples, including websites...	53	Established	690	Jupyter Notebook
8	AnkitNayak-eth/CrawlAI-RAG CrawlAI RAG is an AI-powered website intelligence platform that allows users...	51	Established	93	Python
9	sigoden/rag-crawler Crawl a website to generate knowledge file for RAG	50	Established	50	TypeScript
10	rodricios/wxpath wxpath - declarative web crawling with XPath; a Web Query Language (WQL)	50	Established	108	Python
11	luisleo526/doc2mark AI-powered Python library that converts any document (PDF, Word, Excel,...	50	Established	47	Python
12	apify/rag-web-browser RAG Web Browser is an Apify Actor to feed your LLM applications and RAG...	49	Emerging	72	TypeScript
13	intergalacticalvariable/reader 📚 This is an adapted version of Jina AI's Reader for local deployment using...	49	Emerging	295	TypeScript
14	m92vyas/llm-reader Turn Webpage to LLM friendly input text. Similar to Firecrawl and Jina...	48	Emerging	280	Python
15	vishwajeetdabholkar/eGet-Crawler-for-ai Web scraping framework built for AI applications. Extract clean, structured...	45	Emerging	53	Python
16	dezoito/markitdown-api Ultra lightweight API server to convert files (.pdf, .docx, .xlsx) into...	45	Emerging	65	Python
17	opendatalab/MinerU-HTML MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean...	45	Emerging	217	HTML
18	BjornMelin/ai-docs-vector-db-hybrid-scraper Retrieval-augmented docs ingestion stack: Firecrawl + Crawl4AI + Qdrant...	44	Emerging	10	Python
19	raintree-technology/docpull Crawl any website and convert it to clean, AI-ready Markdown — async Python...	43	Emerging	20	Python
20	supacrawler/supacrawler Supacrawler's ultralight engine for scraping and crawling the web. Written...	41	Emerging	52	Go
21	mrmps/pdf2md Browser based tool to convert PDFs to Markdown	40	Emerging	303	TypeScript
22	Thordata/Thordata Official Thordata developer portal repository. Curated overview of...	40	Emerging	4	—
23	KylinMountain/markify Convert files into markdown to help RAG or LLM understand, based on...	40	Emerging	133	Python
24	mensfeld/llm-docs-builder Transform and optimize your markdown documentation for Large Language Models...	40	Emerging	80	Ruby
25	philschmid/clipper.js HTML to Markdown converter and crawler.	40	Emerging	614	TypeScript
26	iamarunbrahma/pdf-to-markdown Conversion of PDF documents to structured Markdown, optimized for Retrieval...	39	Emerging	115	Python
27	jtgsystems/free-sitemap-generator 🗺️ Free sitemap generator - Create XML sitemaps for SEO	39	Emerging	1	Python
28	yaniv-golan/ostruct Schema-first AI analysis CLI that transforms messy data into structured...	39	Emerging	8	Python
29	pc8544/Website-Crawler Extract data from websites in LLM ready JSON or CSV format. Crawl or Scrape...	38	Emerging	74	Java
30	KimSeogyu/undocx Extract clean, structured Markdown from DOCX for LLM and RAG contexts.	38	Emerging	2	Rust
31	Tendo33/arxiv-md One-click conversion of arXiv papers to Markdown with perfect LaTeX formula...	38	Emerging	4	JavaScript
32	agoodway/html2markdown Convert HTML to Markdown with Elixir	37	Emerging	37	Elixir
33	buildwithfiroz/Web2-LLM.txt Web2LLM.txt – A fast, open-source website-to-LLM context file generator....	35	Emerging	7	Python
34	aqueeb/confluence2md Convert Confluence MIME exports (.doc) to clean Markdown	33	Emerging	37	Go
35	BrowserCash/browser-serp Real-time Google Search API for AI Agents & RAG pipelines. Get structured...	33	Emerging	22	TypeScript
36	malvads/mojo Non sucking cross-platform extremely fast C++ crawler to convert entire...	32	Emerging	12	C++
37	Thordata/thordata-firecrawl Thordata Firecrawl – Firecrawl-compatible web crawling & scraping API built...	32	Emerging	2	Python
38	Karthick-840/Crawl4ai-RAG-with-Local-LLM A tool for scraping web documentation using Crawl4AI, converting it to...	32	Emerging	6	Python
39	WebCrawlerAPI/webcrawlerapi-js-sdk A WebcrawlerAPI SDK for Node JS	32	Emerging	2	TypeScript
40	arkeodev/scraper RAG-based Web Scraping	31	Emerging	14	Python
41	sethupavan12/Markdownify Convert documents, images to high-quality Markdown using Vision LLMs. Built...	31	Emerging	21	Python
42	sgowdaks/nichirin RAG and Webcrawler in a single package	30	Emerging	2	Python
43	wldevries/confluence-rag Tool that fetches Confluence pages, converts them to markdown and chunks...	30	Emerging	1	C#
44	pgEdge/pgedge-docloader A tool for converting HTML and RST docs into Markdown, and loading them into...	28	Experimental	10	Go
45	EasyDevv/project-to-markdown Project To Markdown: Project files into structured markdown, optimizing...	27	Experimental	17	Python
46	ctokx/url-to-markdown Convert webpages to clean Markdown for LLM and RAG workflows. Browser-based...	27	Experimental	7	JavaScript
47	ngpepin/pdftomd-RAG RAG workflow-friendly enhancement of Marker that converts PDFs into a...	26	Experimental	4	Shell
48	Paparusi/crawlkit 🕷️ Open-source web crawling toolkit — Video, OCR, NLP, Stealth, 10+ parsers	26	Experimental	5	Python
49	isSpicyCode/scrappe-tout Scrappe-Tout is a web scraping tool designed to convert HTML documentation...	25	Experimental	7	JavaScript
50	Thordata/thordata-cookbook Real-world recipes and examples for building AI data pipelines with Thordata.	25	Experimental	2	Jupyter Notebook
51	TylerMorrison21/paperflow Open-source PDF-to-Markdown post-processor with footnotes, LaTeX...	25	Experimental	5	Python
52	Thordata/thordata-web-qa-agent Web-native QA agent built on Thordata that delivers a Perplexity-style...	25	Experimental	2	Python
53	pedrokohler/github-repo-to-single-file TypeScript CLI that pulls a GitHub repo and merges all text-like files into...	24	Experimental	12	TypeScript
54	JamesN-dev/Scroll-Scribe ScrollScribe is a Python CLI toolkit that grabs docs or index website pages...	23	Experimental	1	Python
55	pengboyu-dev/Athanor-Epub-Converter 📘EPUB to RAG-ready Markdown with chunking, diagnostics, and clean structured output.	22	Experimental	1	Go
56	ilyashusterman/doc-to-readable Universal document-to-markdown and section splitter for HTML, URLs, and PDFs.	22	Experimental	6	JavaScript
57	marimo-marine23/xlmelt Convert complex Excel files into AI-readable JSON/HTML	22	Experimental	—	Python
58	nadya1992024/llm-parse Parse HTML and markdown offline with a lightweight, single-header C++...	22	Experimental	—	C++
59	AlphaDev007/AlphaCrawl A high-performance, asynchronous Go web crawler built to extract LLM-ready...	22	Experimental	—	Go
60	auto-medica-labs/md-tree Convert Markdown files into hierarchical JSON tree structures with optional...	22	Experimental	—	TypeScript
61	danke-global/crawl2kb Crawl a website and export embedding-ready chunks for RAG pipelines	22	Experimental	—	Go
62	pinion05/llm-page-context Turn any web page into clean LLM-ready context strings and structured documents.	22	Experimental	—	JavaScript
63	davidjsors/br-pdf-to-md-to-rag Converter from PDFs to structured Markdown, optimized for ingestion into...	22	Experimental	1	Python
64	jackise69/pdf-sentinel 🛡️ Convert PDF files to Markdown for LLM workflows with event-driven...	22	Experimental	1	JavaScript
65	vinaes/md-succ-ai URL to Markdown API — md.succ.ai	22	Experimental	1	JavaScript
66	wmahfoudh/pdf-to-md Automates the pipeline of converting PDF documents and images into clean...	21	Experimental	—	Shell
67	sumit7235/Domfie 🛠️ Simplify web scraping with Domfie, the self-healing scraper that adapts...	21	Experimental	—	Jupyter Notebook
68	gsusI/llm-docs-sync Fetch official LLM provider docs (OpenAI, Gemini) from llms.txt into...	21	Experimental	—	Shell
69	Quippy22/web2llm Fetch web pages and convert to clean Markdown for LLM pipelines	21	Experimental	—	Rust
70	bill-work/md-pdf-md 📄 Convert Markdown to visually appealing PDFs and extract Markdown from PDFs...	21	Experimental	—	TypeScript
71	GTA509FX/scrappe-tout 🚀 Convert web pages to clean Markdown fast with Playwright, perfect for...	21	Experimental	—	JavaScript
72	zcag/readdown HTML to clean Markdown optimized for LLMs. Replaces readability + turndown....	21	Experimental	—	JavaScript
73	Horlicks-p/Moelog-LLMs.txt This plugin implements the emerging llms.txt specification for WordPress,...	21	Experimental	—	PHP
74	chris-c-thomas/LexBuild Open-source toolchain that converts the U.S. Code from legislative XML...	21	Experimental	—	TypeScript
75	Ai4GenXers/pdf-sentinel Event-driven PDF to Markdown conversion for LLM workflows - 60x faster, zero...	21	Experimental	2	JavaScript
76	itsmeyessir/Domfie An autonomous web scraper that fixes its own broken selectors using a...	20	Experimental	1	Jupyter Notebook
77	moria97/fastpdf4llm Lightweight and fast library to convert PDF to markdown format.	20	Experimental	1	Python
78	SupervisedCo/HyperCrawlTurbo HypercrawlTurbo is a turbocharged web scraper for extracting URLs from a webpage.	20	Experimental	10	Python
79	kwanLeeFrmVi/Crawler4AI-to-mardown-files This project is designed to crawl documentation websites and convert them...	18	Experimental	2	Python
80	QuiddityAI/PDFerret An all-in-one converter to make your files LLM-understandable	18	Experimental	2	HTML
81	PetrAPConsulting/image2md Convert batch of pictures with structured data like tables, formulas, charts...	18	Experimental	1	Python
82	AhmedZeyadTareq/Content_To_Markdown_OCR convert any file to markdown format	18	Experimental	1	Python
83	QLangstaff/qrawl Composable web crawling tools for Rust	17	Experimental	—	Rust
84	aaronlifton/fastcrawl an agentic, atomics-driven Rust web crawler optimized for low heap usage,...	17	Experimental	2	HTML
85	Thordata/thordata-rag-pipeline 🚀 Production-grade RAG pipeline powered by Thordata Scrapers. Turn any...	15	Experimental	2	Python
86	the-ai-entrepreneur-ai-hub/ai-training-data-scraper AI Training Data Scraper - Extract LLM & RAG-Ready Web Content for Machine...	14	Experimental	—	—
87	OutofAi/manemark Manemark allows users to capture and save the text content of webpages so it...	14	Experimental	—	JavaScript
88	elementarpartikel/ultimate-web-crawler Webbdammsugare Pro v3.0 is a GUI-based web crawler for AI and...	14	Experimental	—	Python
89	siddueswar/doc-crawler-rag 🕷️ Ingest clean documentation into LLM pipelines effortlessly, filtering out...	14	Experimental	—	Python
90	amadou-6e/pymdt2json pymdt2json is a Python CLI and library for converting markdown tables into...	14	Experimental	1	Jupyter Notebook
91	bloomresearch/InSite A lightning fast tool for crawling websites and compiling PDFs of their pages	14	Experimental	1	Python
92	m1r4g3-code/Distill Distill — Turn any URL into clean, structured data for AI pipelines, RAG...	13	Experimental	—	TypeScript
93	abcd2113004/url-reader 🔍 Extract content from any URL with smart platform detection and automatic...	13	Experimental	—	Python
94	JeremySmythDigital/sitevac scrape any docs site into one AI-ready file TXT, Markdown, or pre-chunked...	13	Experimental	—	HTML
95	Edgaras0x4E/web-loader-engine High-performance web content extraction engine built in Rust. Primary...	13	Experimental	—	Rust
96	im-shashanks/PdfToMarkdown Lightweight PDF to Markdown converter.	11	Experimental	—	Python
97	andyfe76/Page-Layout-LLM-context Convert PDF/Excel/HTML to text maintaining layout	11	Experimental	—	Python
98	rahulsamant37/AI-Scraper Universal Web Scraping AI Processing Pipeline: A dynamic, AI-powered web...	11	Experimental	—	Python
99	timscodebase/docusaurus-plugin-llms-txt Generate llms.txt context files automatically from your Docusaurus build.	10	Experimental	1	JavaScript
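
The Tier column in the table above appears to follow fixed score cutoffs: every row scoring 50 or more is Established, rows from 30 to 49 are Emerging, and rows at 28 or below are Experimental. The sketch below encodes that reading. The function name `tier_for` is hypothetical, and the cutoffs are inferred from the rows, not taken from a published rule. No tool scores 29, so placing the lower boundary at 30 is an assumption.

```python
def tier_for(score: int) -> str:
    """Map a 0-100 score to a tier, using cutoffs inferred from the table."""
    if score >= 50:
        return "Established"
    if score >= 30:  # assumption: boundary lies at 30 (no row scores 29)
        return "Emerging"
    return "Experimental"


# Spot checks against boundary rows taken from the table.
rows = [
    ("any4ai/AnyCrawl", 66),
    ("sigoden/rag-crawler", 50),
    ("apify/rag-web-browser", 49),
    ("wldevries/confluence-rag", 30),
    ("pgEdge/pgedge-docloader", 28),
]
for name, score in rows:
    print(f"{name}: {score} -> {tier_for(score)}")
```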

Comparisons in this category

Scrapegraph-ai and eGet-Crawler-for-ai (62 vs 45)