Web-to-Markdown RAG Tools

Tools that crawl websites, documentation, and other web content and convert it into clean Markdown optimized for RAG pipelines and offline use. This category does NOT include PDF extraction, search indexing, or tools that don't produce Markdown output.

99 web-to-Markdown RAG tools are tracked. 11 score 50 or higher (Established tier). The highest-rated is any4ai/AnyCrawl at 66/100 with 2,763 stars. 3 of the top 10 are actively maintained.

Get all 99 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=rag&subcategory=web-to-markdown-rag&limit=20"

Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
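For scripted use, the same request can be made from Python. The following is a minimal sketch using only the standard library; the endpoint and query parameters are copied from the curl command above, while the function names `build_url` and `fetch_projects` are hypothetical helpers introduced for illustration. The response schema is not described on this page, so the sketch returns the decoded JSON untouched rather than assuming any field names.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint copied from the curl command on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="rag", subcategory="web-to-markdown-rag", limit=20):
    """Build the request URL with the same query parameters as the curl call."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Perform the GET request and return the decoded JSON body as-is.

    No field names are assumed, because the response schema is not
    documented on this page.
    """
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_projects(build_url())` issues one request, which counts against the daily quota mentioned above. The example URL uses `limit=20`; whether a larger value returns all 99 projects in one call is not stated here, so treat that as something to verify against the API.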

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | any4ai/AnyCrawl | AnyCrawl 🚀: A Node.js/TypeScript crawler that turns websites into LLM-ready... | 66 | Established |
| 2 | kreuzberg-dev/html-to-markdown | High performance and CommonMark compliant HTML to Markdown converter.... | 64 | Established |
| 3 | ScrapeGraphAI/Scrapegraph-ai | Python scraper based on AI | 62 | Established |
| 4 | adbar/trafilatura | Python & Command-line tool to gather text and metadata on the Web: Crawling,... | 60 | Established |
| 5 | paulpierre/markdown-crawler | A multithreaded 🕸️ web crawler that recursively crawls a website and creates... | 53 | Established |
| 6 | lightfeed/extractor | Using LLMs and AI browser automation to robustly extract web data | 53 | Established |
| 7 | firecrawl/firecrawl-app-examples | 🔥 This repository contains complete application examples, including websites... | 53 | Established |
| 8 | AnkitNayak-eth/CrawlAI-RAG | CrawlAI RAG is an AI-powered website intelligence platform that allows users... | 51 | Established |
| 9 | sigoden/rag-crawler | Crawl a website to generate knowledge file for RAG | 50 | Established |
| 10 | rodricios/wxpath | wxpath - declarative web crawling with XPath; a Web Query Language (WQL) | 50 | Established |
| 11 | luisleo526/doc2mark | AI-powered Python library that converts any document (PDF, Word, Excel,... | 50 | Established |
| 12 | apify/rag-web-browser | RAG Web Browser is an Apify Actor to feed your LLM applications and RAG... | 49 | Emerging |
| 13 | intergalacticalvariable/reader | 📚 This is an adapted version of Jina AI's Reader for local deployment using... | 49 | Emerging |
| 14 | m92vyas/llm-reader | Turn Webpage to LLM friendly input text. Similar to Firecrawl and Jina... | 48 | Emerging |
| 15 | vishwajeetdabholkar/eGet-Crawler-for-ai | Web scraping framework built for AI applications. Extract clean, structured... | 45 | Emerging |
| 16 | dezoito/markitdown-api | Ultra lightweight API server to convert files (.pdf, .docx, .xlsx) into... | 45 | Emerging |
| 17 | opendatalab/MinerU-HTML | MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean... | 45 | Emerging |
| 18 | BjornMelin/ai-docs-vector-db-hybrid-scraper | Retrieval-augmented docs ingestion stack: Firecrawl + Crawl4AI + Qdrant... | 44 | Emerging |
| 19 | raintree-technology/docpull | Crawl any website and convert it to clean, AI-ready Markdown - async Python... | 43 | Emerging |
| 20 | supacrawler/supacrawler | Supacrawler's ultralight engine for scraping and crawling the web. Written... | 41 | Emerging |
| 21 | mrmps/pdf2md | Browser based tool to convert PDFs to Markdown | 40 | Emerging |
| 22 | Thordata/Thordata | Official Thordata developer portal repository. Curated overview of... | 40 | Emerging |
| 23 | KylinMountain/markify | Convert files into markdown to help RAG or LLM understand, based on... | 40 | Emerging |
| 24 | mensfeld/llm-docs-builder | Transform and optimize your markdown documentation for Large Language Models... | 40 | Emerging |
| 25 | philschmid/clipper.js | HTML to Markdown converter and crawler. | 40 | Emerging |
| 26 | iamarunbrahma/pdf-to-markdown | Conversion of PDF documents to structured Markdown, optimized for Retrieval... | 39 | Emerging |
| 27 | jtgsystems/free-sitemap-generator | 🗺️ Free sitemap generator - Create XML sitemaps for SEO | 39 | Emerging |
| 28 | yaniv-golan/ostruct | Schema-first AI analysis CLI that transforms messy data into structured... | 39 | Emerging |
| 29 | pc8544/Website-Crawler | Extract data from websites in LLM ready JSON or CSV format. Crawl or Scrape... | 38 | Emerging |
| 30 | KimSeogyu/undocx | Extract clean, structured Markdown from DOCX for LLM and RAG contexts. | 38 | Emerging |
| 31 | Tendo33/arxiv-md | One-click conversion of arXiv papers to Markdown with perfect LaTeX formula... | 38 | Emerging |
| 32 | agoodway/html2markdown | Convert HTML to Markdown with Elixir | 37 | Emerging |
| 33 | buildwithfiroz/Web2-LLM.txt | Web2LLM.txt – A fast, open-source website-to-LLM context file generator.... | 35 | Emerging |
| 34 | aqueeb/confluence2md | Convert Confluence MIME exports (.doc) to clean Markdown | 33 | Emerging |
| 35 | BrowserCash/browser-serp | Real-time Google Search API for AI Agents & RAG pipelines. Get structured... | 33 | Emerging |
| 36 | malvads/mojo | Non sucking cross-platform extremely fast C++ crawler to convert entire... | 32 | Emerging |
| 37 | Thordata/thordata-firecrawl | Thordata Firecrawl – Firecrawl-compatible web crawling & scraping API built... | 32 | Emerging |
| 38 | Karthick-840/Crawl4ai-RAG-with-Local-LLM | A tool for scraping web documentation using Crawl4AI, converting it to... | 32 | Emerging |
| 39 | WebCrawlerAPI/webcrawlerapi-js-sdk | A WebcrawlerAPI SDK for Node JS | 32 | Emerging |
| 40 | arkeodev/scraper | RAG-based Web Scraping | 31 | Emerging |
| 41 | sethupavan12/Markdownify | Convert documents, images to high-quality Markdown using Vision LLMs. Built... | 31 | Emerging |
| 42 | sgowdaks/nichirin | RAG and Webcrawler in a single package | 30 | Emerging |
| 43 | wldevries/confluence-rag | Tool that fetches Confluence pages, converts them to markdown and chunks... | 30 | Emerging |
| 44 | pgEdge/pgedge-docloader | A tool for converting HTML and RST docs into Markdown, and loading them into... | 28 | Experimental |
| 45 | EasyDevv/project-to-markdown | Project To Markdown: Project files into structured markdown, optimizing... | 27 | Experimental |
| 46 | ctokx/url-to-markdown | Convert webpages to clean Markdown for LLM and RAG workflows. Browser-based... | 27 | Experimental |
| 47 | ngpepin/pdftomd-RAG | RAG workflow-friendly enhancement of Marker that converts PDFs into a... | 26 | Experimental |
| 48 | Paparusi/crawlkit | 🕷️ Open-source web crawling toolkit - Video, OCR, NLP, Stealth, 10+ parsers | 26 | Experimental |
| 49 | isSpicyCode/scrappe-tout | Scrappe-Tout is a web scraping tool designed to convert HTML documentation... | 25 | Experimental |
| 50 | Thordata/thordata-cookbook | Real-world recipes and examples for building AI data pipelines with Thordata. | 25 | Experimental |
| 51 | TylerMorrison21/paperflow | Open-source PDF-to-Markdown post-processor with footnotes, LaTeX... | 25 | Experimental |
| 52 | Thordata/thordata-web-qa-agent | Web-native QA agent built on Thordata that delivers a Perplexity-style... | 25 | Experimental |
| 53 | pedrokohler/github-repo-to-single-file | TypeScript CLI that pulls a GitHub repo and merges all text-like files into... | 24 | Experimental |
| 54 | JamesN-dev/Scroll-Scribe | ScrollScribe is a Python CLI toolkit that grab docs or index website pages... | 23 | Experimental |
| 55 | pengboyu-dev/Athanor-Epub-Converter | 📘 EPUB to RAG-ready Markdown with chunking, diagnostics, and clean structured output. | 22 | Experimental |
| 56 | ilyashusterman/doc-to-readable | Universal document-to-markdown and section splitter for HTML, URLs, and PDFs. | 22 | Experimental |
| 57 | marimo-marine23/xlmelt | Convert complex Excel files into AI-readable JSON/HTML | 22 | Experimental |
| 58 | nadya1992024/llm-parse | Parse HTML and markdown offline with a lightweight, single-header C++... | 22 | Experimental |
| 59 | AlphaDev007/AlphaCrawl | A high-performance, asynchronous Go web crawler built to extract LLM-ready... | 22 | Experimental |
| 60 | auto-medica-labs/md-tree | Convert Markdown files into hierarchical JSON tree structures with optional... | 22 | Experimental |
| 61 | danke-global/crawl2kb | Crawl a website and export embedding-ready chunks for RAG pipelines | 22 | Experimental |
| 62 | pinion05/llm-page-context | Turn any web page into clean LLM-ready context strings and structured documents. | 22 | Experimental |
| 63 | davidjsors/br-pdf-to-md-to-rag | Converter of PDFs to structured Markdown, optimized for ingestion into... | 22 | Experimental |
| 64 | jackise69/pdf-sentinel | 🛡️ Convert PDF files to Markdown for LLM workflows with event-driven... | 22 | Experimental |
| 65 | vinaes/md-succ-ai | URL to Markdown API - md.succ.ai | 22 | Experimental |
| 66 | wmahfoudh/pdf-to-md | Automates the pipeline of converting PDF documents and images into clean... | 21 | Experimental |
| 67 | sumit7235/Domfie | 🛠️ Simplify web scraping with Domfie, the self-healing scraper that adapts... | 21 | Experimental |
| 68 | gsusI/llm-docs-sync | Fetch official LLM provider docs (OpenAI, Gemini) from llms.txt into... | 21 | Experimental |
| 69 | Quippy22/web2llm | Fetch web pages and convert to clean Markdown for LLM pipelines | 21 | Experimental |
| 70 | bill-work/md-pdf-md | 📄 Convert Markdown to visually appealing PDFs and extract Markdown from PDFs... | 21 | Experimental |
| 71 | GTA509FX/scrappe-tout | 🚀 Convert web pages to clean Markdown fast with Playwright, perfect for... | 21 | Experimental |
| 72 | zcag/readdown | HTML to clean Markdown optimized for LLMs. Replaces readability + turndown.... | 21 | Experimental |
| 73 | Horlicks-p/Moelog-LLMs.txt | This plugin implements the emerging llms.txt specification for WordPress,... | 21 | Experimental |
| 74 | chris-c-thomas/LexBuild | Open-source toolchain that converts the U.S. Code from legislative XML... | 21 | Experimental |
| 75 | Ai4GenXers/pdf-sentinel | Event-driven PDF to Markdown conversion for LLM workflows - 60x faster, zero... | 21 | Experimental |
| 76 | itsmeyessir/Domfie | An autonomous web scraper that fixes its own broken selectors using a... | 20 | Experimental |
| 77 | moria97/fastpdf4llm | Lightweight and fast library to convert PDF to markdown format. | 20 | Experimental |
| 78 | SupervisedCo/HyperCrawlTurbo | HypercrawlTurbo is a turbocharged web scraper for extracting URLs from a webpage. | 20 | Experimental |
| 79 | kwanLeeFrmVi/Crawler4AI-to-mardown-files | This project is designed to crawl documentation websites and convert them... | 18 | Experimental |
| 80 | QuiddityAI/PDFerret | An all-in-one converter to make your files LLM-understandable | 18 | Experimental |
| 81 | PetrAPConsulting/image2md | Convert batch of pictures with structured data like tables, formulas, charts... | 18 | Experimental |
| 82 | AhmedZeyadTareq/Content_To_Markdown_OCR | convert any file to markdown format | 18 | Experimental |
| 83 | QLangstaff/qrawl | Composable web crawling tools for Rust | 17 | Experimental |
| 84 | aaronlifton/fastcrawl | an agentic, atomics-driven Rust web crawler optimized for low heap usage,... | 17 | Experimental |
| 85 | Thordata/thordata-rag-pipeline | 🚀 Production-grade RAG pipeline powered by Thordata Scrapers. Turn any... | 15 | Experimental |
| 86 | the-ai-entrepreneur-ai-hub/ai-training-data-scraper | AI Training Data Scraper - Extract LLM & RAG-Ready Web Content for Machine... | 14 | Experimental |
| 87 | OutofAi/manemark | Manemark allows users to capture and save the text content of webpages so it... | 14 | Experimental |
| 88 | elementarpartikel/ultimate-web-crawler | Webbdammsugare Pro v3.0 is a GUI-based web crawler for AI and... | 14 | Experimental |
| 89 | siddueswar/doc-crawler-rag | 🕷️ Ingest clean documentation into LLM pipelines effortlessly, filtering out... | 14 | Experimental |
| 90 | amadou-6e/pymdt2json | pymdt2json is a Python CLI and library for converting markdown tables into... | 14 | Experimental |
| 91 | bloomresearch/InSite | A lightning fast tool for crawling websites and compiling PDFs of their pages | 14 | Experimental |
| 92 | m1r4g3-code/Distill | Distill - Turn any URL into clean, structured data for AI pipelines, RAG... | 13 | Experimental |
| 93 | abcd2113004/url-reader | 🔍 Extract content from any URL with smart platform detection and automatic... | 13 | Experimental |
| 94 | JeremySmythDigital/sitevac | scrape any docs site into one AI-ready file TXT, Markdown, or pre-chunked... | 13 | Experimental |
| 95 | Edgaras0x4E/web-loader-engine | High-performance web content extraction engine built in Rust. Primary... | 13 | Experimental |
| 96 | im-shashanks/PdfToMarkdown | Lightweight PDF to Markdown converter. | 11 | Experimental |
| 97 | andyfe76/Page-Layout-LLM-context | Convert PDF/Excel/HTML to text maintaining layout | 11 | Experimental |
| 98 | rahulsamant37/AI-Scraper | Universal Web Scraping AI Processing Pipeline: A dynamic, AI-powered web... | 11 | Experimental |
| 99 | timscodebase/docusaurus-plugin-llms-txt | Generate llms.txt context files automatically from your Docusaurus build. | 10 | Experimental |
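
The tier labels in the listing above appear to follow fixed score bands. The sketch below encodes one reading of them, inferred only from the listed rows: 50 is the lowest score labeled Established, 30 the lowest labeled Emerging, and 28 the highest labeled Experimental. No tool scores 29, so where exactly the Emerging/Experimental boundary sits is an assumption, as are the names `tier_for`, `count_tiers`, and `SAMPLE`.

```python
from collections import Counter

# Cutoffs inferred from the listing, not from any published scoring rule.
ESTABLISHED_MIN = 50  # lowest score shown with the "Established" label
EMERGING_MIN = 30     # lowest score shown with the "Emerging" label (assumed cutoff)


def tier_for(score):
    """Map a 0-100 quality score to the tier label used in the listing."""
    if score >= ESTABLISHED_MIN:
        return "Established"
    if score >= EMERGING_MIN:
        return "Emerging"
    return "Experimental"


def count_tiers(rows):
    """Count how many (tool, score) rows fall into each tier."""
    return Counter(tier_for(score) for _tool, score in rows)


# A few (tool, score) pairs copied from the listing, including boundary rows.
SAMPLE = [
    ("any4ai/AnyCrawl", 66),
    ("luisleo526/doc2mark", 50),
    ("apify/rag-web-browser", 49),
    ("wldevries/confluence-rag", 30),
    ("pgEdge/pgedge-docloader", 28),
    ("timscodebase/docusaurus-plugin-llms-txt", 10),
]
```

Running `count_tiers` over all 99 rows is a quick way to check the labels against the scores; under these assumed cutoffs, the 11 tools scoring 50 or higher are the only ones that map to Established.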