Sriram-PR/doc-scraper

Go web crawler to scrape documentation sites and convert content to clean Markdown for LLM ingestion (RAG, training data).

Score: 46 / 100 (Emerging)

This tool helps AI engineers, data scientists, and machine learning practitioners gather documentation from websites to feed into their Large Language Models (LLMs). You provide the URLs of the documentation sites and the content areas to extract, and it outputs clean, structured Markdown files. These files are suitable for populating an LLM's knowledge base (for example, in a RAG system) or for training new models.

Use this if you need to reliably collect and convert online technical documentation into a clean Markdown format for use with LLMs or RAG systems.

Not ideal if you're looking to scrape arbitrary web pages for general data extraction or if your primary goal isn't LLM training or RAG.

LLM training data, RAG system, documentation processing, AI data preparation, knowledge base population
No Package, No Dependents
Maintenance 10 / 25
Adoption 9 / 25
Maturity 15 / 25
Community 12 / 25


Stars: 86
Forks: 8
Language: Go
License: Apache-2.0
Last pushed: Feb 21, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/rag/Sriram-PR/doc-scraper"

Open to everyone: 100 requests/day with no key needed, or 1,000 requests/day with a free key.
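The curl request above could be issued from Go roughly as follows. This is a minimal sketch under stated assumptions: the helper names `qualityURL` and `fetchQuality` are hypothetical, the reading of the path as category/owner/repo is inferred from the example URL only, and the response format is not documented here, so the body is printed as raw text rather than decoded.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// baseURL is the endpoint prefix copied from the curl example.
const baseURL = "https://pt-edge.onrender.com/api/v1/quality"

// qualityURL builds the request URL for one repository.
// Assumption: the three path segments are category ("rag" in the
// example), repository owner, and repository name.
func qualityURL(category, owner, repo string) string {
	return baseURL + "/" + url.PathEscape(category) +
		"/" + url.PathEscape(owner) +
		"/" + url.PathEscape(repo)
}

// fetchQuality performs the GET request and returns the raw response body.
// The response schema is unknown here, so nothing is decoded.
func fetchQuality(client *http.Client, category, owner, repo string) ([]byte, error) {
	resp, err := client.Get(qualityURL(category, owner, repo))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	// A timeout keeps the call from hanging on a slow host.
	client := &http.Client{Timeout: 10 * time.Second}
	body, err := fetchQuality(client, "rag", "Sriram-PR", "doc-scraper")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println(string(body))
}
```

No API key is sent in this sketch, so calls would count against the keyless daily limit.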