Sriram-PR/doc-scraper

Go web crawler to scrape documentation sites and convert content to clean Markdown for LLM ingestion (RAG, training data).

Emerging

This tool helps AI engineers, data scientists, and machine learning practitioners gather documentation from websites to feed into their Large Language Models (LLMs). You provide the URLs of the documentation sites and the specific content areas to extract, and it outputs clean, structured Markdown files. These files can be added to an LLM's knowledge base or used as training data for new models.

Use this if you need to reliably collect and convert online technical documentation into a clean Markdown format for use with LLMs or RAG systems.

Not ideal if you're looking to scrape arbitrary web pages for general data extraction or if your primary goal isn't LLM training or RAG.

LLM training data, RAG system, documentation processing, AI data preparation, knowledge base population

No Package, No Dependents

Maintenance 10 / 25

Adoption 9 / 25

Maturity 15 / 25

Community 12 / 25

License: Apache-2.0

Related tools

techdebtgpt/architecture-doc-generator

AI-powered architecture documentation generator with RAG, hybrid retrieval (semantic +...

ThalesMMS/reports-to-llm

Convert medical reports from DOCX/RTF to clean text optimized for LLM training and RAG systems.
