Aavache/LLMWebCrawler
A web crawler based on LLMs, implemented with Ray and Hugging Face. The embeddings are saved into a vector database for fast clustering and retrieval. Use it for your RAG pipeline.
This tool helps data engineers and machine learning practitioners build specialized search applications by collecting and processing large amounts of web data. It takes a list of starting website URLs, navigates through linked pages, extracts text, and converts it into numerical representations called embeddings. The output is a searchable database of web page text and their embeddings, which can be used to find similar web content quickly.
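The pipeline described above (seed URLs, link following, text extraction, embedding, similarity lookup) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the repository's code: `FAKE_WEB`, `PageParser`, `embed`, `crawl`, and `search` are hypothetical names, the "web" is an in-memory dictionary so the example runs offline, the embedding is a toy bag-of-words vector standing in for a Hugging Face model, and a plain Python list stands in for the vector database. Ray-based parallelism is left out entirely.

```python
import math
from collections import Counter, deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "https://example.com/": '<a href="/cats">cats</a> <a href="/ray">ray</a> Welcome page',
    "https://example.com/cats": "<p>Cats are small furry pets</p>",
    "https://example.com/ray": "<p>Ray runs distributed Python tasks</p>",
}


class PageParser(HTMLParser):
    """Collects text content and outgoing links from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts, self.links = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href" and v]

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def embed(text):
    """Toy L2-normalized bag-of-words vector; a stand-in for a real model."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def cosine(a, b):
    """Dot product of two sparse, already-normalized vectors."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def crawl(seeds, fetch, max_pages=10):
    """Breadth-first crawl from the seeds; returns (url, text, embedding) rows."""
    queue, seen, rows = deque(seeds), set(seeds), []
    while queue and len(rows) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be fetched; skip it
            continue
        parser = PageParser()
        parser.feed(html)
        text = " ".join(parser.text_parts)
        rows.append((url, text, embed(text)))
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return rows


def search(rows, query, k=1):
    """Return the URLs of the k rows most similar to the query."""
    q = embed(query)
    ranked = sorted(rows, key=lambda r: cosine(q, r[2]), reverse=True)
    return [url for url, _, _ in ranked[:k]]


index = crawl(["https://example.com/"], FAKE_WEB.get)
print(search(index, "furry cats"))
```

In a real deployment, `fetch` would issue HTTP requests, `embed` would call an embedding model, and the rows would be written to a vector database instead of being kept in memory.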
No commits in the last 6 months.
Use this if you need to ingest web content at scale and organize it in a way that allows for semantic search and retrieval based on content similarity.
Not ideal if you're looking for a simple, off-the-shelf web scraping tool without needing to work with text embeddings or vector databases.
Stars: 98
Forks: 13
Language: Python
License: none listed
Category:
Last pushed: Oct 15, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/vector-db/Aavache/LLMWebCrawler"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
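The same request can be made from Python using only the standard library. This is a rough sketch under two assumptions that the listing does not confirm: that the URL path follows a category/owner/repo pattern (inferred from the single curl example above), and that the endpoint returns JSON. `quality_url` and `fetch_quality` are hypothetical helper names.

```python
import json
import urllib.request

# Base path taken from the curl example; the path layout is an assumption.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the request URL, assuming a category/owner/repo path layout."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category, owner, repo, timeout=10):
    """Fetch the listing data; assumes the response body is JSON."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


print(quality_url("vector-db", "Aavache", "LLMWebCrawler"))
```

Calling `fetch_quality("vector-db", "Aavache", "LLMWebCrawler")` would issue the actual request, which counts against the daily limit.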
Higher-rated alternatives
yichuan-w/LEANN
[MLsys2026]: RAG on Everything with LEANN. Enjoy 97% storage savings while running a fast,...
byerlikaya/SmartRAG
Multi-Modal RAG for .NET — query databases, documents, images and audio in natural language....
aws-samples/layout-aware-document-processing-and-retrieval-augmented-generation
Advanced document extraction and chunking techniques for retrieval augmented generation that is...
sourangshupal/simple-rag-langchain
Exploring the Basics of Langchain
sion42x/llama-index-milvus-example
Open AI APIs with Llama Index and Milvus Vector DB for Retrieval Augmented Generation (RAG) testing