Aavache/LLMWebCrawler

A web crawler based on LLMs, implemented with Ray and Hugging Face. The embeddings are saved into a vector database for fast clustering and retrieval. Use it for your RAG.

/ 100

Emerging

This tool helps data engineers and machine learning practitioners build specialized search applications by collecting and processing large amounts of web data. It takes a list of starting website URLs, follows the links on each page, extracts the text, and converts it into numerical representations called embeddings. The output is a searchable database that pairs each page's text with its embedding, so similar web content can be found quickly.
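The crawl step described above (start from seed URLs, follow links, extract text) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the repository's code: it does not use Ray or Hugging Face, and `FAKE_WEB`, `PageParser`, and `crawl` are hypothetical names introduced here. The in-memory `FAKE_WEB` dictionary stands in for real HTTP fetching so the sketch runs offline; the embedding and storage steps are left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs offline; a real crawler
# would fetch each URL over HTTP instead.
FAKE_WEB = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a><p>Home page</p>',
    "https://example.com/a": '<p>Vector databases store embeddings.</p><a href="/b">B</a>',
    "https://example.com/b": '<p>Ray schedules parallel tasks.</p><a href="/">home</a>',
}


class PageParser(HTMLParser):
    """Collects outgoing links and visible text from one HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Resolve relative hrefs against the URL of the page being parsed.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        # Keep only non-empty text fragments.
        if data.strip():
            self.text_parts.append(data.strip())


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl from seed_urls; returns {url: extracted_text}."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # unknown or unreachable page
            continue
        parser = PageParser(url)
        parser.feed(html)
        pages[url] = " ".join(parser.text_parts)
        for link in parser.links:
            if link not in seen:  # never enqueue the same URL twice
                seen.add(link)
                queue.append(link)
    return pages


pages = crawl(["https://example.com/"], FAKE_WEB.get)
for url, text in pages.items():
    print(url, "->", text)
```

The `seen` set is what keeps the crawl from looping when pages link back to each other, and `max_pages` bounds the work; a distributed version would presumably hand the fetch-and-parse step to parallel workers.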

No commits in the last 6 months.

Use this if you need to ingest web content at scale and organize it in a way that allows for semantic search and retrieval based on content similarity.
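To make "retrieval based on content similarity" concrete, here is a small sketch of a nearest-neighbour lookup over stored vectors. It is an illustration under stated assumptions, not the project's implementation: `embed` is a toy word-count vector standing in for a real embedding model (so it matches shared words, not meaning), and `TinyVectorIndex` is a hypothetical in-memory stand-in for a vector database.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts. A stand-in for a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


class TinyVectorIndex:
    """Hypothetical in-memory index: stores (url, vector) pairs."""

    def __init__(self):
        self.items = []

    def add(self, url, text):
        self.items.append((url, embed(text)))

    def search(self, query, k=3):
        # Score every stored vector against the query and keep the top k.
        q = embed(query)
        scored = [(cosine(q, vec), url) for url, vec in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


index = TinyVectorIndex()
index.add("https://example.com/a", "Vector databases store embeddings for retrieval")
index.add("https://example.com/b", "Ray schedules parallel tasks across a cluster")
index.add("https://example.com/c", "Recipes for sourdough bread")

results = index.search("where are embeddings stored in a vector database", k=2)
for score, url in results:
    print(f"{score:.3f}  {url}")
```

With a learned embedding model in place of `embed`, the same ranking loop would also match pages that share meaning but not vocabulary; a real vector database replaces the linear scan with an approximate nearest-neighbour index.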

Not ideal if you're looking for a simple, off-the-shelf web scraping tool without needing to work with text embeddings or vector databases.

information-retrieval web-data-collection search-engine-building semantic-search data-engineering

No License, Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 9 / 25

Maturity 8 / 25

Community 15 / 25


Language: Python

License: none

Higher-rated alternatives

yichuan-w/LEANN

[MLsys2026]: RAG on Everything with LEANN. Enjoy 97% storage savings while running a fast,...

byerlikaya/SmartRAG

Multi-Modal RAG for .NET — query databases, documents, images and audio in natural language....

aws-samples/layout-aware-document-processing-and-retrieval-augmented-generation

Advanced document extraction and chunking techniques for retrieval augmented generation that is...

sourangshupal/simple-rag-langchain

Exploring the Basics of Langchain

sion42x/llama-index-milvus-example

Open AI APIs with Llama Index and Milvus Vector DB for Retrieval Augmented Generation (RAG) testing
