Aavache/LLMWebCrawler

An LLM-based web crawler implemented with Ray and Hugging Face. Embeddings are saved to a vector database for fast clustering and retrieval, so the output can serve as the retrieval store for a RAG pipeline.

Score: 32 / 100 (Emerging)

This tool helps data engineers and machine learning practitioners build specialized search applications by collecting and processing large amounts of web data. It takes a list of seed URLs, follows the links on each page, extracts the text, and converts it into numerical representations called embeddings. The output is a searchable database of page text and the corresponding embeddings, which can be used to find similar web content quickly.
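The crawl, extract, embed, and search flow described above can be sketched in a few lines. This is not the repository's code: the project uses Ray, a Hugging Face model, and a vector database, whereas this sketch substitutes a single-process loop, a toy bag-of-words vector, and an in-memory dict so it runs with the standard library alone. `PageParser`, `toy_embed`, `crawl`, `search`, and `FAKE_WEB` are hypothetical names introduced for illustration.

```python
import math
from collections import Counter, deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects text chunks and href targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def toy_embed(text):
    """Sparse, L2-normalised bag-of-words vector (stand-in for a real model)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


def crawl(seed_urls, fetch, max_pages=10):
    """Breadth-first crawl from the seeds; returns {url: (text, embedding)}."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    index = {}
    while queue and len(index) < max_pages:
        url = queue.popleft()
        html = fetch(url)  # fetch is injected so the sketch needs no network
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        text = " ".join(parser.text_parts)
        index[url] = (text, toy_embed(text))
        for href in parser.links:
            target = urljoin(url, href)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return index


def search(index, query, top_k=3):
    """Rank stored pages by cosine similarity to the query vector."""
    q = toy_embed(query)
    scored = [
        (sum(w * emb.get(tok, 0.0) for tok, w in q.items()), url)
        for url, (_, emb) in index.items()
    ]
    return [url for _, url in sorted(scored, reverse=True)[:top_k]]


# Tiny in-memory "web" (made-up pages) standing in for real HTTP fetches.
FAKE_WEB = {
    "https://example.com/": '<a href="/cats">cats</a> <a href="/ray">ray</a> home page',
    "https://example.com/cats": "<p>cats purr and sleep all day</p>",
    "https://example.com/ray": "<p>ray schedules distributed python tasks</p>",
}

index = crawl(["https://example.com/"], FAKE_WEB.get)
print(search(index, "distributed python tasks", top_k=1))
```

In a real deployment each stage would be swapped for its heavier counterpart: an HTTP client for `fetch`, a sentence-embedding model for `toy_embed`, and a vector database for the `index` dict.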

No commits in the last 6 months.

Use this if you need to ingest web content at scale and organize it in a way that allows for semantic search and retrieval based on content similarity.

Not ideal if you're looking for a simple, off-the-shelf web scraping tool without needing to work with text embeddings or vector databases.

information-retrieval web-data-collection search-engine-building semantic-search data-engineering
No License, Stale 6m, No Package, No Dependents
Maintenance 0 / 25
Adoption 9 / 25
Maturity 8 / 25
Community 15 / 25

Stars: 98
Forks: 13
Language: Python
License: none listed
Category: local-rag-stacks
Last pushed: Oct 15, 2023
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/vector-db/Aavache/LLMWebCrawler"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
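The same request can be made from Python with the standard library. This is a minimal sketch under two assumptions: the three path segments after `/quality/` are read here as category, owner, and repository (inferred only from the curl example), and the response format is not documented on this page, so the body is parsed as JSON when possible and returned as raw text otherwise. `quality_url` and `fetch_quality` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the endpoint URL; segment meanings are assumed from the curl example."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category, owner, repo, timeout=10):
    """GET the endpoint; return parsed JSON if the body is JSON, else raw text."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return body
```

Calling `fetch_quality("vector-db", "Aavache", "LLMWebCrawler")` issues the request shown in the curl command and counts against the daily limit.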