jawerty/html2vec
Vectorize HTML files and generate embeddings with structural and semantic expression (WIP)
This tool helps data analysts and researchers transform raw HTML files into numerical representations (vectors or embeddings) that capture both the content and the structure of web pages. It takes a directory of HTML files as input and outputs a matrix of these numerical representations, making it easier to analyze and compare web content programmatically. This is useful for tasks like content classification, similarity detection, or trend analysis across many web pages.
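To make the idea of "HTML in, matrix out" concrete, here is a minimal sketch in JavaScript. It is not html2vec's API or algorithm: the function names (`tagCounts`, `vectorize`, `cosine`) are hypothetical, the structural signal is reduced to simple opening-tag counts, and tags are found with a rough regular expression rather than a real HTML parser. It works on in-memory strings instead of a directory of files to stay self-contained.

```javascript
// Hypothetical illustration only; html2vec's real interface and method may differ.

// Count opening tags in one HTML string (rough regex, not a full HTML parser).
function tagCounts(html) {
  const counts = new Map();
  const re = /<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/g; // skips closing tags like </p>
  let m;
  while ((m = re.exec(html)) !== null) {
    const tag = m[1].toLowerCase();
    counts.set(tag, (counts.get(tag) || 0) + 1);
  }
  return counts;
}

// Turn a list of HTML strings into a matrix: one row per document,
// one column per tag name in the shared, sorted vocabulary.
function vectorize(docs) {
  const perDoc = docs.map(tagCounts);
  const vocab = [...new Set(perDoc.flatMap((c) => [...c.keys()]))].sort();
  const matrix = perDoc.map((c) => vocab.map((t) => c.get(t) || 0));
  return { vocab, matrix };
}

// Cosine similarity between two equal-length vectors (0 if either is all zeros).
function cosine(a, b) {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

const docs = [
  "<html><body><h1>A</h1><p>x</p><p>y</p></body></html>",
  "<html><body><h1>B</h1><p>z</p></body></html>",
];
const { vocab, matrix } = vectorize(docs);
console.log(vocab);
console.log(matrix);
console.log(cosine(matrix[0], matrix[1]));
```

Once pages are rows in a matrix like this, comparing two pages becomes a vector operation such as cosine similarity, which is the kind of downstream analysis the description refers to. A real embedding would also need to capture text content and element nesting, which this sketch ignores.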
No commits in the last 6 months.
Use this if you need to convert a collection of HTML documents into a machine-readable, numerical format for further analytical processing.
Not ideal if you only need to extract text or specific data points from HTML without considering the structural relationships of the page elements.
Stars: 11
Forks: 1
Language: JavaScript
License: not specified
Category:
Last pushed: Feb 16, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/jawerty/html2vec"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
Azure/azure-search-vector-samples
A repository of code samples for Vector search capabilities in Azure AI Search.
curiosity-ai/catalyst
🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's...
supabase/embeddings-generator
GitHub Action to generate embeddings from the markdown files in your repository.
vector-ai/vectorai
Vector AI — A platform for building vector based applications. Encode, query and analyse data...
wagtail/wagtail-vector-index
Store Wagtail pages & Django models as embeddings in vector databases