jawerty/html2vec

Vectorize HTML files and generate embeddings with structural and semantic expression (WIP)

20 / 100 Experimental

This tool helps data analysts and researchers transform raw HTML files into numerical representations (vectors or embeddings) that capture both the content and the structure of web pages. It takes a directory of HTML files as input and outputs a matrix of these numerical representations, making it easier to analyze and compare web content programmatically. This is useful for tasks like content classification, similarity detection, or trend analysis across many web pages.
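
To make the "HTML in, matrix out" idea concrete, here is a minimal JavaScript sketch. It is not html2vec's actual algorithm or API, which this page does not describe. The names `TAGS`, `TEXT_BUCKETS`, `vectorizeHtml`, and `vectorizeAll` are hypothetical, and the features (tag counts for structure, a hashed bag of words for content) are a stand-in technique chosen only for illustration.

```javascript
// Toy illustration only: NOT html2vec's real method or interface.
// Hypothetical tag vocabulary used for the structural part of the vector.
const TAGS = ["div", "p", "a", "h1", "ul", "li", "img", "table"];
// Hypothetical number of buckets for the hashed bag-of-words part.
const TEXT_BUCKETS = 8;

// Deterministic string hash reduced to a bucket index.
function bucketOf(word) {
  let h = 0;
  for (const ch of word) {
    h = (h * 31 + ch.codePointAt(0)) >>> 0; // keep h in unsigned 32-bit range
  }
  return h % TEXT_BUCKETS;
}

// Turn one HTML string into a fixed-length numeric vector:
// [count per tag in TAGS..., word count per hash bucket...]
function vectorizeHtml(html) {
  const structural = new Array(TAGS.length).fill(0);
  // Count opening tags with a regex; a real tool would use an HTML parser.
  for (const m of html.matchAll(/<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/g)) {
    const i = TAGS.indexOf(m[1].toLowerCase());
    if (i !== -1) structural[i] += 1;
  }

  const semantic = new Array(TEXT_BUCKETS).fill(0);
  // Strip tags, then hash each lowercase word into a bucket.
  const text = html.replace(/<[^>]*>/g, " ").toLowerCase();
  for (const w of text.match(/[a-z0-9]+/g) || []) {
    semantic[bucketOf(w)] += 1;
  }
  return structural.concat(semantic);
}

// One row per document, so the result is a documents-by-features matrix.
function vectorizeAll(htmlDocs) {
  return htmlDocs.map(vectorizeHtml);
}
```

In practice the input strings would come from reading each file in the directory. Every row has the same length (`TAGS.length + TEXT_BUCKETS`), which is what makes the rows comparable for similarity or classification tasks.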

No commits in the last 6 months.

Use this if you need to convert a collection of HTML documents into a machine-readable, numerical format for further analytical processing.

Not ideal if you only need to extract text or specific data points from HTML without considering the structural relationships of the page elements.

web-content-analysis data-mining digital-analytics information-retrieval document-vectorization
No License Stale 6m No Package No Dependents
Maintenance 0 / 25
Adoption 5 / 25
Maturity 8 / 25
Community 7 / 25

Stars 11
Forks 1
Language JavaScript
License None
Last pushed Feb 16, 2023
Commits (30d) 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/jawerty/html2vec"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
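
The same request can be made from JavaScript. The sketch below assumes a runtime with a global `fetch` (for example Node 18 or later, or a browser). The helper names `qualityUrl` and `fetchQuality` and the parameter names `category`, `owner`, and `repo` are hypothetical labels for the path segments seen in the curl example. The response format is not documented on this page, so the body is returned as raw text rather than parsed.

```javascript
// Base URL copied from the curl example above.
const API_BASE = "https://pt-edge.onrender.com/api/v1/quality";

// Build the endpoint URL from its path segments.
// The segment names (category, owner, repo) are assumptions for illustration.
function qualityUrl(category, owner, repo) {
  const segments = [category, owner, repo].map(encodeURIComponent);
  return [API_BASE, ...segments].join("/");
}

// Fetch the quality data and return the raw response body as text.
async function fetchQuality(category, owner, repo) {
  const res = await fetch(qualityUrl(category, owner, repo));
  if (!res.ok) {
    throw new Error(`Request failed with HTTP ${res.status}`);
  }
  return res.text();
}
```

Calling `fetchQuality("embeddings", "jawerty", "html2vec")` requests the same URL as the curl command.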