vintasoftware/entity-embed

PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.

48 / 100 (Emerging)

This tool helps data professionals efficiently identify duplicate records or link related entries across different datasets, such as customer lists or product catalogs. It takes raw records describing entities (e.g., company names, product descriptions) and transforms each one into a numerical vector. Similar entities end up with similar vectors, so you can quickly retrieve potential matches among thousands of records with a nearest-neighbor search, which makes the first step of record cleaning or data integration much faster.

161 stars. No commits in the last 6 months. Available on PyPI.

Use this if you need to quickly find a large number of potential duplicate or matching records across very large datasets and you prioritize recall: finding nearly all possible matches, even if some candidates are discarded in a later review step.

Not ideal if your primary goal is to produce only perfect, confirmed matches without needing to review a list of highly similar candidates first.
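The candidate-generation idea described above can be illustrated with a minimal sketch. It does not use entity-embed's own API: as a stand-in for learned embeddings it uses character trigram counts, and as a stand-in for Approximate Nearest Neighbors it uses an exhaustive cosine-similarity search. The function names (`embed`, `cosine`, `candidate_pairs`), the sample records, and the `k` and `threshold` values are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def embed(text, n=3):
    """Map a string to a sparse vector of character n-gram counts.

    Stand-in for a learned embedding; assumption for illustration only.
    """
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidate_pairs(records, k=2, threshold=0.3):
    """Return index pairs of likely matches.

    For each record, keep its k most similar neighbors whose similarity
    reaches the threshold. This is an exact (brute-force) search, used
    here in place of an approximate nearest-neighbor index.
    """
    vectors = [embed(r) for r in records]
    pairs = set()
    for i, vec in enumerate(vectors):
        scored = sorted(
            ((cosine(vec, other), j) for j, other in enumerate(vectors) if j != i),
            reverse=True,
        )
        for score, j in scored[:k]:
            if score >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


# Hypothetical sample data.
records = [
    "Acme Corporation",
    "ACME Corp.",
    "Globex Industries",
    "Globex Industries Ltd",
    "Initech",
]
pairs = candidate_pairs(records)
for i, j in pairs:
    print(records[i], "<->", records[j])
```

A low threshold and a generous `k` favor recall, which matches the use case above: the output is a list of candidates to review or pass to a stricter matcher, not a list of confirmed matches.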

data-matching record-linkage entity-resolution data-deduplication data-stewardship
Stale 6m
Maintenance 0 / 25
Adoption 10 / 25
Maturity 25 / 25
Community 13 / 25


Stars: 161
Forks: 16
Language: Jupyter Notebook
License: MIT
Last pushed: Nov 18, 2022
Commits (30d): 0
Dependencies: 12

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/vintasoftware/entity-embed"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
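The same request can be made from Python using only the standard library. This is a sketch under stated assumptions: the listing does not document the response format, so the code assumes the endpoint returns JSON, and the helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/embeddings"


def quality_url(owner, repo):
    """Build the endpoint URL for a given repository."""
    return f"{BASE_URL}/{owner}/{repo}"


def fetch_quality(owner, repo, timeout=10):
    """Fetch the quality data for a repository.

    Assumption: the response body is JSON; this is not confirmed by the listing.
    """
    with urllib.request.urlopen(quality_url(owner, repo), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (performs a network request and counts against the daily limit):
# data = fetch_quality("vintasoftware", "entity-embed")
```

Keeping URL construction separate from the network call makes it easy to check the URL without spending one of the 100 keyless requests per day.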