vintasoftware/entity-embed
PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.
This tool helps data professionals identify duplicate records or link related entries across datasets such as customer lists or product catalogs. It takes raw records describing entities (e.g., company names, product descriptions) and transforms them into vectors, so that likely matches among thousands of records can be retrieved with an approximate nearest-neighbor search. This speeds up the first step of record cleaning or data integration: generating candidate matches.
161 stars. No commits in the last 6 months. Available on PyPI.
Use this if you need to quickly find potential duplicate or matching records across very large datasets and care most about recall: catching almost all possible matches, even if some candidates are filtered out in a later step.
Not ideal if your primary goal is to produce only perfect, confirmed matches without needing to review a list of highly similar candidates first.
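The candidate-generation idea described above can be illustrated with a minimal sketch. This is not entity-embed's API: the function names (`ngram_vector`, `cosine`, `candidate_pairs`), the sample records, and the `k` and `threshold` values are all hypothetical. Character trigram counts stand in for the learned embeddings, and a brute-force scan stands in for the Approximate Nearest Neighbors index, so the example only shows the shape of the workflow.

```python
import math
from collections import Counter


def ngram_vector(text, n=3):
    """Toy stand-in for a learned embedding: unit-length character n-gram counts."""
    padded = f" {text.lower().strip()} "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()}


def cosine(u, v):
    """Dot product of two sparse unit vectors, which equals their cosine similarity."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    return sum(w * v.get(gram, 0.0) for gram, w in u.items())


def candidate_pairs(records, k=2, threshold=0.5):
    """Return sorted (i, j, similarity) candidates from each record's k nearest neighbors.

    Brute-force search for clarity; an ANN index would replace the inner scan
    on large datasets. The threshold and k are illustrative assumptions.
    """
    vectors = [ngram_vector(r) for r in records]
    pairs = {}
    for i, u in enumerate(vectors):
        scored = sorted(
            ((cosine(u, v), j) for j, v in enumerate(vectors) if j != i),
            reverse=True,
        )
        for sim, j in scored[:k]:
            if sim >= threshold:
                pairs[(min(i, j), max(i, j))] = round(sim, 3)
    return sorted((i, j, sim) for (i, j), sim in pairs.items())


# Hypothetical sample data, not taken from any real dataset.
records = [
    "Vinta Software Ltd.",
    "Vinta Software Limited",
    "Acme Corporation",
    "ACME Corp",
    "Globex Industries",
]
pairs = candidate_pairs(records)
for i, j, sim in pairs:
    print(f"{records[i]!r} <-> {records[j]!r}  sim={sim}")
```

The output is a list of candidate pairs, not confirmed matches. A downstream step such as a pairwise classifier or manual review would still decide which candidates are true duplicates, which is the trade-off the two notes above describe.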
Stars: 161
Forks: 16
Language: Jupyter Notebook
License: MIT
Category:
Last pushed: Nov 18, 2022
Commits (30d): 0
Dependencies: 12
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/vintasoftware/entity-embed"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
Higher-rated alternatives
MilaNLProc/contextualized-topic-models
A python package to run contextualized topic modeling. CTMs combine contextualized embeddings...
vinid/cade
Compass-aligned Distributional Embeddings. Align embeddings from different corpora
spcl/ncc
Neural Code Comprehension: A Learnable Representation of Code Semantics
criteo-research/CausE
Code for the RecSys 2018 paper entitled Causal Embeddings for Recommendation.
ina-foss/twembeddings
Sentence embeddings for unsupervised event detection in the Twitter stream: study on English and...