rragundez/chunkdot

Multi-threaded matrix multiplication and cosine similarity calculations for dense and sparse matrices. Appropriate for calculating the K most similar items for a large number of items by chunking the item matrix representation (embeddings) and using Numba to accelerate the calculations.

Overall score 42 / 100 (Emerging)

This tool helps data scientists and machine learning engineers efficiently find the most similar items within very large datasets. You input item representations, often called 'embeddings,' which can be either dense or sparse, and it outputs a list of the top K most similar (or dissimilar) items for each item in your dataset. This is particularly useful for tasks like recommendation systems or information retrieval.

No commits in the last 6 months. Available on PyPI.

Use this if you need to calculate the top K most similar items for a large number of items and want to do so quickly and memory-efficiently, even with datasets containing hundreds of thousands or millions of items.

Not ideal if your dataset is small or if you need to calculate exact similarity scores for every single pair of items rather than just the top K.
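The chunked top-K idea described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not chunkdot's implementation or API: the function names `normalize` and `top_k_cosine_chunked` and the `chunk_size` parameter are hypothetical, and a real workload would rely on vectorized, compiled code as the project does with Numba. The point it shows is that only one block of the similarity matrix (`chunk_size` rows) needs to exist in memory at a time.

```python
import heapq
import math


def normalize(rows):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        # Leave all-zero rows unchanged to avoid dividing by zero.
        out.append([x / norm for x in row] if norm else list(row))
    return out


def top_k_cosine_chunked(embeddings, top_k, chunk_size):
    """Hypothetical helper: top_k (score, index) pairs per item, best first.

    Each item's own index is included, since an item is maximally
    similar to itself.
    """
    unit = normalize(embeddings)
    results = []
    for start in range(0, len(unit), chunk_size):
        chunk = unit[start:start + chunk_size]
        # Similarity block for this chunk only: len(chunk) x n_items.
        block = [
            [sum(a * b for a, b in zip(row, other)) for other in unit]
            for row in chunk
        ]
        # Reduce the block to top_k entries per row before moving on,
        # so the full n_items x n_items matrix is never stored.
        for scores in block:
            pairs = ((score, j) for j, score in enumerate(scores))
            results.append(heapq.nlargest(top_k, pairs))
    return results


items = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
for i, neighbours in enumerate(top_k_cosine_chunked(items, top_k=2, chunk_size=3)):
    print(i, [j for _, j in neighbours])
```

A smaller `chunk_size` lowers peak memory at the cost of more loop iterations, which is the trade-off that makes this approach suitable for hundreds of thousands or millions of items.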

similarity-search recommendation-systems information-retrieval large-scale-data-analysis machine-learning-engineering
Stale 6m
Maintenance 0 / 25
Adoption 9 / 25
Maturity 25 / 25
Community 8 / 25


Stars 86
Forks 5
Language Python
License MIT
Last pushed Dec 28, 2024
Commits (30d) 0
Dependencies 5

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/rragundez/chunkdot"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
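For scripted use, the curl call above could be written in Python roughly as follows. Only the URL comes from this page. The helper names `quality_url` and `fetch_quality` are hypothetical, reading the path segments as category, owner, and repository is a guess from the example URL, and the assumption that the endpoint returns JSON is unverified.

```python
import json
import urllib.request

# Base path taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build an endpoint URL in the same shape as the curl example."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category, owner, repo, timeout=10):
    """Fetch the record for one repository.

    Assumption (unverified): the response body is JSON.
    """
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


print(quality_url("embeddings", "rragundez", "chunkdot"))
```

Calling `fetch_quality("embeddings", "rragundez", "chunkdot")` performs the network request, which counts against the daily request limit stated above.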