rragundez/chunkdot
Multi-threaded matrix multiplication and cosine similarity calculations for dense and sparse matrices. Appropriate for calculating the K most similar items for a large number of items by chunking the item matrix representation (embeddings) and using Numba to accelerate the calculations.
This tool helps data scientists and machine learning engineers efficiently find the most similar items within very large datasets. You input item representations, often called 'embeddings,' which can be either dense or sparse, and it outputs a list of the top K most similar (or dissimilar) items for each item in your dataset. This is particularly useful for tasks like recommendation systems or information retrieval.
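To make the chunking idea concrete, here is a minimal sketch in plain NumPy. It is not chunkdot's implementation or API: the function name `top_k_cosine_chunked` and its parameters are hypothetical, it handles dense input only, and it is single-threaded with no Numba acceleration. It only illustrates why processing the item matrix in row chunks keeps memory bounded while still producing the top K neighbours of every item.

```python
import numpy as np


def top_k_cosine_chunked(embeddings, top_k=5, chunk_size=1024):
    """Return (indices, scores) of the top_k most similar rows for every row.

    Hypothetical illustration only: plain NumPy, single-threaded, dense input.
    """
    x = np.asarray(embeddings, dtype=np.float64)
    n_items = x.shape[0]
    if not 0 < top_k <= n_items:
        raise ValueError("top_k must be between 1 and the number of items")

    # Normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep all-zero rows as zeros instead of dividing by 0
    x = x / norms

    indices = np.empty((n_items, top_k), dtype=np.int64)
    scores = np.empty((n_items, top_k), dtype=np.float64)

    for start in range(0, n_items, chunk_size):
        stop = min(start + chunk_size, n_items)
        # Only a (chunk, n_items) block of similarities is in memory at once.
        sims = x[start:stop] @ x.T
        # argpartition selects the top_k columns per row without a full sort.
        part = np.argpartition(sims, n_items - top_k, axis=1)[:, -top_k:]
        part_scores = np.take_along_axis(sims, part, axis=1)
        # Order the selected top_k from most to least similar.
        order = np.argsort(-part_scores, axis=1)
        indices[start:stop] = np.take_along_axis(part, order, axis=1)
        scores[start:stop] = np.take_along_axis(part_scores, order, axis=1)

    return indices, scores


rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 64))  # made-up data for the example
idx, sc = top_k_cosine_chunked(emb, top_k=5, chunk_size=128)
print(idx.shape)  # (1000, 5)
```

In this sketch each item's best match is normally itself, with a similarity of 1, so a caller interested only in other items would request one extra neighbour and drop the first column.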
No commits in the last 6 months. Available on PyPI.
Use this if you need to calculate the top K most similar items for a large number of items and want to do so quickly and memory-efficiently, even with datasets containing hundreds of thousands or millions of items.
Not ideal if your dataset is small or if you need the similarity score for every pair of items rather than just the top K per item.
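A rough back-of-the-envelope calculation shows why keeping only the top K matters at this scale. The sketch below assumes 8-byte floating-point scores and 8-byte integer indices, and the sizes used (one million items, K of 50) are illustrative values, not figures from the project. The helper names are hypothetical.

```python
def dense_similarity_bytes(n_items, bytes_per_value=8):
    """Bytes needed to hold every pairwise score in a dense n x n matrix."""
    return n_items * n_items * bytes_per_value


def top_k_result_bytes(n_items, top_k, bytes_per_value=8, bytes_per_index=8):
    """Bytes needed to hold only top_k scores plus their indices per item."""
    return n_items * top_k * (bytes_per_value + bytes_per_index)


n_items, top_k = 1_000_000, 50  # illustrative sizes, not from the project
print(dense_similarity_bytes(n_items) / 1e12)    # 8.0 (terabytes)
print(top_k_result_bytes(n_items, top_k) / 1e9)  # 0.8 (gigabytes)
```

Under these assumptions the full pairwise matrix is about four orders of magnitude larger than the top K result, which is why the full matrix is only practical for small datasets.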
Stars: 86
Forks: 5
Language: Python
License: MIT
Category:
Last pushed: Dec 28, 2024
Commits (30d): 0
Dependencies: 5
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/rragundez/chunkdot"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
Azure/azure-search-vector-samples
A repository of code samples for Vector search capabilities in Azure AI Search.
curiosity-ai/catalyst
🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's...
supabase/embeddings-generator
GitHub Action to generate embeddings from the markdown files in your repository.
vector-ai/vectorai
Vector AI — A platform for building vector based applications. Encode, query and analyse data...
wagtail/wagtail-vector-index
Store Wagtail pages & Django models as embeddings in vector databases