derrickburns/generalized-kmeans-clustering
Production-ready K-Means clustering for Apache Spark with pluggable Bregman divergences (KL, Itakura-Saito, L1, etc.). 6 algorithms, 740 tests, cross-version persistence. A drop-in replacement for MLlib with mathematically correct distance functions for probability distributions, spectral data, and count data.
This tool helps data scientists and machine learning engineers group similar data points in large datasets. You feed it raw data, such as probability distributions or spectral readings, and it returns clusters of related data along with an assignment of each data point to its group. It is well suited to tasks that require sophisticated grouping of complex data.
Use this if you need to group vast amounts of specialized data, such as probability distributions or spectral data, using mathematically precise distance measures, and you're working within the Apache Spark ecosystem.
Not ideal if your data is simple and Euclidean distance is sufficient, or if you are not operating on large datasets with Apache Spark.
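The point about "mathematically precise distance measures" is that squared Euclidean distance, which classic k-means assumes, is a poor fit for probability distributions, whereas a Bregman divergence such as Kullback-Leibler is the natural choice. A minimal self-contained sketch of the two distance functions (these helpers are illustrative, not the library's API):

```scala
// Sketch: why pluggable divergences matter for probability distributions.
// sqEuclidean is the distance classic k-means assumes; kl is the KL
// divergence, the Bregman divergence generated by negative entropy.
object BregmanSketch {
  // Squared Euclidean distance between two vectors.
  def sqEuclidean(p: Array[Double], q: Array[Double]): Double =
    p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum

  // Kullback-Leibler divergence for probability vectors (entries > 0, sum to 1).
  def kl(p: Array[Double], q: Array[Double]): Double =
    p.zip(q).map { case (a, b) => a * math.log(a / b) }.sum

  def main(args: Array[String]): Unit = {
    val p = Array(0.7, 0.2, 0.1)
    val q = Array(0.4, 0.4, 0.2)
    // Unlike squared Euclidean, KL is asymmetric: D(p||q) != D(q||p),
    // so cluster centers must be computed accordingly.
    println(f"KL(p||q)    = ${kl(p, q)}%.4f")
    println(f"KL(q||p)    = ${kl(q, p)}%.4f")
    println(f"sqEuclidean = ${sqEuclidean(p, q)}%.4f")
  }
}
```

Clustering under KL divergence still yields a k-means-style algorithm (Banerjee et al.'s Bregman clustering), but the centroid and assignment steps use the divergence above instead of Euclidean distance.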
Stars: 342
Forks: 53
Language: Scala
License: Apache-2.0
Category:
Last pushed: Feb 14, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/derrickburns/generalized-kmeans-clustering"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
Related tools
TorchDR/TorchDR
TorchDR - PyTorch Dimensionality Reduction
abhilash1910/ClusterTransformer
Topic clustering library built on Transformer embeddings and cosine similarity...
md-experiments/picture_text
Interactive tree-maps with SBERT & Hierarchical Clustering (HAC)
mainlp/semantic_components
Finding semantic components in your neural representations.
scientist-labs/clusterkit
High-performance UMAP dimensionality reduction for Ruby, powered by the annembed Rust crate....