lensacom/sparkit-learn

PySpark + Scikit-learn = Sparkit-learn

Score: 60 / 100 (Established)

This project helps data scientists and machine learning engineers apply familiar scikit-learn models and data transformations to very large datasets, leveraging the distributed processing power of Apache Spark. It takes your raw data, like text documents or numerical arrays, and transforms it into specialized distributed data structures (ArrayRDD, SparseRDD, DictRDD) that can be processed across a cluster. The output is a processed distributed dataset ready for large-scale machine learning.
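The blockwise idea behind ArrayRDD can be sketched without Spark at all: split the data into blocks and apply a scikit-learn transformer to each block independently. This is a conceptual illustration in plain NumPy/scikit-learn, not sparkit-learn's actual API; the stateless `HashingVectorizer` is chosen here because it produces identical results whether applied blockwise or to the whole dataset.

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer

# Conceptual sketch (no Spark required): sparkit-learn's ArrayRDD wraps
# an RDD whose partitions hold blocks of records so that scikit-learn
# transformers can run independently on each block. We mimic that by
# splitting a tiny corpus into two "partitions" and transforming each
# block separately.
corpus = [
    "spark makes big data simple",
    "scikit-learn makes machine learning simple",
    "distributed blocks of records",
    "each block is transformed independently",
]

# Stateless transformer: requires no fit, so blockwise application is safe.
vectorizer = HashingVectorizer(n_features=16, alternate_sign=False)

# Two blocks of two documents each, as a two-partition ArrayRDD would hold.
blocks = [corpus[:2], corpus[2:]]
block_results = [vectorizer.transform(block) for block in blocks]

# Stacking the per-block outputs matches transforming the whole corpus.
stacked = sp.vstack(block_results)
full = vectorizer.transform(corpus)
assert (stacked != full).nnz == 0
print(stacked.shape)  # (4, 16)
```

Stateful estimators (e.g. anything that must `fit` on global statistics) need the distributed implementations sparkit-learn itself provides; the blockwise trick alone only works for per-record transformations.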

1,151 stars. No commits in the last 6 months. Available on PyPI.

Use this if you are a data scientist or machine learning engineer who needs to run scikit-learn-compatible machine learning workflows on datasets too large to fit into a single machine's memory, using PySpark for distributed processing.

Not ideal if your datasets are small enough to be processed by scikit-learn on a single machine, or if you prefer a different distributed computing framework.

distributed-machine-learning large-scale-data-processing predictive-modeling text-feature-extraction big-data-analytics
Stale (6m) · No dependents

Maintenance: 0 / 25
Adoption: 10 / 25
Maturity: 25 / 25
Community: 25 / 25


Stars: 1,151
Forks: 255
Language: Python
License: Apache-2.0
Last pushed: Dec 31, 2020
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/lensacom/sparkit-learn"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
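The same endpoint can be called from Python with only the standard library. This sketch just builds the per-project URL from its path segments (mirroring the curl example above); the category/owner/repo structure of the path is inferred from that example, and the JSON schema of the response is not assumed here.

```python
from urllib.parse import urljoin

# Base of the quality API, taken from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/quality/"

def quality_url(category: str, owner: str, repo: str) -> str:
    # Join the category/owner/repo path segments onto the base URL.
    return urljoin(BASE, f"{category}/{owner}/{repo}")

url = quality_url("ml-frameworks", "lensacom", "sparkit-learn")
print(url)
```

Passing the resulting URL to `urllib.request.urlopen` (or `requests.get`) and decoding the body as JSON retrieves the same data as the curl command.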