lensacom/sparkit-learn

PySpark + Scikit-learn = Sparkit-learn

Score: 60 / 100 (Established)

This project helps data scientists and machine learning engineers apply familiar scikit-learn models and data transformations to very large datasets, leveraging the distributed processing power of Apache Spark. It takes your raw data, like text documents or numerical arrays, and transforms it into specialized distributed data structures (ArrayRDD, SparseRDD, DictRDD) that can be processed across a cluster. The output is a processed distributed dataset ready for large-scale machine learning.
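The blockwise idea behind ArrayRDD can be sketched without Spark at all: split the data into blocks and apply a scikit-learn transformer to each block independently. This is a conceptual illustration in plain NumPy/scikit-learn, not sparkit-learn's actual API; the stateless `HashingVectorizer` is chosen here because it produces identical results whether applied blockwise or to the whole dataset.

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer

# Conceptual sketch (no Spark required): sparkit-learn's ArrayRDD wraps
# an RDD whose partitions hold blocks of records so that scikit-learn
# transformers can run independently on each block. We mimic that by
# splitting a tiny corpus into two "partitions" and transforming each
# block separately.
corpus = [
    "spark makes big data simple",
    "scikit-learn makes machine learning simple",
    "distributed blocks of records",
    "each block is transformed independently",
]

# Stateless transformer: requires no fit, so blockwise application is safe.
vectorizer = HashingVectorizer(n_features=16, alternate_sign=False)

# Two blocks of two documents each, as a two-partition ArrayRDD would hold.
blocks = [corpus[:2], corpus[2:]]
block_results = [vectorizer.transform(block) for block in blocks]

# Stacking the per-block outputs matches transforming the whole corpus.
stacked = sp.vstack(block_results)
full = vectorizer.transform(corpus)
assert (stacked != full).nnz == 0
print(stacked.shape)  # (4, 16)
```

Stateful estimators (e.g. anything that must `fit` on global statistics) need the distributed implementations sparkit-learn itself provides; the blockwise trick alone only works for per-record transformations.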

1,151 stars. No commits in the last 6 months. Available on PyPI.

Use this if you are a data scientist or machine learning engineer who needs to run scikit-learn-compatible machine learning workflows on datasets too large to fit into a single machine's memory, using PySpark for distributed processing.

Not ideal if your datasets are small enough to be processed by scikit-learn on a single machine, or if you prefer a different distributed computing framework.

distributed-machine-learning large-scale-data-processing predictive-modeling text-feature-extraction big-data-analytics
Stale (6m) · No dependents

Maintenance: 0 / 25
Adoption: 10 / 25
Maturity: 25 / 25
Community: 25 / 25


Stars: 1,151
Forks: 255
Language: Python
License: Apache-2.0
Last pushed: Dec 31, 2020
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/lensacom/sparkit-learn"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
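The same endpoint can be called from Python with only the standard library. This sketch just builds the per-project URL from its path segments (mirroring the curl example above); the category/owner/repo structure of the path is inferred from that example, and the JSON schema of the response is not assumed here.

```python
from urllib.parse import urljoin

# Base of the quality API, taken from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/quality/"

def quality_url(category: str, owner: str, repo: str) -> str:
    # Join the category/owner/repo path segments onto the base URL.
    return urljoin(BASE, f"{category}/{owner}/{repo}")

url = quality_url("ml-frameworks", "lensacom", "sparkit-learn")
print(url)
```

Passing the resulting URL to `urllib.request.urlopen` (or `requests.get`) and decoding the body as JSON retrieves the same data as the curl command.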