lensacom/sparkit-learn
PySpark + Scikit-learn = Sparkit-learn
Sparkit-learn lets data scientists and machine learning engineers apply familiar scikit-learn models and data transformations to datasets too large for a single machine, using the distributed processing power of Apache Spark. Raw input, such as text documents or numerical arrays, is wrapped in specialized distributed data structures (ArrayRDD, SparseRDD, DictRDD) whose blocks are processed in parallel across a cluster, yielding a distributed dataset ready for large-scale machine learning.
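The blocking idea behind these structures can be illustrated without a Spark cluster. The sketch below is conceptual, not the real splearn API: it partitions an array into fixed-size row blocks and applies a scikit-learn-style transform to each block independently, which is what an ArrayRDD does across Spark partitions.

```python
# Conceptual sketch (NOT the splearn API): ArrayRDD-style blocking splits a
# large array into fixed-size row blocks so each block can be transformed
# independently, the way Spark schedules partitions across a cluster.
import numpy as np

def to_blocks(data, block_size):
    """Split a 2-D array into a list of row blocks of at most block_size rows."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def blockwise_transform(blocks, transform):
    """Apply a per-block transform (e.g. a fitted scaler) to every block."""
    return [transform(block) for block in blocks]

data = np.arange(20, dtype=float).reshape(10, 2)
blocks = to_blocks(data, block_size=4)               # 3 blocks: 4 + 4 + 2 rows
scaled = blockwise_transform(blocks, lambda b: b * 2.0)
result = np.vstack(scaled)                           # reassembled output
```

In the real library the list of blocks is a Spark RDD, so each block lives on a different executor; the reassembly step usually stays distributed rather than being collected with `vstack`.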
1,151 stars. No commits in the last 6 months. Available on PyPI.
Use this if you are a data scientist or machine learning engineer who needs to run scikit-learn compatible machine learning workflows on datasets that are too large to fit into a single machine's memory and require distributed processing with PySpark.
Not ideal if your datasets are small enough to be processed by scikit-learn on a single machine, or if you prefer a different distributed computing framework.
Stars: 1,151
Forks: 255
Language: Python
License: Apache-2.0
Category:
Last pushed: Dec 31, 2020
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/lensacom/sparkit-learn"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000 requests/day.
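The same endpoint can be called from Python. This is a minimal sketch: the URL pattern (`/api/v1/quality/<category>/<owner>/<repo>`) is taken from the curl example above, while the shape of the returned JSON is an assumption.

```python
# Sketch of calling the quality endpoint from the curl example; only the
# URL pattern is taken from the docs, the response fields are not specified.
import json
from urllib.request import urlopen

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category, owner, repo):
    """Build the endpoint URL for one repository."""
    return f"{BASE}/{category}/{owner}/{repo}"

def fetch_quality(category, owner, repo):
    """Fetch and decode the quality record (network call; stays within the
    100 requests/day anonymous limit, no key required)."""
    with urlopen(quality_url(category, owner, repo)) as resp:
        return json.load(resp)

url = quality_url("ml-frameworks", "lensacom", "sparkit-learn")
```

Calling `fetch_quality("ml-frameworks", "lensacom", "sparkit-learn")` performs the same request as the curl command.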
Related frameworks
Angel-ML/angel
A Flexible and Powerful Parameter Server for large-scale machine learning
flink-extended/dl-on-flink
Deep Learning on Flink aims to integrate Flink and deep learning frameworks (e.g. TensorFlow,...
MingChen0919/learning-apache-spark
Notes on Apache Spark (pyspark)
mahmoudparsian/data-algorithms-book
MapReduce, Spark, Java, and Scala for Data Algorithms Book
endymecy/spark-ml-source-analysis
Analysis of the principles behind Spark ML algorithms and their concrete source-code implementations