Spark Hadoop ML Pipelines ML Frameworks
Distributed machine learning frameworks and tools built on Apache Spark, Hadoop, or similar big data processing systems for large-scale data processing. Does NOT include standalone ML libraries, REST API wrappers without distributed computation, or Spring Boot microservices without core data processing components.
81 Spark/Hadoop ML pipeline frameworks are tracked, of which 15 score above 50 (the Established tier). The highest-rated is lensacom/sparkit-learn at 60/100 with 1,151 stars.
Get all 81 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=spark-hadoop-ml-pipelines&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
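The same request can be made from a script. The following is a minimal Python sketch using only the standard library. The endpoint and the `domain`, `subcategory`, and `limit` query parameters are copied from the curl command above; the helper names `build_url` and `fetch_projects` are hypothetical, and the shape of the JSON response is not documented here, so the sketch returns the parsed payload without assuming any field names.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain, subcategory, limit=20):
    """Build the request URL (hypothetical helper, not part of the API)."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Send a keyless GET request and return the parsed JSON payload.

    The response schema is an unknown here, so the payload is returned
    as-is for the caller to inspect.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_projects(build_url("ml-frameworks", "spark-hadoop-ml-pipelines"))` issues the same request as the curl command. Whether a larger `limit` returns all 81 projects in one response is an assumption that should be checked against the API itself.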
| # | Framework | Description | Score | Tier |
|---|---|---|---|---|
| 1 | lensacom/sparkit-learn | PySpark + Scikit-learn = Sparkit-learn | | Established |
| 2 | Angel-ML/angel | A Flexible and Powerful Parameter Server for large-scale machine learning | | Established |
| 3 | flink-extended/dl-on-flink | Deep Learning on Flink aims to integrate Flink and deep learning frameworks... | | Established |
| 4 | tirthajyoti/Spark-with-Python | Fundamentals of Spark with Python (using PySpark), code examples | | Established |
| 5 | jadianes/spark-py-notebooks | Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine... | | Established |
| 6 | kaiwaehner/kafka-streams-machine-learning-examples | This project contains examples which demonstrate how to deploy analytic... | | Established |
| 7 | mahmoudparsian/data-algorithms-book | MapReduce, Spark, Java, and Scala for Data Algorithms Book | | Established |
| 8 | MingChen0919/learning-apache-spark | Notes on Apache Spark (pyspark) | | Established |
| 9 | databricks/spark-sklearn | (Deprecated) Scikit-learn integration package for Apache Spark | | Established |
| 10 | alibaba/Alink | Alink is the Machine Learning algorithm platform based on Flink, developed... | | Established |
| 11 | OryxProject/oryx | Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time... | | Established |
| 12 | endymecy/spark-ml-source-analysis | Analysis of Spark ML algorithm principles and their concrete source-code implementations | | Established |
| 13 | ShifuML/shifu | An end-to-end machine learning and data mining framework on Hadoop | | Established |
| 14 | kaiwaehner/ksql-udf-deep-learning-mqtt-iot | Deep Learning UDF for KSQL for Streaming Anomaly Detection of MQTT IoT Sensor Data | | Established |
| 15 | apache/flink-ml | Machine learning library of Apache Flink | | Established |
| 16 | kanyun-inc/ytk-learn | Ytk-learn is a distributed machine learning library which implements most of... | | Emerging |
| 17 | kaiwaehner/tensorflow-serving-java-grpc-kafka-streams | Kafka Streams + Java + gRPC + TensorFlow Serving => Stream Processing... | | Emerging |
| 18 | sparkling-graph/sparkling-graph | SparklingGraph provides easy to use set of features that will give you... | | Emerging |
| 19 | TodoEconometria/ejercicios-bigdata | Complete Big Data course with Python (230h): SQLite to Kafka to TensorFlow.... | | Emerging |
| 20 | kaiwaehner/ksql-fork-with-deep-learning-function | Deep Learning UDF for KSQL, the Streaming SQL Engine for Apache Kafka with... | | Emerging |
| 21 | ShifuML/guagua | An iterative computing framework for both Hadoop MapReduce and Hadoop YARN. | | Emerging |
| 22 | romain-e-lacoste/sparklen | A statistical learning toolkit for high-dimensional Hawkes processes in Python | | Emerging |
| 23 | XYWENJIE/spring-ai-extension | An extension of Spring AI that supports Alibaba Cloud’s dashscope... | | Emerging |
| 24 | siddhi-io/siddhi-execution-streamingml | Extension that performs streaming machine learning on event streams | | Emerging |
| 25 | SAP-samples/hana-apl-apis-runtimes | Code examples for SAP HANA Automated Predictive Library (APL). It provides... | | Emerging |
| 26 | sbl-sdsc/mmtf-spark | Methods for the parallel and distributed analysis and mining of the Protein... | | Emerging |
| 27 | viadee/bpmn.ai | Machine learning around business processes | | Emerging |
| 28 | arminmoin/ML-Quadrat | ML-Quadrat (ML2) is a Model-Driven Software Engineering (MDSE) tool with... | | Emerging |
| 29 | microsoft/masc | Microsoft's contributions for Spark with Apache Accumulo | | Emerging |
| 30 | feedzai/feedzai-openml | API for Feedzai's Open Machine Learning that allows to integrate ML... | | Emerging |
| 31 | siddhi-io/siddhi-execution-tensorflow | Extension that adds support for inferences from pre-built TensorFlow SavedModels | | Emerging |
| 32 | shalini0528/big-data-weather-analysis | Big Data weather analysis using Hadoop MapReduce, Apache Hive, Apache Spark,... | | Emerging |
| 33 | jiumao-org/we-mall | A lightweight mall, simple and easy. | | Emerging |
| 34 | adventure-island/springboot-deepar-template | A Java(SpringBoot) template for Java and AWS SageMaker DeepAR model endpoint... | | Emerging |
| 35 | predictiveworks/cdap-spark | A wrapper for Apache Spark to make machine & deep learning available in... | | Emerging |
| 36 | AlanBinu007/AI_Big-Data_Data-Engineering_and_Distributions | Here we created some projects using Kafka, AI, Data virtualization and... | | Emerging |
| 37 | mikeroyal/Apache-Spark-Guide | Apache Spark Guide | | Emerging |
| 38 | iaja/scalaLDAvis | Scala-Spark port of https://github.com/bmabey/pyLDAvis for Apache Spark LDA... | | Emerging |
| 39 | alipay/jpmml-sparkml-lightgbm | JPMML-SparkML plugin for converting LightGBM-Spark models to PMML | | Emerging |
| 40 | chen0040/java-machine-learning-web-api | A simple machine learning web server that caters for small datasets | | Emerging |
| 41 | AxaFrance/spring-ai-workshop | Exploring interactions with LLMs: Practical insights with Spring AI | | Emerging |
| 42 | almo/Machine-Learning | Machine Learning snippets and use cases. | | Emerging |
| 43 | manuparra/taller_SparkR | SparkR workshop for the R Users Conference (Jornadas de Usuarios de R) | | Emerging |
| 44 | senx/warp10-ext-pmml | WarpScript™ PMML Extension | | Experimental |
| 45 | DeathReaper0965/distributed-deeplearning | End to End Distributed Deep Learning Engine, works both with Streaming and... | | Experimental |
| 46 | pneff93/Kafka-R-Realtime-Prediction | This tutorial explains how a machine learning model is applied on real-time data | | Experimental |
| 47 | nicolaskrier/spring-ai-examples | Spring AI Examples | | Experimental |
| 48 | nickozoulis/thunderstorm | Investigating the trade-offs of low latency responses over quality when... | | Experimental |
| 49 | neerajkesav/SparkMLJavaExamples | Apache Spark Machine Learning - Java Examples | | Experimental |
| 50 | kriss024/Spark | Spark for Data Science and ETL process. | | Experimental |
| 51 | galafis/spark-kafka-ml-training-pipeline | Distributed ML training pipeline with Spark processing, Kafka ingestion and... | | Experimental |
| 52 | iamirmasoud/pyspark_tutorials | Machine Learning for Big Data using PySpark with real-world projects | | Experimental |
| 53 | TravelXML/APACHE-SPARK-PYSPARK-DATABRICKS-MACHINE-LEARNING-MLIB | Apache Spark Machine Learning project using MLlib and Linear Regression on... | | Experimental |
| 54 | agoda-com/spark-hpopt | Bayesian hyperparameter tuning for Spark MLlib | | Experimental |
| 55 | siddhi-io/siddhi-gpl-execution-pmml | Siddhi extension to evaluate Predictive Model Markup Language (PMML). | | Experimental |
| 56 | AmrrSalem/Pyspark-Local | Portable self-contained PySpark 3.5 environment for Big Data coursework,... | | Experimental |
| 57 | Chih-Ling-Hsu/Spark-Machine-Learning-Modules | Machine Learning Modules of Spark MLlib | | Experimental |
| 58 | alikemalocalan/Spark-API | Apache Spark Recommendation/Machine Learning Api Service | | Experimental |
| 59 | adil-faiyaz98/accelerated-spark-gpu | This repository demonstrates how to significantly accelerate Apache Spark 3... | | Experimental |
| 60 | daugraph/ParameterServer | Parameter Server using Java | | Experimental |
| 61 | GPalfy/socialnetworkcomments | Text Data Analysis & Machine Learning on supermarket's Social... | | Experimental |
| 62 | MinLee0210/kafka-learning | Learning how to use Kafka | | Experimental |
| 63 | Mazennaji/ai-intelligence-platform-java-ml | An all-in-one Java Machine Learning platform integrating fraud detection,... | | Experimental |
| 64 | hevc15hamza/pyspark-airfoil-noise-prediction | Predict airfoil self-noise using PySpark with an end-to-end machine learning... | | Experimental |
| 65 | Sowdeshwar-99/noise-aware-ml-pipeline | Noise-aware ML pipeline for large-scale agricultural yield prediction using... | | Experimental |
| 66 | Sishant123/scala-m9k | 🚀 Streamline big data processing with Scala and M9K, enhancing performance... | | Experimental |
| 67 | Swapnil-2596/scala-aba | 🚀 Transform Scala code into efficient, scalable applications with scala-aba,... | | Experimental |
| 68 | aengusmartindonaire/pyspark-ml-pipeline | PySpark ML classification pipelines for NLP, clinical prediction, and census... | | Experimental |
| 69 | mn-cs/fineweb-spark | FineWeb-Edu dataset analysis using Apache Spark - DSC 232R group project | | Experimental |
| 70 | hinzy97/spark-dynamic-executor-time-prediction | Neural Network Models for Predicting Execution Time with Dynamic Executor... | | Experimental |
| 71 | FadilAdz/praktikumBigData | This repository contains a series of Big Data lab exercises covering storage... | | Experimental |
| 72 | shakha-de/mnist-java-microservice | Spring Boot Microservice for MNIST | | Experimental |
| 73 | sivasurya681/PySpark | PySpark-Roadmap is an 18-day structured learning journey that takes you from... | | Experimental |
| 74 | balaghali/SparkML | Machine Learning with Spark | | Experimental |
| 75 | roll-w/ml-model-platform | A platform for training machine learning models based on SpringBoot. | | Experimental |
| 76 | ccoughlin/GristMill | Distributed Region Of Interest (ROI) finder | | Experimental |
| 77 | JohnSesana/PySpark-Cheat-Sheet | List of useful commands for Pyspark | | Experimental |
| 78 | urcuqui/Apache-Spark | This repository has some examples of Hadoop Spark | | Experimental |
| 79 | geosparks/geospark-android-sdk-example | Example app for GeoSpark Android SDK | | Experimental |
| 80 | johnfire/grendelmind | decision and analysis software for a robot | | Experimental |
| 81 | chouaib-629/hepmassClassification | PySpark pipeline for classifying particles in the physics of high... | | Experimental |