Spark Hadoop ML Pipelines ML Frameworks

Distributed machine learning frameworks and tools built on Apache Spark, Hadoop, or similar big data systems for large-scale data processing. The category does NOT include standalone ML libraries, REST API wrappers without distributed computation, or Spring Boot microservices without core data processing components.

There are 81 Spark/Hadoop ML pipeline frameworks tracked. Of these, 15 score 50 or above (the Established tier). The highest-rated is lensacom/sparkit-learn at 60/100 with 1,151 stars.

Get all 81 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=spark-hadoop-ml-pipelines&limit=20"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
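The same request can be made from a script. The following is a minimal sketch using only the Python standard library. The endpoint and query parameters are copied from the curl command above; the helper names `build_url` and `fetch_projects` are hypothetical, and the shape of the returned JSON is not documented here, so the sketch only decodes it without assuming any fields. Whether a larger `limit` returns all 81 projects in one response is also an assumption left to the caller to verify.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(limit=20):
    """Build the dataset URL with the same query parameters as the curl example."""
    params = {
        "domain": "ml-frameworks",
        "subcategory": "spark-hadoop-ml-pipelines",
        "limit": limit,
    }
    return f"{BASE_URL}?{urllib.parse.urlencode(params)}"


def fetch_projects(limit=20, timeout=30):
    """Request the dataset and decode the JSON body.

    The response structure is an unknown here, so the decoded
    object is returned as-is for the caller to inspect.
    """
    with urllib.request.urlopen(build_url(limit), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_projects()` performs one request, which counts against the daily quota described above.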

| # | Framework | Description | Score | Tier |
|---|---|---|---|---|
| 1 | lensacom/sparkit-learn | PySpark + Scikit-learn = Sparkit-learn | 60 | Established |
| 2 | Angel-ML/angel | A Flexible and Powerful Parameter Server for large-scale machine learning | 53 | Established |
| 3 | flink-extended/dl-on-flink | Deep Learning on Flink aims to integrate Flink and deep learning frameworks... | 51 | Established |
| 4 | tirthajyoti/Spark-with-Python | Fundamentals of Spark with Python (using PySpark), code examples | 51 | Established |
| 5 | jadianes/spark-py-notebooks | Apache Spark & Python (PySpark) tutorials for Big Data Analysis and Machine... | 51 | Established |
| 6 | kaiwaehner/kafka-streams-machine-learning-examples | This project contains examples which demonstrate how to deploy analytic... | 51 | Established |
| 7 | mahmoudparsian/data-algorithms-book | MapReduce, Spark, Java, and Scala for Data Algorithms Book | 51 | Established |
| 8 | MingChen0919/learning-apache-spark | Notes on Apache Spark (PySpark) | 51 | Established |
| 9 | databricks/spark-sklearn | (Deprecated) Scikit-learn integration package for Apache Spark | 51 | Established |
| 10 | alibaba/Alink | Alink is the Machine Learning algorithm platform based on Flink, developed... | 51 | Established |
| 11 | OryxProject/oryx | Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time... | 51 | Established |
| 12 | endymecy/spark-ml-source-analysis | Analysis of Spark ML algorithm principles and their source code implementations | 51 | Established |
| 13 | ShifuML/shifu | An end-to-end machine learning and data mining framework on Hadoop | 50 | Established |
| 14 | kaiwaehner/ksql-udf-deep-learning-mqtt-iot | Deep Learning UDF for KSQL for Streaming Anomaly Detection of MQTT IoT Sensor Data | 50 | Established |
| 15 | apache/flink-ml | Machine learning library of Apache Flink | 50 | Established |
| 16 | kanyun-inc/ytk-learn | Ytk-learn is a distributed machine learning library which implements most of... | 49 | Emerging |
| 17 | kaiwaehner/tensorflow-serving-java-grpc-kafka-streams | Kafka Streams + Java + gRPC + TensorFlow Serving => Stream Processing... | 48 | Emerging |
| 18 | sparkling-graph/sparkling-graph | SparklingGraph provides easy to use set of features that will give you... | 47 | Emerging |
| 19 | TodoEconometria/ejercicios-bigdata | Complete Big Data course with Python (230h): SQLite to Kafka to TensorFlow.... | 46 | Emerging |
| 20 | kaiwaehner/ksql-fork-with-deep-learning-function | Deep Learning UDF for KSQL, the Streaming SQL Engine for Apache Kafka with... | 46 | Emerging |
| 21 | ShifuML/guagua | An iterative computing framework for both Hadoop MapReduce and Hadoop YARN. | 46 | Emerging |
| 22 | romain-e-lacoste/sparklen | A statistical learning toolkit for high-dimensional Hawkes processes in Python | 45 | Emerging |
| 23 | XYWENJIE/spring-ai-extension | An extension of Spring AI that supports Alibaba Cloud’s dashscope... | 45 | Emerging |
| 24 | siddhi-io/siddhi-execution-streamingml | Extension that performs streaming machine learning on event streams | 45 | Emerging |
| 25 | SAP-samples/hana-apl-apis-runtimes | Code examples for SAP HANA Automated Predictive Library (APL). It provides... | 44 | Emerging |
| 26 | sbl-sdsc/mmtf-spark | Methods for the parallel and distributed analysis and mining of the Protein... | 42 | Emerging |
| 27 | viadee/bpmn.ai | Machine learning around business processes | 40 | Emerging |
| 28 | arminmoin/ML-Quadrat | ML-Quadrat (ML2) is a Model-Driven Software Engineering (MDSE) tool with... | 39 | Emerging |
| 29 | microsoft/masc | Microsoft's contributions for Spark with Apache Accumulo | 39 | Emerging |
| 30 | feedzai/feedzai-openml | API for Feedzai's Open Machine Learning that allows to integrate ML... | 39 | Emerging |
| 31 | siddhi-io/siddhi-execution-tensorflow | Extension that adds support for inferences from pre-built TensorFlow SavedModels | 38 | Emerging |
| 32 | shalini0528/big-data-weather-analysis | Big Data weather analysis using Hadoop MapReduce, Apache Hive, Apache Spark,... | 36 | Emerging |
| 33 | jiumao-org/we-mall | A lightweight mall, simple and easy. | 35 | Emerging |
| 34 | adventure-island/springboot-deepar-template | A Java (SpringBoot) template for Java and AWS SageMaker DeepAR model endpoint... | 35 | Emerging |
| 35 | predictiveworks/cdap-spark | A wrapper for Apache Spark to make machine & deep learning available in... | 34 | Emerging |
| 36 | AlanBinu007/AI_Big-Data_Data-Engineering_and_Distributions | Here we created some projects using Kafka, AI, Data virtualization and... | 34 | Emerging |
| 37 | mikeroyal/Apache-Spark-Guide | Apache Spark Guide | 33 | Emerging |
| 38 | iaja/scalaLDAvis | Scala-Spark port of https://github.com/bmabey/pyLDAvis for Apache Spark LDA... | 33 | Emerging |
| 39 | alipay/jpmml-sparkml-lightgbm | JPMML-SparkML plugin for converting LightGBM-Spark models to PMML | 33 | Emerging |
| 40 | chen0040/java-machine-learning-web-api | A simple machine learning web server that caters for small datasets | 32 | Emerging |
| 41 | AxaFrance/spring-ai-workshop | Exploring interactions with LLMs: Practical insights with Spring AI | 32 | Emerging |
| 42 | almo/Machine-Learning | Machine Learning snippets and use cases. | 32 | Emerging |
| 43 | manuparra/taller_SparkR | SparkR workshop for the R Users Conference (Jornadas de Usuarios de R) | 32 | Emerging |
| 44 | senx/warp10-ext-pmml | WarpScript™ PMML Extension | 29 | Experimental |
| 45 | DeathReaper0965/distributed-deeplearning | End to End Distributed Deep Learning Engine, works both with Streaming and... | 28 | Experimental |
| 46 | pneff93/Kafka-R-Realtime-Prediction | This tutorial explains how a machine learning model is applied on real-time data | 27 | Experimental |
| 47 | nicolaskrier/spring-ai-examples | Spring AI Examples | 27 | Experimental |
| 48 | nickozoulis/thunderstorm | Investigating the trade-offs of low latency responses over quality when... | 24 | Experimental |
| 49 | neerajkesav/SparkMLJavaExamples | Apache Spark Machine Learning - Java Examples | 23 | Experimental |
| 50 | kriss024/Spark | Spark for Data Science and ETL process. | 23 | Experimental |
| 51 | galafis/spark-kafka-ml-training-pipeline | Distributed ML training pipeline with Spark processing, Kafka ingestion and... | 22 | Experimental |
| 52 | iamirmasoud/pyspark_tutorials | Machine Learning for Big Data using PySpark with real-world projects | 22 | Experimental |
| 53 | TravelXML/APACHE-SPARK-PYSPARK-DATABRICKS-MACHINE-LEARNING-MLIB | Apache Spark Machine Learning project using MLlib and Linear Regression on... | 22 | Experimental |
| 54 | agoda-com/spark-hpopt | Bayesian hyperparameter tuning for Spark MLlib | 20 | Experimental |
| 55 | siddhi-io/siddhi-gpl-execution-pmml | Siddhi extension to evaluate Predictive Model Markup Language (PMML). | 19 | Experimental |
| 56 | AmrrSalem/Pyspark-Local | Portable self-contained PySpark 3.5 environment for Big Data coursework,... | 19 | Experimental |
| 57 | Chih-Ling-Hsu/Spark-Machine-Learning-Modules | Machine Learning Modules of Spark MLlib | 18 | Experimental |
| 58 | alikemalocalan/Spark-API | Apache Spark Recommendation/Machine Learning API Service | 18 | Experimental |
| 59 | adil-faiyaz98/accelerated-spark-gpu | This repository demonstrates how to significantly accelerate Apache Spark 3... | 18 | Experimental |
| 60 | daugraph/ParameterServer | Parameter Server using Java | 17 | Experimental |
| 61 | GPalfy/socialnetworkcomments | Text Data Analysis & Machine Learning on supermarket's Social... | 17 | Experimental |
| 62 | MinLee0210/kafka-learning | Learning how to use Kafka | 15 | Experimental |
| 63 | Mazennaji/ai-intelligence-platform-java-ml | An all-in-one Java Machine Learning platform integrating fraud detection,... | 15 | Experimental |
| 64 | hevc15hamza/pyspark-airfoil-noise-prediction | Predict airfoil self-noise using PySpark with an end-to-end machine learning... | 14 | Experimental |
| 65 | Sowdeshwar-99/noise-aware-ml-pipeline | Noise-aware ML pipeline for large-scale agricultural yield prediction using... | 14 | Experimental |
| 66 | Sishant123/scala-m9k | 🚀 Streamline big data processing with Scala and M9K, enhancing performance... | 14 | Experimental |
| 67 | Swapnil-2596/scala-aba | 🚀 Transform Scala code into efficient, scalable applications with scala-aba,... | 14 | Experimental |
| 68 | aengusmartindonaire/pyspark-ml-pipeline | PySpark ML classification pipelines for NLP, clinical prediction, and census... | 14 | Experimental |
| 69 | mn-cs/fineweb-spark | FineWeb-Edu dataset analysis using Apache Spark - DSC 232R group project | 14 | Experimental |
| 70 | hinzy97/spark-dynamic-executor-time-prediction | Neural Network Models for Predicting Execution Time with Dynamic Executor... | 13 | Experimental |
| 71 | FadilAdz/praktikumBigData | This repository contains a series of Big Data lab exercises covering storage... | 13 | Experimental |
| 72 | shakha-de/mnist-java-microservice | Spring Boot Microservice for MNIST | 13 | Experimental |
| 73 | sivasurya681/PySpark | PySpark-Roadmap is an 18-day structured learning journey that takes you from... | 13 | Experimental |
| 74 | balaghali/SparkML | Machine Learning with Spark | 11 | Experimental |
| 75 | roll-w/ml-model-platform | A platform for training machine learning models based on SpringBoot. | 11 | Experimental |
| 76 | ccoughlin/GristMill | Distributed Region Of Interest (ROI) finder | 11 | Experimental |
| 77 | JohnSesana/PySpark-Cheat-Sheet | List of useful commands for PySpark | 11 | Experimental |
| 78 | urcuqui/Apache-Spark | This repository has some examples of Hadoop Spark | 11 | Experimental |
| 79 | geosparks/geospark-android-sdk-example | Example app for GeoSpark Android SDK | 10 | Experimental |
| 80 | johnfire/grendelmind | Decision and analysis software for a robot | 10 | Experimental |
| 81 | chouaib-629/hepmassClassification | PySpark pipeline for particle classification in the physics of high... | 10 | Experimental |
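
The tier labels in the listing appear to follow score bands, which can be sketched as a small lookup. This is an inference from the rows shown, not a documented rule. Entries scoring 50 and above are labeled Established. The lower cutoff is an assumption: the listing shows 32 as Emerging and 29 as Experimental, so the real boundary lies somewhere from 30 to 32, and 30 is used here for illustration. The function name `tier_for` is hypothetical.

```python
def tier_for(score: int) -> str:
    """Map a 0-100 quality score to a tier label.

    Cutoffs are inferred from the listing, not taken from any
    published scoring rule: >= 50 is Established, and the
    Emerging/Experimental boundary (assumed 30 here) falls
    somewhere between the observed scores 29 and 32.
    """
    if score >= 50:
        return "Established"
    if score >= 30:  # assumed cutoff, see docstring
        return "Emerging"
    return "Experimental"
```

Under these assumed cutoffs, every row in the listing maps to the tier shown beside it.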