Spark Hadoop ML Pipelines Data Engineering Tools

20 Spark, Hadoop, and ML pipeline tools are tracked. 10 score above 50 (Established tier). The highest-rated is knime/knime-core at 68/100 with 772 stars. 2 of the top 10 are actively maintained.

Get all 20 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=data-engineering&subcategory=spark-hadoop-ml-pipelines&limit=20"

Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
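The same request can be made from a script. The sketch below is a minimal example that uses only the Python standard library and the endpoint and query parameters shown in the curl command above. The helper names `build_url` and `fetch_projects` are hypothetical, and the shape of the JSON response is not documented here, so the code parses it without assuming any field names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="data-engineering",
              subcategory="spark-hadoop-ml-pipelines",
              limit=20):
    """Build the request URL (hypothetical helper, not part of any official client)."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=10):
    """Fetch the URL and parse the body as JSON.

    The response structure is an unknown here, so the parsed object is
    returned as-is without assuming any keys.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_projects(build_url())` performs one request, which counts against the 100 requests/day keyless limit. Inspect the returned object before relying on any particular field.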

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | knime/knime-core | KNIME Analytics Platform | 68 | Established |
| 2 | sparklyr/sparklyr | R interface for Apache Spark | 61 | Established |
| 3 | apache/wayang | Apache Wayang is the first cross-platform data processing system. | 61 | Established |
| 4 | quixio/quix-streams | Python Streaming DataFrames for Kafka | 60 | Established |
| 5 | jtablesaw/tablesaw | Java dataframe and visualization library | 60 | Established |
| 6 | RumbleDB/rumble | Quick start: pip install jsoniq ⛈️ RumbleDB 2.0.0 "Lemon Ironwood" 🌳 for... | 59 | Established |
| 7 | dotnet/spark | .NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers. | 59 | Established |
| 8 | h2oai/sparkling-water | Sparkling Water provides H2O functionality inside Spark cluster | 57 | Established |
| 9 | evinism/mistql | A query / expression language for performing computations on JSON-like... | 57 | Established |
| 10 | byzer-org/byzer-lang | Byzer (former MLSQL): A low-code open-source programming language for data... | 51 | Established |
| 11 | mc2-project/opaque-sql | An encrypted data analytics platform | 49 | Emerging |
| 12 | viadee/camunda-kafka-polling-client | Stream your process history to Kafka | 36 | Emerging |
| 13 | Smart-Shaped/chaM3Leon | By Smart Shaped s.r.l. (https://www.smartshaped.com/) | 35 | Emerging |
| 14 | rhinempi/sparkhit | sparkhit - analyzing large scale genomic data on the cloud | 33 | Emerging |
| 15 | perguard/pg-streaming-performance-data | Data collection, feature engineering and machine learning of performance traces | 31 | Emerging |
| 16 | AvaAvarai/Java-Parallel-Coordinates-Vis | Java Parallel Coordinates Visualization Tool, to visualize... | 30 | Emerging |
| 17 | dhchenx/Catla-HS | Catla for Hadoop and Spark (Catla-HS): An open-source system to support... | 29 | Experimental |
| 18 | maengsanha/bigdata | KMU CS Hot Topics in Big Data | 28 | Experimental |
| 19 | aymane-maghouti/Big-Data-Project | This project aims to predict smartphone prices using a combination of batch... | 27 | Experimental |
| 20 | maistrovyi/actio | actio | 10 | Experimental |