All Data Engineering Tools

1,297 tools ranked by quality score · Page 2 of 13

Showing 101–200 of 1,297
# Tool Score Tier
101 apache/wayang

Apache Wayang is the first cross-platform data processing system.

61
Established
102 Breeze0806/go-etl

go-etl is a toolset for data extraction, transformation and loading.

61
Established
103 DataKitchen/dataops-testgen

DataOps Data Quality TestGen is part of DataKitchen's Open Source Data...

61
Established
104 quixio/quix-streams

Python Streaming DataFrames for Kafka

60
Established
105 langchain-ai/langchain-postgres

LangChain abstractions backed by Postgres Backend

60
Established
106 HTTP-RPC/Kilo

Lightweight REST for Java

60
Established
107 cnstlungu/portable-data-stack-dagster

A portable Datamart and Business Intelligence suite built with Docker,...

60
Established
108 bitol-io/open-data-contract-standard

Home of the Open Data Contract Standard (ODCS).

60
Established
109 jtablesaw/tablesaw

Java dataframe and visualization library

60
Established
110 turbot/steampipe-plugin-github

Use SQL to instantly query repositories, users, gists and more from GitHub....

59
Established
111 linkedpipes/etl

LinkedPipes ETL is an RDF based, lightweight ETL tool

59
Established
112 vmware/versatile-data-kit

One framework to develop, deploy and operate data workflows with Python and SQL.

59
Established
113 bacalhau-project/bacalhau

Community-driven, simple, yet powerful framework for fast, cost-effective...

59
Established
114 KipData/KiteSQL

Fast. Embedded. Rust-native SQL database.

59
Established
115 metafacture/metafacture-core

Core package of the Metafacture tool suite for metadata processing.

59
Established
116 RumbleDB/rumble

Quick start: pip install jsoniq ⛈️ RumbleDB 2.0.0 "Lemon Ironwood" 🌳 for...

59
Established
117 AbsaOSS/cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

59
Established
118 dalenewman/Transformalize

Configurable Extract, Transform, and Load

59
Established
119 dotnet/spark

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

59
Established
120 ucbepic/docetl

A system for agentic LLM-powered data processing and ETL

59
Established
121 vibhorkum/pg_background

Production-grade PostgreSQL extension to execute arbitrary SQL in background...

59
Established
122 turbot/steampipe

Zero-ETL, infinite possibilities. Live query APIs, code & more with SQL. No...

59
Established
123 DataTalksClub/data-engineering-zoomcamp

Data Engineering Zoomcamp is a free 9-week course on building...

59
Established
124 cparmet/pandas-checks

🐼🩺 Pandas Checks: Non-invasive health checks for Pandas method chains

59
Established
125 rudderlabs/rudder-server

Privacy and Security focused Segment-alternative, in Golang and React

58
Established
126 uptake/uptasticsearch

An Elasticsearch client tailored to data science workflows.

58
Established
127 dashmug/glue-utils

glue-utils makes AWS Glue jobs less repetitive, more type-safe, and easier...

58
Established
128 Data-Centric-AI-Community/ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas...

58
Established
129 xorq-labs/xorq

A compute manifest and composable tools for data, built on Ibis, DataFusion,...

58
Established
130 snowplow/enrich

Snowplow Enrichment jobs and library

58
Established
131 dagster-io/community-integrations

Community supported integrations for the Dagster platform.

58
Established
132 9tigerio/db2rest

Instant no code DATA API platform for relational databases. Connect any...

58
Established
133 h2oai/sparkling-water

Sparkling Water provides H2O functionality inside Spark cluster

57
Established
134 dfpc-coe/CloudTAK

TAK Compatible, browser based Common Operation Picture & Situational Awareness tool

57
Established
135 datazip-inc/olake-ui

Frontend & BFF (Backend for frontend) for Olake. This includes the UI code...

57
Established
136 nshiab/simple-data-analysis

Easy-to-use and high-performance TypeScript library for data analysis. Works...

57
Established
137 turbot/steampipe-plugin-kubernetes

Use SQL to instantly query Kubernetes API resources. Open source CLI. No DB required.

57
Established
138 benjamin-awd/monopoly

Monopoly is a Python library & CLI that converts bank statement PDFs to CSV.

57
Established
139 evinism/mistql

A query / expression language for performing computations on JSON-like...

57
Established
140 turbot/steampipe-plugin-gcp

Use SQL to instantly query GCP resources across regions, projects and...

57
Established
141 dataflint/spark

Drop-in replacement for Apache Spark UI

57
Established
142 debba/tabularis

A lightweight, developer-focused database management tool. Supports MySQL,...

57
Established
143 turbot/steampipe-plugin-azure

Use SQL to instantly query Azure resources across regions and subscriptions....

56
Established
144 CogStack/CogStack-NiFi

Building data processing pipelines for documents processing with NLP using...

56
Established
145 snowplow/dbt-snowplow-web

A fully incremental model, that transforms raw web event data generated by...

56
Established
146 ICIJ/extract

A cross-platform command line tool for parallelised content extraction and analysis.

56
Established
147 alibaba/feathub

FeatHub - A stream-batch unified feature store for real-time machine learning

56
Established
148 starlake-ai/starlake

Declarative text based tool for data analysts and engineers to extract,...

56
Established
149 flowsynx/flowsynx

A deterministic orchestrator for composable micro-workflows with reusable modules

56
Established
150 mozilla/python_mozetl

ETL jobs for Firefox Telemetry

56
Established
151 techascent/tech.ml.dataset

A Clojure high performance data processing system

56
Established
152 reductstore/reductstore

High Performance Storage and Streaming Solution for Data Acquisition Systems

56
Established
153 DataSQRL/sqrl

Data Pipeline Automation Framework to build MCP servers, data APIs, and data...

55
Established
154 nodestream-proj/nodestream

A Declarative framework for Building, Maintaining, and Analyzing Graph Data

55
Established
155 odpi/egeria-docs

Documentation repository for the Egeria project.

55
Established
156 kay-ou/SimTradeData

SimTradeData is a utility library supporting SimTradeDesk, SimTradeLab and...

55
Established
157 Guepard-Corp/qwery-core

The Boring query platform - Connect and query anything

55
Established
158 Snowflake-Labs/emerging-solutions-toolbox

The Emerging Solutions Toolbox is a collection of solutions created by...

55
Established
159 turbot/steampipe-plugin-sdk

Steampipe Plugin SDK is a simple abstraction layer to write a Steampipe...

55
Established
160 docwire/docwire

DocWire SDK: Award-winning modern data processing in C++20. SourceForge...

55
Established
161 OHDSI/ETL-Synthea

A package supporting the conversion from Synthea CSV to OMOP CDM

54
Established
162 cnstlungu/portable-data-stack-mage

A portable Datamart and Business Intelligence suite built with Docker, Mage,...

54
Established
163 microsoft/unified-data-foundation-with-fabric-solution-accelerator

Unified Data Foundation with Microsoft Fabric with Options to Integrate with...

54
Established
164 apache/doris-kafka-connector

Kafka Connector for Apache Doris

54
Established
165 airyhq/airy

💬 Open Source App Framework to build streaming apps with real-time data - 💎...

54
Established
166 turbot/steampipe-plugin-jira

Use SQL to instantly query Jira. Open source CLI. No DB required.

54
Established
167 dflib/dflib

In-memory Java DataFrame library

54
Established
168 akmalsoliev/Validoopsie

A simple and easy to use Data Validation library for Python.

54
Established
169 heavyai/heavydb

HeavyDB (formerly MapD/OmniSciDB)

54
Established
170 tower/tower-cli

Next generation compute platform for the post-modern data stack

54
Established
171 kanton-bern/hellodata-be

The Open-Source Enterprise Data Platform in a single Portal

54
Established
172 cnstlungu/portable-data-stack-airflow

A portable Datamart and Business Intelligence suite built with Docker,...

54
Established
173 bytewax/bytewax

Python Stream Processing

53
Established
174 GovHub-br/gov-hub

GovHub - Turning Data into Value for Public Administration

53
Established
175 rpsft/etlbox

A lightweight ETL (extract, transform, load) library and data integration...

53
Established
176 elyra-ai/pipeline-editor

Common pipeline-editor components used in different clients (e.g. Elyra...

53
Established
177 mprove-io/mprove

Open Source Business Intelligence with Malloy Semantic Layer 🎉

53
Established
178 dbt-labs/jaffle-shop

🥪🦘 An open source sandbox project exploring dbt workflows via a fictional...

53
Established
179 opensnowcat/opensnowcat-enrich

OpenSnowcat Enricher (Apache 2.0 License)

53
Established
180 GitBrincie212/ChronoGrapher

Powerful, developer-experience centric, blazingly fast and extensible job...

53
Established
181 caiopizzol/cnpj-data-pipeline

Open-source pipeline that downloads Receita Federal data and processes it into PostgreSQL

52
Established
182 spitfireuptown/datalinkx

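🔥🔥 DatalinkX: a data synchronization system between heterogeneous data sources, supporting incremental or full synchronization of massive data volumes, as well as data flow between sources such as HTTP, Oracle, MySQL, and ES,...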

52
Established
183 SentryPeer/SentryPeer

Protect your SIP Servers from bad actors at https://sentrypeer.org

52
Established
184 arkflow-rs/arkflow

High performance Rust stream processing engine seamlessly integrates AI...

52
Established
185 kalininalab/DataSAIL

DataSAIL is a tool to split datasets while reducing information leakage.

52
Established
186 zazuko/barnard59

An intuitive and flexible RDF pipeline solution designed to simplify and...

52
Established
187 treeverse/charts

Helm charts

52
Established
188 turbot/steampipe-plugin-slack

Use SQL to instantly query users, channels, emoji and more from your Slack...

52
Established
189 fdmorison/tiozin

Tiozin, your friendly ETL framework

52
Established
190 evdubs/oic-options-chains

ETL for OIC Options Chains

52
Established
191 Edwardvaneechoud/Flowfile

Flowfile is a visual ETL tool and Python library combining drag-and-drop...

52
Established
192 Bruno-Furtado/cloud-cnpj

Ingestion, preparation, and free provision of CNPJ data of...

52
Established
193 PFund-Software-Ltd/pfeed

Data Engine for Manual/Algo Trading: Download/Stream -> Clean -> Store....

52
Established
194 AndreaBozzo/dataprof

Library and CLI for profiling tabular data

52
Established
195 lakevision-project/lakevision

Lakevision is a tool which provides insights into your Apache Iceberg based...

52
Established
196 halestudio/hale

(Spatial) data harmonisation with hale»studio (formerly HUMBOLDT Alignment Editor)

52
Established
197 DataRecce/recce

The data-validation toolkit for enhanced dbt (data build tool) PR review

52
Established
198 ara3d/bim-open-schema

Representing BIM Data as Parquet

52
Established
199 DataKitchen/data-observability-installer

Installer for DataKitchen's Open Source Data Observability Products. Data...

52
Established
200 turbot/steampipe-plugin-azuread

Use SQL to instantly query groups, service principals, users and more from...

52
Established