alireza-heidarii/Real-Time-Data-Cleaning-Pipeline-for-Medical-and-Healthcare-Data

A real-time data cleaning pipeline for medical and healthcare data using Apache Spark, SparkNLP, Spark Streaming, and Kafka.

32 / 100 (Emerging)

This project helps healthcare organizations automatically clean and extract key information from rapidly arriving medical and healthcare text. It takes in raw, unstructured text streams, often containing HTML, and outputs structured, cleaned data in a format ready for analysis. It is designed for data engineers and IT professionals who manage data infrastructure in a healthcare setting.
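As a rough illustration of the "raw HTML in, structured record out" step described above, here is a minimal sketch that uses only the Python standard library. It is not the project's implementation, which runs on Spark and SparkNLP; the names `clean_record` and `_TextExtractor` and the output fields `text` and `length` are hypothetical and chosen only for this example.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips <script>/<style> content (illustrative helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Joining with a space avoids gluing adjacent blocks together;
        # it can split a word that is interrupted by an inline tag.
        return " ".join(self._chunks)


def clean_record(raw_html: str) -> dict:
    """Strip markup, collapse whitespace, and return a small structured record."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = re.sub(r"\s+", " ", parser.text()).strip()
    return {"text": text, "length": len(text)}


if __name__ == "__main__":
    sample = "<p>Patient reports <b>mild</b> fever.</p><script>x=1</script>"
    print(clean_record(sample))
```

In a streaming setup, a function of this shape would typically be applied to each incoming message before any entity recognition runs; how the actual pipeline wires this into Spark Streaming and Kafka is not described on this page.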

No commits in the last 6 months.

Use this if you need to process large volumes of incoming medical text in real time, cleaning it and identifying sensitive information before storage.

Not ideal if your data is not streaming or if you primarily need in-depth natural language understanding beyond entity recognition and redaction.
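To make "identifying sensitive information before storage" concrete, the sketch below redacts a few common identifier formats with regular expressions. This is a deliberately simple stand-in: the project is described as using SparkNLP for entity recognition, and the patterns, labels, and the `redact` function here are assumptions for illustration, not the repository's rules. Real PII detection needs far broader coverage than three regexes.

```python
import re

# Hypothetical patterns for illustration only; US-style formats are assumed.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def redact(text: str):
    """Replace each match with a [LABEL] placeholder and report which labels fired."""
    found = []
    for label, pattern in PATTERNS.items():
        def _sub(match, label=label):
            found.append(label)
            return f"[{label}]"

        text = pattern.sub(_sub, text)
    return text, found


if __name__ == "__main__":
    redacted, labels = redact("Contact jane@example.org or 555-123-4567, SSN 123-45-6789.")
    print(redacted)
    print(labels)
```

Returning the list of labels alongside the redacted text lets a downstream step log what was removed without storing the sensitive values themselves.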

Tags: healthcare data, medical records, PII redaction, real-time analytics, data pipeline
Status: Stale 6m, No Package, No Dependents
Maintenance 0 / 25
Adoption 5 / 25
Maturity 16 / 25
Community 11 / 25


Stars: 13
Forks: 2
Language: Python
License: Apache-2.0
Last pushed: Mar 18, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/data-engineering/alireza-heidarii/Real-Time-Data-Cleaning-Pipeline-for-Medical-and-Healthcare-Data"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
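The same request as the curl command above can be sketched in Python with the standard library. The URL layout (category, owner, repository) is taken from that command; the assumption that the endpoint returns JSON is not confirmed by this page, and the helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

BASE = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL, percent-encoding each path segment."""
    parts = [quote(p, safe="") for p in (category, owner, repo)]
    return "/".join([BASE, *parts])


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the quality data; assumes (unverified) that the body is JSON."""
    with urllib.request.urlopen(quality_url(category, owner, repo), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    data = fetch_quality(
        "data-engineering",
        "alireza-heidarii",
        "Real-Time-Data-Cleaning-Pipeline-for-Medical-and-Healthcare-Data",
    )
    print(data)
```

With the keyless limit of 100 requests/day stated above, it is worth caching responses locally rather than calling the endpoint on every run.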