alireza-heidarii/Real-Time-Data-Cleaning-Pipeline-for-Medical-and-Healthcare-Data
A real-time data cleaning pipeline for medical and healthcare data using Apache Spark, SparkNLP, Spark Streaming, and Kafka.
This project helps healthcare organizations automatically clean and extract key information from rapidly arriving medical and healthcare text. It takes in raw, unstructured text streams, often containing HTML, and outputs structured, cleaned data ready for analysis. It is designed for data engineers and IT professionals who manage data infrastructure in a healthcare setting.
No commits in the last 6 months.
Use this if you need to process large volumes of incoming medical text data in real-time, cleaning it and identifying sensitive information before storage.
Not ideal if your data is not streaming or if you primarily need in-depth natural language understanding beyond entity recognition and redaction.
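To make the description above concrete, here is a minimal sketch of the per-record step it describes: strip HTML from a raw text record, normalize whitespace, and redact sensitive-looking values. It is not taken from the repository, which uses Spark, SparkNLP, and Kafka. This version uses only the Python standard library, and the names `clean_record`, `SSN_RE`, and `PHONE_RE` as well as the two regular expressions are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


# Hypothetical patterns for illustration only; real de-identification
# of medical text needs far broader coverage than two regexes.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE_RE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")


def clean_record(raw: str) -> dict:
    """Strip HTML, collapse whitespace, and redact matches of the patterns above."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()

    # Join text nodes and collapse any run of whitespace into one space.
    text = " ".join(" ".join(parser.parts).split())

    # Replace each match with a placeholder and count the replacements.
    text, ssn_count = SSN_RE.subn("[SSN]", text)
    text, phone_count = PHONE_RE.subn("[PHONE]", text)

    return {"text": text, "redaction_count": ssn_count + phone_count}
```

In a streaming setup, a function like this would typically be applied to each incoming message before the record is stored. A named-entity recognition model would take the place of the regular expressions for anything beyond fixed-format identifiers.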
Stars: 13
Forks: 2
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 18, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/data-engineering/alireza-heidarii/Real-Time-Data-Cleaning-Pipeline-for-Medical-and-Healthcare-Data"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
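The same request can be made from Python. The sketch below mirrors the curl call using only the standard library. The path layout (`category/owner/repo`) is inferred from the single URL shown above and may not hold for other entries. The function names `build_quality_url` and `fetch_quality` are hypothetical. The response format is not documented here, so the body is returned as raw text rather than parsed.

```python
import urllib.request
from urllib.parse import quote

# Base path copied from the curl example; everything after it is an
# assumed layout of category/owner/repo.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL, percent-encoding each path segment."""
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0) -> str:
    """Fetch the endpoint and return the response body as text (format unverified)."""
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_quality("data-engineering", "alireza-heidarii", "Real-Time-Data-Cleaning-Pipeline-for-Medical-and-Healthcare-Data")` should request the same URL as the curl command. Unauthenticated requests count against the daily limit mentioned above.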
Higher-rated alternatives
melvynator/ELK_twitter
This is a data pipeline for Twitter (ETL) using the elastic stack Elasticsearch, Logstash and...
sergio11/covid_tweets_etl_architecture
This is a learning-focused POC that explores a microservices ETL architecture for real-time...
Wazzabeee/pyspark-etl-twitter
Implementation of an ETL process for real-time sentiment analysis of tweets with Docker, Apache...
msloan10/Bitcoin-Dashboard
A Bitcoin dashboard that incorporates sentiment analysis using Twitter data.
adilsaid64/sentiment-stream
An end-to-end real-time data streaming pipeline that leverages Kafka and Spark Streaming to...