alireza-heidarii/Real-Time-Data-Cleaning-Pipeline-for-Medical-and-Healthcare-Data

A real-time data cleaning pipeline for medical and healthcare data using Apache Spark, SparkNLP, Spark Streaming, and Kafka.

/ 100

Emerging

This project helps healthcare organizations automatically clean streaming medical and healthcare text and extract key information from it. It takes in raw, unstructured text streams, often containing HTML, and outputs structured, cleaned data ready for analysis. It is designed for data engineers and IT professionals who manage data infrastructure in a healthcare setting.
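To make the cleaning step concrete, here is a minimal single-record sketch in plain Python. It is not the project's code: the project is described as using Apache Spark, SparkNLP, Spark Streaming, and Kafka, whereas this sketch swaps in the standard library's `html.parser` and two simple regular expressions in place of NLP-based entity recognition. The names `clean_record`, `EMAIL_RE`, and `PHONE_RE` and the `[EMAIL]` / `[PHONE]` placeholders are hypothetical, chosen only for illustration.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops all tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


# Assumption: deliberately simple patterns for illustration only.
# Real PII detection needs far broader coverage than these two regexes.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def clean_record(raw: str) -> dict:
    """Strip HTML from one raw record and mask email/phone patterns."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any text still buffered after the last tag

    # Join text nodes with a space, then collapse runs of whitespace.
    # Caveat: this also puts a space where an inline tag splits a word.
    text = " ".join(" ".join(parser.parts).split())

    text, n_email = EMAIL_RE.subn("[EMAIL]", text)
    text, n_phone = PHONE_RE.subn("[PHONE]", text)
    return {"text": text, "redactions": n_email + n_phone}
```

For example, `clean_record("<p>Patient called from 555-123-4567.</p><p>Contact: jane.doe@example.com</p>")` returns `{"text": "Patient called from [PHONE]. Contact: [EMAIL]", "redactions": 2}`. In a streaming setup, a function of this shape would typically be applied to each incoming message before the result is stored.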

No commits in the last 6 months.

Use this if you need to process large volumes of incoming medical text in real time, cleaning it and identifying sensitive information before storage.

Not ideal if your data is not streaming or if you primarily need in-depth natural language understanding beyond entity recognition and redaction.

healthcare data, medical records, PII redaction, real-time analytics, data pipeline

Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 5 / 25

Maturity 16 / 25

Community 11 / 25


Language: Python

License: Apache-2.0

Higher-rated alternatives

melvynator/ELK_twitter

This is a data pipeline for Twitter (ETL) using the elastic stack Elasticsearch, Logstash and...

sergio11/covid_tweets_etl_architecture

📚🧪 This is a learning-focused POC that explores a microservices ETL architecture for real-time...

Wazzabeee/pyspark-etl-twitter

Implementation of an ETL process for real-time sentiment analysis of tweets with Docker, Apache...

msloan10/Bitcoin-Dashboard

A Bitcoin dashboard that incorporates sentiment analysis using Twitter data.

adilsaid64/sentiment-stream

An end-to-end real-time data streaming pipeline that leverages Kafka and Spark Streaming to...
