Pro-GenAI/AutoPureData

Automated Filtering of Undesirable Web Data to Update LLM Knowledge

Score: 35 / 100 (Emerging)

This project helps anyone maintaining a large language model (LLM) keep its knowledge current with web information. It takes raw, unfiltered web data and automatically cleans it, removing unsafe content, text from unreliable sources, personal details, and adversarial attacks. The output is a refined dataset for updating your LLM, intended to help it give accurate and safe responses.
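To make the four filtering categories concrete, here is a minimal rule-based sketch in Python. It is not AutoPureData's actual implementation, which this page does not describe. The names `reject_reason` and `filter_records`, the record shape (`url` and `text` keys), and every blocklist entry and pattern are hypothetical placeholders chosen for illustration.

```python
import re
from urllib.parse import urlparse

# Every list and pattern below is a hypothetical placeholder,
# not taken from AutoPureData.
UNTRUSTED_DOMAINS = {"example-spam.test"}
UNSAFE_TERMS = {"how to build a bomb"}
INJECTION_PHRASES = ("ignore previous instructions", "disregard the system prompt")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def reject_reason(record):
    """Return why a record should be dropped, or None if it passes every check."""
    text = record["text"].lower()
    domain = urlparse(record["url"]).netloc.lower()
    if domain in UNTRUSTED_DOMAINS:
        return "unreliable source"
    if any(term in text for term in UNSAFE_TERMS):
        return "unsafe content"
    if any(phrase in text for phrase in INJECTION_PHRASES):
        return "adversarial attack"
    if EMAIL_RE.search(text) or PHONE_RE.search(text):
        return "personal details"
    return None


def filter_records(records):
    """Split records into (kept, dropped); dropped pairs each record with its reason."""
    kept, dropped = [], []
    for record in records:
        reason = reject_reason(record)
        if reason is None:
            kept.append(record)
        else:
            dropped.append((record, reason))
    return kept, dropped
```

Keeping the rejection reason next to each dropped record makes the filter auditable. Simple patterns like these will produce false positives (the phone pattern can match other long digit runs, for example), so a real pipeline would likely need stronger classifiers for each category.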

No commits in the last 6 months.

Use this if you need to feed your LLM the latest web information but are concerned about the quality and safety of raw internet data.

Not ideal if you need a solution for a production environment, as this project is currently intended for educational and research purposes only.

Tags: LLM-maintenance, data-curation, web-scraping, AI-safety, knowledge-base-update
Status: Stale 6m, No Package, No Dependents
Maintenance 2 / 25
Adoption 4 / 25
Maturity 16 / 25
Community 13 / 25
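The four category scores, each out of 25, add up to the overall 35 out of 100. The short check below assumes the overall score is a plain sum of the categories. The listed numbers are consistent with that, but the page does not state the formula, and `overall_score` is an illustrative name.

```python
# Category scores as listed on this page; each is out of 25.
category_scores = {
    "Maintenance": 2,
    "Adoption": 4,
    "Maturity": 16,
    "Community": 13,
}


def overall_score(scores):
    """Sum the category scores (assumption: the overall score is a plain sum)."""
    return sum(scores.values())


print(overall_score(category_scores))  # prints 35
```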


Stars: 8
Forks: 2
Language: Jupyter Notebook
License: (not listed)
Last pushed: Sep 18, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/rag/Pro-GenAI/AutoPureData"
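The same request can be made from Python with only the standard library. This is a sketch under one unverified assumption: that the endpoint returns a JSON body, which the page does not specify. The helper names `quality_url` and `fetch_quality` are illustrative and not part of any published client.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path copied from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/rag"


def quality_url(owner, repo):
    """Build the endpoint URL, percent-encoding each path segment."""
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner, repo, timeout=10):
    """Fetch the quality data for a repository.

    Assumption (unverified): the response body is JSON.
    """
    with urllib.request.urlopen(quality_url(owner, repo), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("Pro-GenAI", "AutoPureData")` performs the network request. If you poll many repositories, cache the results locally to stay within the daily request limit.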

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.