Pro-GenAI/AutoPureData

Automated Filtering of Undesirable Web Data to Update LLM Knowledge

/ 100

Emerging

This project helps anyone maintaining a large language model (LLM) keep its knowledge up-to-date with current web information. It takes raw, unfiltered web data and automatically cleans it, removing unsafe content, unreliable sources, personal details, and adversarial attacks. The output is a refined dataset ready for updating your LLM, ensuring it provides accurate and safe responses.

No commits in the last 6 months.

Use this if you need to feed your LLM the latest web information but are concerned about the quality and safety of raw internet data.

Not ideal if you need a solution for a production environment, as this project is currently intended for educational and research purposes only.

LLM-maintenance data-curation web-scraping AI-safety knowledge-base-update

Stale 6m No Package No Dependents

Maintenance 2 / 25

Adoption 4 / 25

Maturity 16 / 25

Community 13 / 25

How are scores calculated?

Stars

Forks

Language

Jupyter Notebook

License

—

Higher-rated alternatives

WangRongsheng/awesome-LLM-resources

🧑‍🚀 全世界最好的LLM资料总结（多模态生成、Agent、辅助编程、AI审稿、数据处理、模型训练、模型推理、o1 模型、MCP、小语言模型、视觉语言模型） | Summary of the...

SylphAI-Inc/AdalFlow

AdalFlow: The library to build & auto-optimize LLM applications.

LazyAGI/LazyLLM

Easiest and laziest way for building multi-agent LLMs applications.

luhengshiwo/LLMForEverybody

每个人都能看懂的大模型知识分享，LLMs春/秋招大模型面试前必看，让你和面试官侃侃而谈

katanaml/sparrow

Structured data extraction and instruction calling with ML, LLM and Vision LLM

Explore RAG Tools

All categories Trending RAG directory Insights