neuml/ragdata
📚 Build knowledge bases for RAG
This project helps AI developers and researchers build knowledge bases for Retrieval Augmented Generation (RAG) applications. It takes raw data from large datasets such as ArXiv and Wikipedia, processes it, and outputs structured embeddings databases. RAG systems then query these databases to retrieve relevant information efficiently.
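The raw-data-to-embeddings-database flow described above can be illustrated with a toy sketch. This is not ragdata's implementation: the bag-of-words `embed` function stands in for a real embedding model, and every name here (`embed`, `cosine`, `build_index`, `search`) is hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words term-count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(documents):
    """Process raw documents into (text, vector) pairs: a minimal 'embeddings database'."""
    return [(doc, embed(doc)) for doc in documents]


def search(index, query, limit=1):
    """Return the stored texts most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:limit]]


# Made-up sample documents, standing in for a large public dataset
docs = [
    "ArXiv hosts preprints of scientific papers",
    "Wikipedia is a free online encyclopedia",
]
index = build_index(docs)
top = search(index, "free encyclopedia")
```

In a RAG application, the texts returned by `search` would be passed to a language model as context for answering the query.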
No commits in the last 6 months. Available on PyPI.
Use this if you are an AI developer or researcher looking to create or utilize pre-built knowledge bases from common public datasets for RAG models.
Not ideal if you need to build knowledge bases from proprietary or highly specialized internal datasets not already supported by this tool.
Stars: 32
Forks: 2
Language: Python
License: Apache-2.0
Category:
Last pushed: Jul 03, 2025
Commits (30d): 0
Dependencies: 5
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/neuml/ragdata"
Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
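The same request can be made from Python using only the standard library. This is a minimal sketch under two assumptions the page does not confirm: that the endpoint returns JSON, and that the URL follows a `<category>/<owner>/<repo>` pattern generalized from the single curl example above. The helper names `build_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request

# Base path taken from the curl example; the layout beyond it is an assumption.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_url(category, repo):
    """Build the request URL, e.g. category="rag", repo="neuml/ragdata"."""
    return f"{BASE_URL}/{category}/{repo}"


def fetch_quality(category, repo, timeout=10):
    """Fetch the quality data for a repository.

    Assumes the response body is JSON; adjust the parsing if it is not.
    """
    with urllib.request.urlopen(build_url(category, repo), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


url = build_url("rag", "neuml/ragdata")
# data = fetch_quality("rag", "neuml/ragdata")  # performs a network request
```

Unauthenticated calls count against the 100 requests/day limit, so it is worth caching the result locally if the data is polled repeatedly.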
Higher-rated alternatives
ItzCrazyKns/Vane
Vane is an AI-powered answering engine.
ConardLi/easy-dataset
A powerful tool for creating datasets for LLM fine-tuning, RAG, and evaluation
xuwei95/ezdata
A data processing and task scheduling system built with Python and large language models (LLMs)...
ModelEngine-Group/DataMate
DataMate is an enterprise-level data processing platform designed for model fine-tuning and RAG...
DS4SD/deepsearch-toolkit
Interact with the Deep Search platform for new knowledge explorations and discoveries