harvard-lil/warc-gpt
WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.
This tool helps archivists, researchers, and historians explore web archive collections. It takes Web ARChive (WARC) files, extracts text from HTML and PDF records, and lets you ask questions about their content through a chat interface. The answers are derived from the archived data, which makes large collections easier to understand without manually sifting through files.
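The extract-then-ask flow described above can be illustrated with a minimal, standard-library-only sketch. It is not warc-gpt's implementation: reading records out of a WARC file is omitted (a WARC parsing library would normally handle that), keyword overlap stands in for whatever similarity search a real retrieval pipeline uses, and every name here (`TextExtractor`, `html_to_text`, `chunk`, `retrieve`, `build_prompt`) is hypothetical.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML payload as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def chunk(text, size=40):
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def tokenize(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, chunks, top_k=2):
    """Rank chunks by token overlap with the question (assumed stand-in for vector search)."""
    q = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, context):
    """Assemble the text that would be sent to a language model."""
    joined = "\n---\n".join(context)
    return f"Answer using only this archived content:\n{joined}\n\nQuestion: {question}"
```

In this sketch, each HTML record's payload would pass through `html_to_text` and `chunk`, and the chunks returned by `retrieve` would be placed into the prompt built by `build_prompt`. PDF records would need a separate text extractor, which is not shown.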
270 stars. No commits in the last 6 months.
Use this if you need to quickly find information or ask questions about content stored within Web ARChive (WARC) files and want to leverage AI for intelligent retrieval.
Not ideal if you primarily need to preserve web pages as they appeared visually, rather than analyze their text content.
Stars: 270
Forks: 25
Language: Python
License: MIT
Category:
Last pushed: Feb 11, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/harvard-lil/warc-gpt"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
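The same request can be made from Python. The sketch below builds the endpoint URL from the `curl` example and returns the raw response body. The helper names (`quality_url`, `fetch_quality`) are hypothetical, the response format is not documented here so the body is returned undecoded beyond UTF-8, and the `rag` path segment is simply copied from the URL above rather than treated as a parameter.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base path copied from the curl example; the "rag" segment is kept as-is.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/rag"


def quality_url(owner, repo):
    """Build the endpoint URL for a given owner/repo pair."""
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner, repo, timeout=10.0):
    """Fetch the raw response body as text (response format is an assumption)."""
    with urlopen(quality_url(owner, repo), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_quality("harvard-lil", "warc-gpt")` would issue the same GET request as the `curl` command and count against the daily request limit.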
Higher-rated alternatives
Kain-90/RAG-Play
An interactive visualization tool for understanding Retrieval-Augmented Generation (RAG) pipelines.
rryam/LumoKit
Swift package for on-device Retrieval-Augmented Generation (RAG)
CoIR-team/coir
(ACL 2025 Main) A Comprehensive Benchmark for Code Information Retrieval.
constacts/ragtacts
RAG (Retrieval-Augmented Generation) for Evolving Data
giuliano-t/openAI-to-freeCAD-workflow
This project uses a Large Language Model (LLM) with Retrieval-Augmented Generation (RAG) to...