harvard-lil/warc-gpt

WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.


Emerging

This tool helps archivists, researchers, and historians explore web archive collections. It takes Web ARChive (WARC) files, extracts text from their HTML and PDF records, and lets you ask questions about that content through a chat interface. Answers are derived from the archived data, so you can make sense of large collections without manually sifting through files.
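The extract, retrieve, and ask flow described above can be sketched in a few lines. This is an illustrative sketch only, not WARC-GPT's actual code: it assumes the HTML payloads have already been read out of WARC response records into a dict keyed by URL, and it substitutes a simple bag-of-words cosine score for the embedding-based retrieval a RAG pipeline would normally use. All names here (`html_to_text`, `retrieve`, `build_prompt`) are hypothetical.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Step 1: reduce an HTML payload to plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, records, top_k=1):
    """Step 2: rank records by similarity to the question (assumed scoring)."""
    query = Counter(tokenize(question))
    scored = []
    for url, html in records.items():
        text = html_to_text(html)
        scored.append((cosine(query, Counter(tokenize(text))), url, text))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


def build_prompt(question, hits):
    """Step 3: assemble the context a chat model would be asked to answer from."""
    context = "\n".join(f"[{url}] {text}" for _, url, text in hits)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical payloads standing in for records read from a WARC file.
records = {
    "https://example.org/a": "<html><body><h1>Library hours</h1>"
    "<p>The library opens at 9am.</p></body></html>",
    "https://example.org/b": "<html><head><style>p{color:red}</style></head>"
    "<body><p>Parking permits cost ten dollars.</p></body></html>",
}
question = "When does the library open?"
print(build_prompt(question, retrieve(question, records)))
```

In a real setup, the `records` dict would be populated by a WARC reader, and the prompt would be sent to a language model rather than printed.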

270 stars. No commits in the last 6 months.

Use this if you need to find information in, or ask questions about, content stored in WARC files and want AI-assisted retrieval rather than manual search.

Not ideal if you primarily need to preserve web pages as they appeared visually, rather than analyze their text content.

web-archiving, digital-humanities, research-analysis, information-retrieval, historical-research

Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 15 / 25


Stars: 270

Language: Python

License: MIT

Higher-rated alternatives

Kain-90/RAG-Play

An interactive visualization tool for understanding Retrieval-Augmented Generation (RAG) pipelines.

rryam/LumoKit

Swift package for on-device Retrieval-Augmented Generation (RAG)

CoIR-team/coir

(ACL 2025 Main) A Comprehensive Benchmark for Code Information Retrieval.

constacts/ragtacts

RAG (Retrieval-Augmented Generation) for Evolving Data

giuliano-t/openAI-to-freeCAD-workflow

This project uses a Large Language Model (LLM) with Retrieval-Augmented Generation (RAG) to...
