harvard-lil/warc-gpt

WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.

Score: 41 / 100 (Emerging)

This tool helps archivists, researchers, and historians explore web archive collections. It takes Web ARChive (WARC) files, extracts text from their HTML and PDF records, and lets you ask questions about that content through a chat interface. Answers are derived from the archived data, so you can understand large collections without manually sifting through files.
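The extract-then-ask flow described above can be illustrated with a minimal, standard-library-only sketch. It is not warc-gpt's code: `html_to_text` and `rank_by_overlap` are hypothetical names, the input is a plain HTML string rather than a real WARC record, and simple keyword overlap stands in for whatever retrieval step the actual pipeline uses.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text(html):
    """Hypothetical helper: reduce an HTML payload to plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def rank_by_overlap(question, passages):
    """Hypothetical stand-in for retrieval: order passages by shared words."""
    question_words = set(question.lower().split())
    return sorted(
        passages,
        key=lambda p: len(question_words & set(p.lower().split())),
        reverse=True,
    )
```

In a full pipeline, the top-ranked passages would be handed to a language model together with the question; that step is omitted here because the listing does not describe it.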

270 stars. No commits in the last 6 months.

Use this if you need to quickly find information or ask questions about content stored within Web ARChive (WARC) files and want to leverage AI for intelligent retrieval.

Not ideal if you primarily need to preserve web pages as they appeared visually, rather than analyze their text content.

Tags: web-archiving, digital-humanities, research-analysis, information-retrieval, historical-research
Status: Stale 6m, No Package, No Dependents
Maintenance 0 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 15 / 25


Stars: 270
Forks: 25
Language: Python
License: MIT
Last pushed: Feb 11, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/rag/harvard-lil/warc-gpt"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
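The same request can be made from Python. The sketch below uses only the standard library; the `category/owner/repo` path layout is an assumption inferred from the single URL shown above, the function names are hypothetical, and the body is returned as raw text because the response format is not stated here.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category, owner, repo):
    """Hypothetical helper: assumes the path is category/owner/repo."""
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category, owner, repo, timeout=10.0):
    """Fetch the response body as text; the format is not assumed."""
    url = build_quality_url(category, owner, repo)
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_quality("rag", "harvard-lil", "warc-gpt")` issues the same GET request as the curl command and counts against the daily request limit.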