sangaline/wayback-machine-scraper

A command-line utility and Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.

Score: 57 / 100 (Established)

This tool helps you gather historical versions of websites directly from the Internet Archive's Wayback Machine. You provide a website address and a time range, and it downloads all available webpage content for that period. This is useful for researchers, analysts, or anyone tracking how websites have evolved over time.
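The "website address plus time range" idea can be illustrated with the Wayback Machine's public CDX index, which lists captures of a URL between two timestamps. The sketch below is not this tool's own code: the function names `cdx_query_url` and `snapshot_url` are hypothetical, and it only builds URLs (no network access), assuming the CDX server's `url`, `from`, `to`, `output`, and `fl` query parameters.

```python
from urllib.parse import urlencode

# Public CDX index endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(site, start, end):
    """Build a CDX query that lists captures of `site` between two dates.

    `start` and `end` are timestamp strings such as "20200101" (YYYYMMDD).
    Hypothetical helper for illustration only.
    """
    params = {
        "url": site,
        "from": start,
        "to": end,
        "output": "json",            # ask for JSON rows instead of plain text
        "fl": "timestamp,original",  # return only these two fields
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def snapshot_url(timestamp, original):
    """Build the URL of one archived capture.

    The `id_` suffix requests the page as originally captured,
    without the Wayback Machine's injected toolbar.
    """
    return f"https://web.archive.org/web/{timestamp}id_/{original}"


if __name__ == "__main__":
    print(cdx_query_url("example.com", "20200101", "20201231"))
    print(snapshot_url("20200101000000", "http://example.com/"))
```

Each row returned by such a query corresponds to one capture; fetching every row's snapshot URL yields the kind of time series of raw HTML described above.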

474 stars. No commits in the last 6 months. Available on PyPI.

Use this if you need to capture a website's content as it appeared at various points in the past, especially for sites that are difficult to scrape directly.

Not ideal if you need to extract specific data fields from the archived pages, as this tool only saves the raw HTML content, not parsed data.

web-archiving historical-data-collection competitive-intelligence digital-forensics market-research
Status: stale (6 months)
Maintenance 0 / 25
Adoption 10 / 25
Maturity 25 / 25
Community 22 / 25


Stars: 474
Forks: 82
Language: Python
License: ISC
Category: scraper
Last pushed: Feb 23, 2024
Commits (30d): 0
Dependencies: 4

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/perception/sangaline/wayback-machine-scraper"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
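For scripted use, the same request can be made from Python's standard library. This is a minimal sketch under stated assumptions: the helper names `quality_url` and `fetch_quality` are hypothetical, the `owner/repo` path pattern is inferred from the single URL above, and the response is assumed to be JSON (its fields are not documented here, so none are accessed).

```python
import json
from urllib.request import Request, urlopen

# Base path taken from the curl example; the owner/repo pattern is an assumption.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/perception"


def quality_url(owner, repo):
    """Build the endpoint URL for one repository (hypothetical helper)."""
    return f"{API_BASE}/{owner}/{repo}"


def fetch_quality(owner, repo, timeout=10.0):
    """Fetch the record for a repository and decode it.

    Assumes the endpoint returns a JSON body; raises on HTTP errors
    (for example, when the daily request limit is exceeded).
    """
    request = Request(quality_url(owner, repo), headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Performs a real network request when run directly.
    print(fetch_quality("sangaline", "wayback-machine-scraper"))
```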