commoncrawl/news-crawl

News crawling with StormCrawler - stores content as WARC

Score: 55 / 100 (Established)

This tool helps researchers and data scientists systematically collect publicly available news articles from a broad range of online sources. You provide a list of news feeds (RSS/Atom) and news sitemaps, and the tool fetches the content, storing it in standardized WARC files. It's designed for anyone needing a large, structured dataset of news content for analysis or archival purposes.
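The collected content is stored in WARC (Web ARChive, ISO 28500) files, in which each record begins with a version line, a block of colon-separated headers, and a blank line. As a minimal sketch of what that on-disk format looks like (this is illustrative only, not StormCrawler code; in practice a library such as warcio handles this), a record header block can be parsed like so:

```python
# Illustrative sample of a WARC record header block (ISO 28500):
# a version line, colon-separated header fields, then a blank line.
sample = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: https://example.com/news/article\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)

def parse_warc_headers(record: str) -> dict:
    """Parse the version line and header fields of one WARC record."""
    lines = record.split("\r\n")
    headers = {"version": lines[0]}
    for line in lines[1:]:
        if not line:          # blank line ends the header block
            break
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers

parsed = parse_warc_headers(sample)
print(parsed["WARC-Type"])  # response
```

Real WARC files produced by the crawler also carry the fetched payload after the header block, which this sketch ignores.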


Use this if you need to build your own comprehensive archive or dataset of news articles from specific online publishers.

Not ideal if you just need to read a few articles or want pre-packaged news datasets without running your own infrastructure.

Tags: news-archiving, data-collection, media-monitoring, web-crawling, content-acquisition

No package · No dependents

Maintenance: 13 / 25
Adoption: 10 / 25
Maturity: 16 / 25
Community: 16 / 25


Stars: 365
Forks: 39
Language: Java
License: Apache-2.0
Category: scraper
Last pushed: Mar 31, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/perception/commoncrawl/news-crawl"

Open to everyone: 100 requests/day with no key required; a free key raises the limit to 1,000 requests/day.
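The endpoint path follows a predictable `owner/repo` pattern, so a small helper (hypothetical, not part of any official client) can build the URL for any repository before fetching it with your HTTP client of choice:

```python
# Hypothetical helper: build the quality-perception API URL for a repo.
# The base URL is taken from the curl example above; the helper itself
# is an illustrative convenience, not an official client.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/perception"

def perception_url(owner: str, repo: str) -> str:
    """Return the API URL for a given GitHub owner/repo pair."""
    return f"{API_BASE}/{owner}/{repo}"

print(perception_url("commoncrawl", "news-crawl"))
# https://pt-edge.onrender.com/api/v1/quality/perception/commoncrawl/news-crawl
```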