commoncrawl/news-crawl
News crawling with StormCrawler - stores content as WARC
This tool helps researchers and data scientists systematically collect publicly available news articles from a broad range of online sources. You provide a list of news feeds (RSS/Atom) and news sitemaps; the tool fetches the linked content and stores it in standardized WARC files. It's designed for anyone who needs a large, structured dataset of news content for analysis or archival purposes.
Use this if you need to build your own comprehensive archive or dataset of news articles from specific online publishers.
Not ideal if you just need to read a few articles or want pre-packaged news datasets without running your own infrastructure.
Stars
365
Forks
39
Language
Java
License
Apache-2.0
Category
Last pushed
Mar 31, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/perception/commoncrawl/news-crawl"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
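The curl command above can also be issued from Python. The sketch below is a minimal example using only the standard library; it assumes the endpoint returns a JSON body (the response schema is not documented here, so the parsed result is returned as-is):

```python
import json
from urllib.request import urlopen

# Endpoint from the API snippet above
API = "https://pt-edge.onrender.com/api/v1/quality/perception/commoncrawl/news-crawl"

def fetch_repo_quality(url: str = API) -> dict:
    """Fetch quality/perception data for the repo.

    Assumes a JSON response; inspect the raw body first if unsure.
    """
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

Within the free tier (100 requests/day without a key), this can be called directly, e.g. `data = fetch_repo_quality()`.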
Related tools
scrapy/scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
Altimis/Scweet
A simple and unlimited Twitter scraper: scrape tweets, likes, retweets, following, followers, ...
lexiforest/curl_cffi
Python binding for the curl-impersonate fork via cffi. An HTTP client that can impersonate browser...
plabayo/rama
Modular service framework to move and transform network packets.
scrapinghub/spidermon
Scrapy Extension for monitoring spiders execution.