commoncrawl/news-crawl
News crawling with StormCrawler - stores content as WARC
This tool helps researchers and data scientists systematically collect publicly available news articles from a broad range of online sources. You provide a list of news feeds (RSS/Atom) and news sitemaps; the tool fetches the linked content and stores it in standardized WARC files. It's designed for anyone who needs a large, structured dataset of news content for analysis or archival purposes.
Use this if you need to build your own comprehensive archive or dataset of news articles from specific online publishers.
Not ideal if you just need to read a few articles or want pre-packaged news datasets without running your own infrastructure.
Stars
365
Forks
39
Language
Java
License
Apache-2.0
Category
Last pushed
Mar 31, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/perception/commoncrawl/news-crawl"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
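The curl command above can also be issued from Python. The sketch below is a minimal example using only the standard library; it assumes the endpoint returns a JSON body (the response schema is not documented here, so the parsed result is returned as-is):

```python
import json
from urllib.request import urlopen

# Endpoint from the API snippet above
API = "https://pt-edge.onrender.com/api/v1/quality/perception/commoncrawl/news-crawl"

def fetch_repo_quality(url: str = API) -> dict:
    """Fetch quality/perception data for the repo.

    Assumes a JSON response; inspect the raw body first if unsure.
    """
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

Within the free tier (100 requests/day without a key), this can be called directly, e.g. `data = fetch_repo_quality()`.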
Related tools
scrapy/scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
Altimis/Scweet
A simple and unlimited Twitter scraper: scrape tweets, likes, retweets, following, followers, ...
lexiforest/curl_cffi
Python binding for the curl-impersonate fork via cffi. An HTTP client that can impersonate browser...
plabayo/rama
Modular service framework to move and transform network packets.
scrapinghub/spidermon
Scrapy Extension for monitoring spiders execution.