digitalcortex/72m-domains-dataset
Dataset with unique registered domains extracted from Common Crawl's columnar index (cc-index).
This dataset provides over 72 million unique registered website domains compiled from multiple snapshots of the Common Crawl web archive. It gives web researchers and search engine developers a comprehensive starting point for crawling operations: a raw list of domain names that can be fed directly into your web discovery systems to find new websites to index or analyze.
Use this if you need a very large, deduplicated list of active website domains to kickstart a search engine, build a web crawler, or conduct large-scale web research.
Not ideal if you need detailed information about each domain beyond just its name, or if you require highly targeted domain lists for specific niches.
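If you plan to feed the list into a crawler, the sketch below shows one way to turn each domain into a seed URL. It is only a sketch: the file name domains.txt and the one-domain-per-line format are assumptions, not documented properties of the dataset.

# Minimal sketch: build crawl seed URLs from a plain-text domain list.
# Assumes one registered domain per line; "domains.txt" is a hypothetical file name.
def load_seed_urls(path: str) -> list[str]:
    seeds = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            domain = line.strip().lower()
            if domain:
                seeds.append(f"https://{domain}/")
    return seeds

if __name__ == "__main__":
    urls = load_seed_urls("domains.txt")
    print(f"{len(urls)} seed URLs ready for the crawl frontier")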
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/perception/digitalcortex/72m-domains-dataset"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
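If you prefer Python over curl, a minimal equivalent of the request above looks like this; the response format is not documented here, so the sketch only prints the status code and a preview of the body.

# Minimal Python equivalent of the curl command above (no API key, standard library only).
import urllib.request

URL = "https://pt-edge.onrender.com/api/v1/quality/perception/digitalcortex/72m-domains-dataset"

with urllib.request.urlopen(URL, timeout=30) as resp:
    body = resp.read().decode("utf-8", errors="replace")
    # Print the HTTP status and the first 500 characters of the payload.
    print(resp.status, body[:500])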
Higher-rated alternatives
scrapy/scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
Altimis/Scweet
A simple and unlimited Twitter scraper: scrape tweets, likes, retweets, following, followers, ...
lexiforest/curl_cffi
Python binding for the curl-impersonate fork via cffi. An HTTP client that can impersonate browser...
plabayo/rama
Modular service framework to move and transform network packets.
scrapinghub/spidermon
Scrapy extension for monitoring spiders' execution.