digitalcortex/72m-domains-dataset
Dataset with unique registered domains extracted from Common Crawl's columnar index (cc-index).
This dataset provides over 72 million unique registered website domains compiled from multiple snapshots of the Common Crawl web archive. It gives web researchers and search engine developers a comprehensive starting point for crawling operations: a raw list of domain names that can be fed directly into your web discovery systems to find new websites to index or analyze.
Use this if you need a very large, deduplicated list of active website domains to kickstart a search engine, build a web crawler, or conduct large-scale web research.
Not ideal if you need detailed information about each domain beyond just its name, or if you require highly targeted domain lists for specific niches.
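If you plan to feed the list into a crawler, the sketch below shows one way to turn each domain into a seed URL. It is only a sketch: the file name domains.txt and the one-domain-per-line format are assumptions, not documented properties of the dataset.

# Minimal sketch: build crawl seed URLs from a plain-text domain list.
# Assumes one registered domain per line; "domains.txt" is a hypothetical file name.
def load_seed_urls(path: str) -> list[str]:
    seeds = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            domain = line.strip().lower()
            if domain:
                seeds.append(f"https://{domain}/")
    return seeds

if __name__ == "__main__":
    urls = load_seed_urls("domains.txt")
    print(f"{len(urls)} seed URLs ready for the crawl frontier")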
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/perception/digitalcortex/72m-domains-dataset"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
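If you prefer Python over curl, a minimal equivalent of the request above looks like this; the response format is not documented here, so the sketch only prints the status code and a preview of the body.

# Minimal Python equivalent of the curl command above (no API key, standard library only).
import urllib.request

URL = "https://pt-edge.onrender.com/api/v1/quality/perception/digitalcortex/72m-domains-dataset"

with urllib.request.urlopen(URL, timeout=30) as resp:
    body = resp.read().decode("utf-8", errors="replace")
    # Print the HTTP status and the first 500 characters of the payload.
    print(resp.status, body[:500])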
Higher-rated alternatives
scrapy/scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
Altimis/Scweet
A simple and unlimited Twitter scraper: scrape tweets, likes, retweets, following, followers, ...
lexiforest/curl_cffi
Python binding for the curl-impersonate fork via cffi. An HTTP client that can impersonate browser...
plabayo/rama
Modular service framework to move and transform network packets.
scrapinghub/spidermon
Scrapy extension for monitoring spiders' execution.