digitalcortex/72m-domains-dataset

Dataset with unique registered domains extracted from Common Crawl's columnar index (cc-index).

Score: 39 / 100 (Emerging)

This dataset lists over 72 million unique registered domains, compiled from multiple snapshots of the Common Crawl web archive. It gives web researchers and search engine developers a starting point for discovering websites to index or analyze. The output is a raw list of domain names that can be fed directly into a web discovery system.

Use this if you need a very large, deduplicated list of active website domains to kickstart a search engine, build a web crawler, or conduct large-scale web research.

Not ideal if you need detailed information about each domain beyond just its name, or if you require highly targeted domain lists for specific niches.
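As a rough illustration of feeding the raw list into a crawler, the sketch below streams domain names and groups them into fixed-size batches of seed URLs. The one-domain-per-line layout, the file name `domains.txt`, the `https` default, and both helper names are assumptions made for this example; the listing does not describe the dataset's actual file format.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_seed_urls(lines: Iterable[str], scheme: str = "https") -> Iterator[str]:
    """Turn raw domain lines into seed URLs, skipping blank lines.

    Assumes one domain per line (hypothetical layout, not confirmed by the source).
    """
    for line in lines:
        domain = line.strip().lower()
        if not domain:
            continue
        yield f"{scheme}://{domain}/"


def batches(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group an iterable into lists of at most `size` items without loading it all."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Hypothetical usage against a local copy of the list:
#
#     with open("domains.txt", encoding="utf-8") as f:
#         for chunk in batches(iter_seed_urls(f), 10_000):
#             enqueue(chunk)  # `enqueue` stands in for your crawler's frontier
```

Streaming in batches keeps memory use flat, which matters for a list of this size. No deduplication step is included because the listing describes the domains as already unique.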

web-crawling search-engine-optimization internet-research domain-discovery competitive-intelligence
No package, no dependents
Maintenance 10 / 25
Adoption 6 / 25
Maturity 11 / 25
Community 12 / 25

Stars: 19
Forks: 3
Language: (not listed)
License: (not listed)
Category: scraper
Last pushed: Mar 03, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/perception/digitalcortex/72m-domains-dataset"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
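The same request can be made from Python with only the standard library. This is a minimal sketch: the helper names are hypothetical, treating the last two path segments as an owner and repository name is an assumption generalized from the single URL above, and the response is returned as raw text because the listing does not document its format. How an API key is passed is also not stated, so the sketch covers only the keyless call.

```python
import urllib.request
from urllib.parse import quote

# Base path copied from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/perception"


def perception_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for one project (owner/repo pattern is assumed)."""
    return f"{API_BASE}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_perception(owner: str, repo: str, timeout: float = 10.0) -> str:
    """Fetch the raw response body; decoding as UTF-8 is an assumption."""
    with urllib.request.urlopen(perception_url(owner, repo), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Hypothetical usage (performs a network request and counts against the daily quota):
#
#     body = fetch_perception("digitalcortex", "72m-domains-dataset")
#     print(body)
```

If the body turns out to be JSON, `json.loads(body)` would parse it; check a real response before relying on that.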