USCDataScience/sparkler
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
Sparkler is a distributed web crawler built on Apache Spark for collecting web data at scale. You give it a list of seed URLs; it fetches the pages, extracts their content and outlinks, and stores the results for building applications such as search engines or knowledge bases. It suits data engineers, researchers, and anyone who needs to collect, analyze, and store large-scale web content.
420 stars. No commits in the last 6 months.
Use this if you need to build a comprehensive dataset from the web for analytical applications, research, or content aggregation, and require high performance and fault tolerance.
Not ideal if you only need to scrape a small number of pages or require a simple, quick-and-dirty script for occasional data extraction.
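The seed-URL crawl cycle described above (fetch a page, extract its outlinks, enqueue unseen ones) can be sketched as a small breadth-first loop. This is a minimal illustration of the general technique, not Sparkler's implementation; the `fetch` callable and the stubbed `web` dictionary are assumptions standing in for real HTTP fetching and parsing:

```python
from collections import deque

def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl: fetch each page, extract outlinks, enqueue unseen ones."""
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)         # URLs already queued, to avoid refetching
    pages = {}                # url -> content collected so far
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        content, outlinks = fetch(url)  # stands in for the fetch + parse phase
        pages[url] = content
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

# Stub "web" so the example runs without network access (hypothetical URLs).
web = {
    "http://a.example": ("page A", ["http://b.example"]),
    "http://b.example": ("page B", ["http://a.example", "http://c.example"]),
    "http://c.example": ("page C", []),
}
pages = crawl(["http://a.example"], lambda url: web[url])
print(sorted(pages))  # all three pages are reachable from the single seed
```

Sparkler runs this kind of loop as distributed Spark jobs, which is what gives it the throughput and fault tolerance a single-process script lacks.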
Stars
420
Forks
138
Language
Java
License
Apache-2.0
Category
Last pushed
Mar 30, 2023
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/perception/USCDataScience/sparkler"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
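The endpoint in the curl example follows an owner/repo path pattern, so the same call works for other repositories by swapping the last two path segments. A minimal sketch, assuming only the base path shown above (the response schema is not documented here, so the fetch step is left commented out):

```python
# Base path taken from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/perception"

def quality_url(owner, repo):
    """Build the per-repository API URL from an owner and repo name."""
    return f"{API_BASE}/{owner}/{repo}"

print(quality_url("USCDataScience", "sparkler"))
# -> https://pt-edge.onrender.com/api/v1/quality/perception/USCDataScience/sparkler

# To actually fetch (response fields are an unknown; inspect before relying on them):
# import json, urllib.request
# data = json.load(urllib.request.urlopen(quality_url("USCDataScience", "sparkler")))
```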
Related tools
scrapy/scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
Altimis/Scweet
A simple and unlimited Twitter scraper: scrape tweets, likes, retweets, following, followers,...
lexiforest/curl_cffi
Python binding for the curl-impersonate fork via cffi. An HTTP client that can impersonate browser...
plabayo/rama
modular service framework to move and transform network packets
scrapinghub/spidermon
Scrapy Extension for monitoring spiders execution.