USCDataScience/sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

Score: 51 / 100 (Established)

Sparkler is a web crawling tool designed for collecting vast amounts of web data efficiently. You provide it with a list of starting URLs, and it systematically fetches web pages, extracts their content, and makes it available for building applications like search engines or knowledge bases. This tool is ideal for data engineers, researchers, or anyone needing to collect, analyze, and store large-scale web content.
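As a rough sketch of a typical run (the commands follow the project's README; exact flags can vary between releases), you inject seed URLs to register a crawl job, then run the crawler against the job id it prints:

# Inject seed URLs; this registers a crawl job and prints its id
$ bin/sparkler.sh inject -su http://www.bbc.com/news

# Run the crawl for that job, e.g. two iterations deep
$ bin/sparkler.sh crawl -id <job-id> -i 2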

420 stars. No commits in the last 6 months.

Use this if you need to build a comprehensive dataset from the web for analytical applications, research, or content aggregation, and require high performance and fault tolerance.

Not ideal if you only need to scrape a small number of pages or require a simple, quick-and-dirty script for occasional data extraction.

Topics: web-data-collection, data-engineering, information-retrieval, search-engine-development, big-data-analytics

Status flags: Stale (no activity in 6 months), No Package, No Dependents

Score breakdown (the four components sum to the overall 51 / 100):
Maintenance: 0 / 25
Adoption: 10 / 25
Maturity: 16 / 25
Community: 25 / 25


Stars: 420
Forks: 138
Language: Java
License: Apache-2.0
Category: scraper
Last pushed: Mar 30, 2023
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/perception/USCDataScience/sparkler"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
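To eyeball the JSON payload before wiring it into anything, you can pretty-print it with jq (assuming jq is installed; the response schema is not documented here, so inspect it before relying on specific field names):

curl -s "https://pt-edge.onrender.com/api/v1/quality/perception/USCDataScience/sparkler" | jq '.'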