biraj21/web-wanderer
A multi-threaded web crawler written in Python, utilizing ThreadPoolExecutor and Playwright to efficiently crawl dynamically rendered web pages and download them.
This tool helps you quickly gather content from many web pages on a website, even those built with modern interactive technologies like JavaScript. You provide a starting web address, and it downloads the visible content of linked pages, saving them into a designated folder on your computer. It's ideal for anyone who needs to collect website content for research, archiving, or analysis.
No commits in the last 6 months.
Use this if you need to download and save content from a website, including those that load information dynamically after the initial page view.
Not ideal if you only need to extract specific data fields rather than entire page content, or if you require advanced data parsing and structuring.
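The approach described above (render each page in a browser, follow its links, fetch pages in parallel) can be sketched roughly as follows. This is an illustrative sketch, not the project's actual code: the names `crawl`, `extract_links`, `render_with_playwright`, `max_pages`, and `workers` are hypothetical, and the same-host link filter and level-by-level traversal are assumptions. Saving pages to a folder is left out for brevity.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free links that stay on base_url's host."""
    parser = LinkParser()
    parser.feed(html)
    host = urlparse(base_url).netloc
    links = set()
    for href in parser.hrefs:
        absolute = urldefrag(urljoin(base_url, href)).url
        if urlparse(absolute).netloc == host:
            links.add(absolute)
    return links


def render_with_playwright(url):
    """Load url in headless Chromium and return the rendered HTML.

    Playwright is imported lazily so the rest of the sketch works without it.
    Each call starts its own Playwright instance because the sync API
    should not be shared across threads (simple, but not the fastest option).
    """
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()


def crawl(start_url, fetch=render_with_playwright, max_pages=20, workers=4):
    """Crawl level by level, fetching each level's pages in a thread pool."""
    seen = {start_url}
    frontier = [start_url]
    pages = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier and len(pages) < max_pages:
            batch = frontier[: max_pages - len(pages)]
            frontier = []
            # pool.map preserves input order, so zip pairs each URL with its HTML.
            for url, html in zip(batch, pool.map(fetch, batch)):
                pages[url] = html
                for link in sorted(extract_links(url, html)):
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
    return pages
```

With Playwright installed (`pip install playwright`, then `playwright install chromium`), `crawl("https://example.com")` would return a dict mapping each visited URL to its rendered HTML. Passing a different `fetch` callable makes it possible to try the traversal logic without a browser.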
Stars: 22
Forks: 1
Language: Python
License: MIT
Category: (not specified)
Last pushed: Nov 30, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/perception/biraj21/web-wanderer"
Open to everyone: 100 requests/day with no key needed, or 1,000/day with a free key.
Higher-rated alternatives
seleniumbase/SeleniumBase: APIs for browser automation, testing, and bypassing bot-detection.
apify/crawlee-python: Crawlee, a web scraping and browser automation library for Python to build reliable crawlers....
intoli/user-agents: A JavaScript library for generating random user agents with data that's updated daily.
apify/crawlee: Crawlee, a web scraping and browser automation library for Node.js to build reliable crawlers. In...
Kaliiiiiiiiii-Vinyzu/patchright: Undetected version of the Playwright testing and automation library.