touero/ctenopharyngodon-idella
Use MapReduce's Java interface to crawl data on Chinese universities in a distributed fashion and learn the basics of HDFS.
This project helps you gather publicly available data about Chinese universities, such as information often found on websites like '掌上高考' (Gaokao.cn). It takes URLs of university pages and extracts structured data, which can then be used for analysis or database population. This tool is for data engineers or researchers who need to collect large datasets from the web, specifically about educational institutions in China.
134 stars. No commits in the last 6 months.
Use this if you need to systematically collect and store comprehensive data from multiple Chinese university websites, especially those that use JavaScript to load content.
Not ideal if you're looking to crawl data from websites that are not Chinese universities, or if you don't have experience setting up and managing a distributed computing environment like Hadoop.
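To make the MapReduce idea in the description concrete, here is a minimal plain-Java sketch of the map, shuffle, and reduce phases applied to a list of page URLs. It is an illustration only: it does not use Hadoop, it is not code from the repository, and the class name, method names, and sample URL paths are hypothetical.

```java
import java.net.URI;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the map -> shuffle -> reduce flow.
// No Hadoop dependency; names and sample URLs are hypothetical.
class UrlMapReduceSketch {

    // Map phase: emit one (host, url) pair per input URL.
    static List<Map.Entry<String, String>> map(List<String> urls) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host != null) {
                pairs.add(new SimpleEntry<>(host, url));
            }
        }
        return pairs;
    }

    // Shuffle phase: group the emitted values by key, with keys kept sorted.
    static Map<String, List<String>> shuffle(List<Map.Entry<String, String>> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse each group to a single value (here, a page count).
    static Map<String, Integer> reduce(Map<String, List<String>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((host, pages) -> counts.put(host, pages.size()));
        return counts;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "https://www.gaokao.cn/school/1",    // hypothetical path
            "https://www.gaokao.cn/school/2",    // hypothetical path
            "https://university.example/about"); // placeholder domain
        System.out.println(reduce(shuffle(map(urls))));
    }
}
```

In a real Hadoop job the map and reduce steps would live in Mapper and Reducer subclasses and the framework would perform the shuffle across machines. The sketch keeps all three in one process so the data flow is easy to follow.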
Stars: 134
Forks: 1
Language: Java
License: Apache-2.0
Category:
Last pushed: Oct 16, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/perception/touero/ctenopharyngodon-idella"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
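The same request can be sketched in Java with the standard `java.net.http` client (Java 11+). The URL prefix is copied from the curl command above. Treating the last two path segments as owner and repository name is an assumption, as are the class and method names. The page does not document the response format, so the body is printed as-is.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper; only the URL prefix comes from the curl example.
class QualityApiSketch {

    static final String BASE =
        "https://pt-edge.onrender.com/api/v1/quality/perception/";

    // Assumption: the path ends in <owner>/<repo>, as in the curl example.
    static URI endpoint(String owner, String repo) {
        return URI.create(BASE + owner + "/" + repo);
    }

    // Sends a GET request and returns the raw response body.
    static String fetch(String owner, String repo)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(endpoint(owner, repo))
            .GET()
            .build();
        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fetch("touero", "ctenopharyngodon-idella"));
    }
}
```

With a limit of 100 keyless requests per day, a caller would likely want to cache responses instead of fetching on every run.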
Higher-rated alternatives
scrapy/scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
Altimis/Scweet
A simple and unlimited twitter scraper : scrape tweets, likes, retweets, following, followers,...
lexiforest/curl_cffi
Python binding for curl-impersonate fork via cffi. A http client that can impersonate browser...
plabayo/rama
modular service framework to move and transform network packets
scrapinghub/spidermon
Scrapy Extension for monitoring spiders execution.