Web Scraping NLP Pipelines NLP Tools

End-to-end systems that combine web scraping with NLP analysis (sentiment, readability, topic modeling, entity extraction) on text extracted from websites, articles, or online sources. Does NOT include standalone scraping tools, NLP libraries, or applications that only perform analysis without web data extraction.

There are 96 web scraping NLP pipeline tools tracked. One scores above 70 (Verified tier). The highest-rated is flairNLP/fundus at 72/100 with 443 stars.

Get all 96 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=web-scraping-nlp-pipelines&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
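The same request can be made from a script. The following is a minimal Python sketch using only the standard library; the endpoint and query parameters are copied from the curl command above, while the function names `build_url` and `fetch_projects` are hypothetical helpers introduced for illustration. The response schema is not described on this page, so the sketch returns the decoded JSON as-is without assuming any field names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint copied from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="nlp", subcategory="web-scraping-nlp-pipelines", limit=20):
    """Build the query URL; the defaults mirror the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Fetch the URL and return the decoded JSON payload unchanged.

    The response schema is not documented here, so no fields are assumed.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_projects(build_url())` performs the network request and counts against the daily request limit, so it is worth caching the result locally when iterating on an analysis.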

# Tool Score Tier
1 flairNLP/fundus

A very simple news crawler with a funny name

72
Verified
2 fhamborg/news-please

news-please - an integrated web crawler and information extractor for news...

61
Established
3 affjljoo3581/canrevan

A library for collecting large volumes of Naver news articles.

53
Established
4 FreeDiscovery/FreeDiscovery

Web Service for E-Discovery Analytics

53
Established
5 tirthajyoti/Web-Database-Analytics

Web scraping and related analytics using Python tools

51
Established
6 Multiverse-of-Projects/NewsAI

A dynamic NewsAI dashboard that uses NLP to analyze news articles, visualize...

49
Emerging
7 rajaswa/DRIFT

DRIFT is a tool for Diachronic Analysis of Scientific Literature.

46
Emerging
8 smyja/blackmaria

Python package for web scraping in natural language

46
Emerging
9 MasuRii/FBScrapeIdeas

Modern CLI tool for scraping & analyzing Facebook groups using Playwright &...

45
Emerging
10 kevalmorabia97/SEDTWik-Event-Detection-from-Tweets

Segmentation based event detection from Tweets. Published at NAACL SRW 2019

43
Emerging
11 uhh-lt/newsleak

Information extraction and interactive visualization of textual datasets for...

43
Emerging
12 vipul-sharma20/sharingan

Tool to extract news articles from newspaper and give the context about the news

42
Emerging
13 sandeep-sandhu/NewsLookout

The NewsLookout web scraping application with NLP and data pre-processing

41
Emerging
14 FinnishCancerRegistry/gleason_extraction_py

Extract Gleason scores from texts.

40
Emerging
15 uscensusbureau/SABLE

Scraping Assisted by Learning

39
Emerging
16 ahmedbesbes/How-to-mine-newsfeed-data-and-extract-interactive-insights-in-Python

A practical guide to topic mining and interactive visualizations

39
Emerging
17 Sotera/watchman

Watchman: An open-source social-media event-detection system

38
Emerging
18 nawaz-kmr/Data_Extraction_and_Text_Analysis_for_Blackcoffer_company.

The objective of this assignment is to extract textual data articles from...

36
Emerging
19 VIDA-NYU/domain_discovery_API

Domain Discovery Operations API formalizes the human domain discovery...

35
Emerging
20 scrapegoat/scrapegoat

Scrape Data in One-shot.

35
Emerging
21 Just-Helpful/preventable-deaths-scraper

Web scraper, written for the Preventable Deaths website, with emphasis on...

34
Emerging
22 nakuleshj/news-nlp-pipeline

A fully serverless, event-driven data pipeline that ingests, enriches,...

34
Emerging
23 Jasiri-App/datagpu

DataGPU is an open-source data compiler for AI pipelines that helps you...

34
Emerging
24 networkdynamics/seldonite

A News Article Collection Library

33
Emerging
25 victoria217-bottino/google-news-scraper

A Python tool to fetch, decode, and process...

33
Emerging
26 lkstrp/newspaper-scraper

The all-in-one Python package for seamless newspaper article indexing,...

33
Emerging
27 nostoz/news_monitor

Real time news monitor aggregating from various sources based on keywords

31
Emerging
28 gangula-karthik/KAKI-App

A web app uniting everyone for big wins and a greener Singapore! 🚀🌳

31
Emerging
29 ZIADEA/SmartWebScraper-CV

SmartWebScraper-CV – AI-Powered Web Page Zone Detection SmartWebScraper-CV...

31
Emerging
30 SakuraPuare/ZhiHu_Spider

Web scraper for Zhihu content extraction

31
Emerging
31 ntddk/peeling-onions

A repository to store Deep Web (onion domain) crawler, scraper, and NLP...

30
Emerging
32 BioinfoNet/Data-mining

Data mining to discover trends in Open Science in Kenya

30
Emerging
33 jasp9559/Web-Scraping-of-Indian-Judgements

Web scraping project for scraping the latest/most recent judgement taken on the day

30
Emerging
34 antoninfaure/rssTrends

Finding Topics in French News using RSS Feeds

30
Emerging
35 sodalabsio/event-detection-extraction

Repository for QA-based event detection and extraction from news and social media.

29
Experimental
36 susannapaoli/web-scraper-nyt

New York Times Scraper

29
Experimental
37 aybarskerem/WebScraper

This repo contains various web scrapers for different sites and processes the...

29
Experimental
38 GateNLP/wpextract

Create datasets from WordPress sites for research or archiving

29
Experimental
39 bhx98/NameAnalysis

Choosing a company name by analyzing the most used keywords in the field and...

29
Experimental
40 jpwahle/cs-insights-crawler

This repository implements the interaction with DBLP, information extraction...

28
Experimental
41 dobbersc/fundus-evaluation

[ACL 2024] Evaluation of the Fundus News Scraper

28
Experimental
42 Atharv279/Task-Extraction-NLP

NLP-based Task Extraction & Categorization | This project extracts tasks...

27
Experimental
43 Awakumori/NGAspider

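Crawler tool for the NGA forum (National Geographic of Azeroth). Uses multithreaded collection and MongoDB storage, and integrates PaddlePaddle for NLP. Integrates Baidu Jieyu for entity recognition, updated NLP...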

27
Experimental
44 WISETICT-PPAM/Data-Analytics

Product information crawling and review text mining

26
Experimental
45 agi-templar/MediaCloudDataDownloader

Download full-length articles from media outlets.

26
Experimental
46 balaurian/fx_news_scraper

A scraper for investing.com forex news using beautifulsoup and nltk. It also...

26
Experimental
47 dukeblue1994-glitch/chronicle

Intelligent event detection system using semantic embeddings, MinHash LSH...

25
Experimental
48 someoneorlov/styx

ML News Analysis Service

25
Experimental
49 AmmarRashed/EventOrient

A web-based application for monitoring, analyzing and visualizing social...

25
Experimental
50 Kamomille/WebScrapping_Supermarket

Supermarket cost analysis

25
Experimental
51 zer0Percent/OhWowBREAKINGNews

A multithreaded scraper to retrieve and parse news articles.

25
Experimental
52 samuelhatcliff/newstracker

News Tracker is an application designed to enhance and optimize the way that...

24
Experimental
53 nivaangupta/news-website

A news website that provides summarised news on trending topics, popular...

22
Experimental
54 stkisengese/news-intelligence-nlp-platform

A Python-based NLP platform for scraping, analyzing, and enriching news...

22
Experimental
55 georgiarichards/preventabledeathstracker

Code for running the Preventable Deaths Tracker website

22
Experimental
56 MANISH007700/NewsArticleExtraction

Extraction of News Article from different News Web Pages using feedparser...

22
Experimental
57 asaifuddin18/Search-Engine-Data-Collector

Summer '21 research project under Forward Data Lab group. Django website...

21
Experimental
58 stuartemiddleton/floraguard_crawler

FloraGuard crawler for online forums and marketplaces around the illegal...

20
Experimental
59 moehmeni/ezweb

Easy to use web page analyzer

20
Experimental
60 umutkavakli/sikayetvar-scraping

A scraping tool for customer complaints of specified brands to use in NLP tasks.

19
Experimental
61 satyampandey1411/SAT-News-Analyser

SAT News Analyser is a web application offering in-depth news article...

19
Experimental
62 b-i-king/Top_News_Twitter_Bot_Template

Twitter Bot Template

19
Experimental
63 javiermascarena/footy-narratives

Automated weekly storylines and topic summaries for the “Big Six” English...

18
Experimental
64 Anonym0usWork1221/python-code-docstring-scraper

A multi-threaded GitHub scraper to collect Python code with docstrings from...

18
Experimental
65 utkarsh512/CreateDebateScraper

Scraping debates from the CreateDebate forum

18
Experimental
66 Biswas-N/Norman-PD-incidents-extractor

Python based utility to create Norman Police Department's incident dataset...

17
Experimental
67 LiliValGo/NLP-for-IPCC-Climate-Reports

This project combines web scraping, PDF processing, and Natural Language...

17
Experimental
68 ArpitaChatterjee/Routine-Analysis-of-a-Comedian

Build a dataset using the transcript for the 10 popular comedians, using web...

17
Experimental
69 Onaga08/scrape-and-sense

A comprehensive script for web scraping and NLP analysis, providing detailed...

17
Experimental
70 doinakis/Real-Time-News-Assistant

Real Time News Assistant for Greek news.

17
Experimental
71 eyereece/nlp-text-mining-dashboard

NLP text mining dashboard to explore current trends and extract most used...

15
Experimental
72 J-TECH-bot/Blackcoffer_Data_Extraction_NLP

This repository showcases data-driven text analytics using NLP techniques....

13
Experimental
73 estefaniagPerez/net-analyzer-sna-nlp-analysis

This project (ReactJS and Python) combines Social Network Analysis (SNA) and...

12
Experimental
74 DolbyUUU/event-timeline-generation-olympics

A toy system for generating event timelines from social media data,...

12
Experimental
75 SaltyGod/Text-Data-Mining

A standard end-to-end project for text crawling and in-depth mining analysis

12
Experimental
76 adityamangal1/Web-Scraping

web data extraction

12
Experimental
77 AtulJoshi1/ProductDescription2Keywords

Extracting Search Engine Appropriate Keywords and Key Selling Points from a...

12
Experimental
78 nikitaprasad21/Data-Extraction-and-NLP

Performed Data Extraction and NLP Analysis

12
Experimental
79 IshtyM/Data-Extraction-and-Text-Analytics

Text Analysis that includes extraction of word count, Positive Score,...

11
Experimental
80 ElfatihZiad/BBCNews-scraper-nlp

A data pipeline to extract News articles from BBC News, storing it to...

11
Experimental
81 vansh-py04/Data-Extraction-and-Text-Analysis

The objective of this assignment is to extract textual data articles from...

11
Experimental
82 pranjal-pravesh/web-article-analyzer

A comprehensive text analysis system that performs web scraping, sentiment...

11
Experimental
83 QuhiQuhihi/news_analysis

Crawls news data and extracts keywords from articles

11
Experimental
84 DRSarcenoR/fetchNews

Streamlit application that, given a prompt (a name is expected), shows...

11
Experimental
85 Mreeb/TOpic_name_eXtraction

Department of Justice 2009-2018 Press Releases Data and reading Analysing...

11
Experimental
86 kshitijbhandari/Web-Scraping-and-text-analysis

NLP pipeline to scrape 114 articles using BeautifulSoup and compute 13...

11
Experimental
87 Haimonmon/snippy

A book-scraping bot that is able to give you book data, but be cautious as...

11
Experimental
88 AnFrBo/internet_censorship

Analysis of the State of Internet Censorship in the United Kingdom Using...

11
Experimental
89 crackalamoo/web-nlp-scraper

A command line tool to quickly run natural language processing (NLP)...

11
Experimental
90 yashvardhanv/Atomic-news3.0

Upgraded version of AtomicNews2.0 with login/signup features.

10
Experimental
91 rogerchang1108/Cambridge-Dictionary-Web-Scraper

In this project, we employ the BeautifulSoup4 package in Python Jupyter...

10
Experimental
92 tasozgurcem11/eksi-analysis

Collect and analyze eksi forum public entries

10
Experimental
93 mccormd1/RandM_Transcript_Sentiment_Analysis

Various HTML scraping and NLP techniques applied to Rick & Morty transcripts.

10
Experimental
94 liuzl/newsmth

A Go crawler for newsmth.net

10
Experimental
95 solinode/narratix

tuned to the noise before it becomes signal.

10
Experimental
96 krishgoyal0/BookMyShow_event_scrapper_automation

This is a project made for automating data scraping from a particular Event...

10
Experimental