currentslab/extractnet
A fork of Dragnet that also extracts the author, headline, date, and keywords from context, with built-in metadata extraction, all in one package.
When you need to collect detailed information from news articles or other web pages, this tool helps you automatically extract the main content, along with specific details like the author, headline, publication date, and keywords. It takes the raw HTML of a webpage and provides you with structured text data. This is ideal for data analysts, researchers, or anyone building an application that needs to process web content without manual effort.
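To make the "raw HTML in, structured data out" idea concrete, here is a minimal standard-library sketch of the metadata side of the task. It is not extractnet's implementation or API: the names `MetaTagParser` and `extract_metadata`, the sample page, and the choice of tags (`og:title`, `author`, `article:published_time`, `keywords`) are assumptions made for illustration. It only reads values a page declares in its `<head>`, whereas the tool described above is said to also find details such as the author within the article text.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Hypothetical helper: collects <title> text and <meta> key/content pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Open Graph tags use "property"; classic tags use "name".
            key = attrs.get("property") or attrs.get("name")
            content = attrs.get("content")
            if key and content:
                # Keep the first value seen for each key.
                self.meta.setdefault(key.lower(), content.strip())

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(raw_html):
    """Return headline, author, date, and keywords declared in the page head."""
    parser = MetaTagParser()
    parser.feed(raw_html)
    meta = parser.meta
    keywords = [k.strip() for k in meta.get("keywords", "").split(",") if k.strip()]
    return {
        "headline": meta.get("og:title") or parser.title.strip() or None,
        "author": meta.get("author") or meta.get("article:author"),
        "date": meta.get("article:published_time"),
        "keywords": keywords,
    }


# Made-up sample page used only to exercise the sketch.
sample_html = """
<html><head>
<title>Example Story | Example News</title>
<meta property="og:title" content="Example Story">
<meta name="author" content="Jane Doe">
<meta property="article:published_time" content="2021-03-04T10:00:00Z">
<meta name="keywords" content="python, scraping, news">
</head><body><p>Body text.</p></body></html>
"""

result = extract_metadata(sample_html)
```

Pages that omit these tags, or fill them with a site name instead of a person, are where a tag-only approach like this falls short and a content-aware extractor becomes useful.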
300 stars. No commits in the last 6 months.
Use this if you need to reliably extract content and metadata from many webpages, especially when standard methods struggle to find details like the true author within the article text.
Not ideal if you primarily need to remove repetitive navigation or advertisement sections (boilerplate) from a webpage, as this tool focuses on extracting the core content and its attributes.
Stars: 300
Forks: 26
Language: HTML
License: MIT
Category:
Last pushed: May 19, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/currentslab/extractnet"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
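The same request can be sketched in Python with the standard library. The helper names `build_quality_url` and `fetch_quality` are hypothetical, and the reading of the URL path as category, owner, and repository is an assumption inferred from the curl command above. The response format is not documented here, so the sketch returns the body as text rather than parsing it.

```python
import urllib.request

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category, owner, repo):
    """Build the endpoint URL; the segment order is assumed from the curl example."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category, owner, repo, timeout=10):
    """Fetch the endpoint and return the raw response body as text."""
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `fetch_quality("nlp", "currentslab", "extractnet")` would issue the same GET request as the curl command, subject to the daily request limit.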
Higher-rated alternatives
chakki-works/seqeval
A Python framework for sequence labeling evaluation (named-entity recognition, POS tagging, etc.)
Hironsan/anago
Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.
jbesomi/texthero
Text preprocessing, representation and visualization from zero to hero.
hamelsmu/ktext
Utilities for preprocessing text for deep learning with Keras
asahi417/tner
Language model fine-tuning on NER with an easy interface and cross-domain evaluation. "T-NER: An...