emiruz/textextract
textextract is a tiny library (87 lines of Go) that identifies where the article content is in an HTML page (as opposed to navigation, headers, footers, ads, etc.), extracts it, and returns it as a string. Like Boilerpipe, but written in Go for Go programs.
This library helps web crawlers and content aggregators by identifying and extracting only the main article content from an HTML page. It takes raw HTML as input and returns the article as a clean plain-text string, without navigation, ads, or footers. This suits developers building tools that analyze or process web content and do not want the surrounding page clutter.
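To make the idea of "HTML in, article text out" concrete, here is a minimal sketch of one naive way to do it with only the Go standard library. It is not textextract's algorithm or API: the function name `extractNaive`, the `minWords` threshold, and the "more than half of the words are link text" rule are all assumptions chosen for illustration. The sketch splits the page at block-level tags, then keeps blocks that have enough words and are not dominated by links.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

var (
	// Script and style elements are dropped together with their contents.
	reScript = regexp.MustCompile(`(?is)<script\b.*?</script>`)
	reStyle  = regexp.MustCompile(`(?is)<style\b.*?</style>`)
	// Block-level tags are used as split points between candidate blocks.
	reBlock = regexp.MustCompile(`(?i)</?(p|div|li|ul|ol|nav|header|footer|article|section|h[1-6]|td|tr|table|br)\b[^>]*>`)
	// Anchor elements; group 1 captures the link text.
	reLink = regexp.MustCompile(`(?is)<a\b[^>]*>(.*?)</a>`)
	// Any remaining tag.
	reTag = regexp.MustCompile(`<[^>]*>`)
)

// plainText strips tags, decodes entities, and collapses whitespace.
func plainText(fragment string) string {
	text := html.UnescapeString(reTag.ReplaceAllString(fragment, " "))
	return strings.Join(strings.Fields(text), " ")
}

// extractNaive (hypothetical name) keeps blocks with at least minWords words
// in which no more than half of the words sit inside links.
func extractNaive(page string, minWords int) string {
	page = reScript.ReplaceAllString(page, " ")
	page = reStyle.ReplaceAllString(page, " ")

	var kept []string
	for _, block := range reBlock.Split(page, -1) {
		text := plainText(block)
		words := len(strings.Fields(text))
		if words < minWords {
			continue // too short to be article prose
		}
		linkWords := 0
		for _, m := range reLink.FindAllStringSubmatch(block, -1) {
			linkWords += len(strings.Fields(plainText(m[1])))
		}
		if linkWords*2 > words {
			continue // mostly link text, likely navigation
		}
		kept = append(kept, text)
	}
	return strings.Join(kept, "\n")
}

func main() {
	page := `<html><body><nav><a href="/">Home</a> <a href="/about">About</a></nav>
<p>Go is an open source programming language that makes it simple to build reliable software.</p>
<footer>Copyright 2018</footer></body></html>`
	fmt.Println(extractNaive(page, 5))
}
```

Regex-based splitting is fragile on real-world markup; a production extractor would normally walk a parsed DOM instead. The sketch only shows the shape of the input and output described above.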
No commits in the last 6 months.
Use this if you are a developer building a system that needs to perform semantic analysis, classification, or any text-based processing on web article content.
Not ideal if you need to preserve the original HTML formatting, whitespace, or non-textual elements of the web page.
Stars: 11
Forks: 2
Language: Go
License: MIT
Category
Last pushed: Oct 15, 2018
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/emiruz/textextract"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
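The same request can be made from Go. The sketch below is a rough equivalent of the curl command above and makes two assumptions: that the three trailing path segments are category, owner, and repository (inferred from the URL, not documented here), and that the response can be treated as opaque bytes, since its schema is not shown on this page. The helper names `qualityURL` and `fetchQuality` are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// Base of the endpoint shown in the curl command.
const baseURL = "https://pt-edge.onrender.com/api/v1/quality"

// qualityURL (hypothetical helper) joins the path segments onto the base URL.
// Reading them as category/owner/repo is an assumption based on the example.
func qualityURL(category, owner, repo string) string {
	return baseURL + "/" + url.PathEscape(category) +
		"/" + url.PathEscape(owner) +
		"/" + url.PathEscape(repo)
}

// fetchQuality performs the GET and returns the raw body without decoding it,
// because the response format is not specified here.
func fetchQuality(client *http.Client, endpoint string) ([]byte, error) {
	resp, err := client.Get(endpoint)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	body, err := fetchQuality(client, qualityURL("nlp", "emiruz", "textextract"))
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println(string(body))
}
```

With the keyless limit of 100 requests/day, a caller polling many repositories would want to cache responses rather than refetch on every run.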
Higher-rated alternatives
ikawaha/kagome-dict
Dictionary Library for Kagome v2
aaaton/golem
A lemmatizer implemented in Go
habeanf/yap
Yet Another (natural language) Parser
clipperhouse/uax29
A tokenizer based on Unicode text segmentation (UAX #29), for Go. Split graphemes, words, sentences.
abadojack/whatlanggo
Natural language detection library for Go