emiruz/textextract

textextract is a tiny library (87 lines of Go) that identifies where the article content is in an HTML page (as opposed to navigation, headers, footers, ads, etc.), extracts it, and returns it as a string. Like Boilerpipe, but for Go and written in Go.

33 / 100 (Emerging)

This library helps web crawlers and content aggregators by identifying and extracting only the main article content from an HTML page. It takes raw HTML as input and returns the article as a clean, plain-text string, free of navigation, ads, and footers. This suits developers building tools that need to analyze or process web content without that surrounding clutter.
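To make the idea of separating article text from boilerplate concrete, here is a minimal sketch in Go of a Boilerpipe-style heuristic: split the page into blocks, then keep blocks that have enough words and a low share of link text. This is an illustration only and is not textextract's actual algorithm or API. The function name `ExtractText` and the parameters `minWords` and `maxLinkDensity` are hypothetical, and the regular-expression tag handling is a simplification that does not decode HTML entities or cope with malformed markup.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// Script and style elements never contain article text.
	dropped = regexp.MustCompile(`(?is)<(script|style)\b.*?</(script|style)>`)
	// Block-level tags whose boundaries split the page into candidate blocks.
	blockSplit = regexp.MustCompile(`(?i)</?(?:p|div|li|ul|ol|nav|header|footer|section|article|h[1-6]|td|tr|table|br)\b[^>]*>`)
	// Anchor elements; group 1 is the link text.
	anchor = regexp.MustCompile(`(?is)<a\b[^>]*>(.*?)</a>`)
	// Any remaining tag.
	anyTag = regexp.MustCompile(`(?s)<[^>]*>`)
)

// ExtractText is a hypothetical helper (not part of textextract). It keeps
// blocks with at least minWords words whose link density, the fraction of
// words inside <a> elements, does not exceed maxLinkDensity.
func ExtractText(page string, minWords int, maxLinkDensity float64) string {
	page = dropped.ReplaceAllString(page, " ")

	var kept []string
	for _, block := range blockSplit.Split(page, -1) {
		// Count the words that appear inside links in this block.
		linkWords := 0
		for _, m := range anchor.FindAllStringSubmatch(block, -1) {
			linkWords += len(strings.Fields(anyTag.ReplaceAllString(m[1], " ")))
		}

		// Strip the remaining tags and collapse whitespace.
		text := strings.Join(strings.Fields(anyTag.ReplaceAllString(block, " ")), " ")
		words := len(strings.Fields(text))

		if words < minWords {
			continue // too short: likely a menu item, caption, or footer
		}
		if float64(linkWords)/float64(words) > maxLinkDensity {
			continue // mostly links: likely navigation
		}
		kept = append(kept, text)
	}
	return strings.Join(kept, "\n")
}

func main() {
	page := `<html><body>
<nav><a href="/">Home</a> <a href="/about">About us</a> <a href="/c">Contact the team today</a></nav>
<p>The quick brown fox jumps over the lazy dog near the river bank.</p>
<footer>Copyright 2018</footer>
</body></html>`

	fmt.Println(ExtractText(page, 5, 0.5))
}
```

In this sketch the navigation block is rejected because all of its words sit inside links, and the footer is rejected because it is too short, so only the paragraph survives. The thresholds are arbitrary starting points and would need tuning on real pages.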

No commits in the last 6 months.

Use this if you are a developer building a system that needs to perform semantic analysis, classification, or any text-based processing on web article content.

Not ideal if you need to preserve the original HTML formatting, whitespace, or non-textual elements of the web page.

web-scraping content-extraction text-mining data-preparation information-retrieval
Stale 6m, No Package, No Dependents
Maintenance 0 / 25
Adoption 5 / 25
Maturity 16 / 25
Community 12 / 25

Stars 11
Forks 2
Language Go
License MIT
Category go-nlp-libraries
Last pushed Oct 15, 2018
Commits (30d) 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/emiruz/textextract"

Open to everyone: 100 requests/day with no key needed, or 1,000 requests/day with a free key.
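The same request can be made from Go. The sketch below calls the endpoint shown in the curl command and prints whatever top-level fields come back. Several things here are assumptions: that the URL path follows a category/owner/repo pattern, that the response is a JSON object, and the helper names `QualityURL`, `ParseQuality`, and `FetchQuality`, which are hypothetical. The response schema is not documented here, so the code decodes into a generic map rather than a typed struct.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"time"
)

// Base path taken from the curl example above.
const baseURL = "https://pt-edge.onrender.com/api/v1/quality"

// QualityURL builds the endpoint URL. The category/owner/repo layout is
// inferred from the single example URL and is an assumption.
func QualityURL(category, owner, repo string) string {
	return baseURL + "/" + url.PathEscape(category) + "/" +
		url.PathEscape(owner) + "/" + url.PathEscape(repo)
}

// ParseQuality decodes a response body, assuming it is a JSON object.
func ParseQuality(body []byte) (map[string]interface{}, error) {
	var fields map[string]interface{}
	if err := json.Unmarshal(body, &fields); err != nil {
		return nil, fmt.Errorf("decode quality response: %w", err)
	}
	return fields, nil
}

// FetchQuality performs the GET request and parses the body.
func FetchQuality(client *http.Client, endpoint string) (map[string]interface{}, error) {
	resp, err := client.Get(endpoint)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return ParseQuality(body)
}

func main() {
	// A timeout keeps the program from hanging on a slow or unreachable host.
	client := &http.Client{Timeout: 10 * time.Second}

	fields, err := FetchQuality(client, QualityURL("nlp", "emiruz", "textextract"))
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch failed:", err)
		return
	}
	for name, value := range fields {
		fmt.Printf("%s: %v\n", name, value)
	}
}
```

With the keyless limit of 100 requests/day, a caller polling many repositories would likely want to cache responses locally rather than refetch on every run.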