emiruz/textextract

textextract is a tiny library (87 lines of Go) that identifies where the article content is in an HTML page (as opposed to navigation, headers, footers, ads, etc.), extracts it, and returns it as a string. It is similar to Boilerpipe, but written in Go for use from Go.


Emerging

This library helps web crawlers and content aggregators by identifying and extracting the main article content from an HTML page. It takes raw HTML as input and returns a plain-text string of the article, without navigation, ads, or footers. It suits developers building tools that analyze or process web content and need only the article text.
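The input and output described above can be illustrated with a small, standard-library-only sketch. This is not textextract's code or API. The function `extractMainText`, its `minWords` and `maxLinkDensity` parameters, and the regular expressions are all hypothetical, and the heuristic (keep blocks with enough words and little link text) is a simplified stand-in for whatever scoring the library actually uses.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

var (
	// Script and style elements never contain article text, so drop them whole.
	dropRe = regexp.MustCompile(`(?is)<(script|style)\b.*?</(script|style)>`)
	// Block-level tags mark the boundaries between candidate text blocks.
	blockRe = regexp.MustCompile(`(?i)</?(p|div|li|ul|ol|nav|header|footer|aside|section|article|h[1-6]|br|td|tr|table)\b[^>]*>`)
	// Anchor elements, used to measure how much of a block is link text.
	anchorRe = regexp.MustCompile(`(?is)<a\b[^>]*>(.*?)</a>`)
	// Any remaining tag.
	tagRe = regexp.MustCompile(`(?s)<[^>]*>`)
)

// words strips tags from s, decodes entities, and splits on whitespace.
func words(s string) []string {
	return strings.Fields(html.UnescapeString(tagRe.ReplaceAllString(s, " ")))
}

// extractMainText is a hypothetical helper (not part of textextract).
// It keeps blocks that have at least minWords words and whose share of
// words inside links does not exceed maxLinkDensity.
func extractMainText(page string, minWords int, maxLinkDensity float64) string {
	page = dropRe.ReplaceAllString(page, " ")
	var kept []string
	for _, block := range blockRe.Split(page, -1) {
		all := words(block)
		if len(all) == 0 || len(all) < minWords {
			continue // too short to be article text
		}
		linked := 0
		for _, m := range anchorRe.FindAllStringSubmatch(block, -1) {
			linked += len(words(m[1]))
		}
		if float64(linked)/float64(len(all)) > maxLinkDensity {
			continue // mostly links: likely navigation
		}
		kept = append(kept, strings.Join(all, " "))
	}
	return strings.Join(kept, "\n")
}

func main() {
	page := `<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a> <a href="/contact">Contact</a></nav>
<p>Go makes it straightforward to write small, focused libraries that do one job well.</p>
<footer>Copyright Example Inc.</footer>
</body></html>`
	fmt.Println(extractMainText(page, 8, 0.3))
}
```

On this sample page the navigation and footer blocks fall below the word threshold, so only the paragraph text is kept. A regex-based splitter like this is fragile on real-world markup. A production extractor would normally walk a parsed DOM instead.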

No commits in the last 6 months.

Use this if you are a developer building a system that needs to perform semantic analysis, classification, or any text-based processing on web article content.

Not ideal if you need to preserve the original HTML formatting, whitespace, or non-textual elements of the web page.

web-scraping content-extraction text-mining data-preparation information-retrieval

Stale 6m No Package No Dependents

Maintenance 0 / 25

Adoption 5 / 25

Maturity 16 / 25

Community 12 / 25


Language Go

License MIT

Higher-rated alternatives

ikawaha/kagome-dict

Dictionary Library for Kagome v2

aaaton/golem

A lemmatizer implemented in Go

habeanf/yap

Yet Another (natural language) Parser

clipperhouse/uax29

A tokenizer based on Unicode text segmentation (UAX #29), for Go. Split graphemes, words, sentences.

abadojack/whatlanggo

Natural language detection library for Go
