ruanchaves/hashformers

Accurate word segmentation for hashtags and text, powered by Transformers and Beam Search. A scalable alternative to heuristic splitters and massive LLMs.

Score: 48 / 100 (Emerging)

Social media text often lacks spaces between words, as in #weneedanationalpark or #москвасити. This tool takes such unsegmented strings and splits them into individual, readable words. It is aimed at data scientists, social media analysts, and NLP researchers who need to clean and prepare text data for further analysis, including text in languages other than English.
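To make the idea of beam-search segmentation concrete, here is a minimal, self-contained sketch. It is not the hashformers implementation or its API: the vocabulary `TOY_VOCAB`, the scorer `toy_word_score`, and the function `beam_segment` are all hypothetical names invented for illustration, and the toy unigram scorer stands in for the Transformer language model that a real segmenter would use to rank candidates.

```python
import heapq
from typing import Callable, Dict, List, Tuple

# Hypothetical toy vocabulary with log-probability-like scores (assumption:
# a real system would score candidates with a language model instead).
TOY_VOCAB: Dict[str, float] = {
    "we": -1.0, "need": -1.0, "a": -1.0, "national": -1.0, "park": -1.0,
}


def toy_word_score(word: str) -> float:
    """Score one candidate word; unknown words pay a heavy per-character penalty."""
    return TOY_VOCAB.get(word, -10.0 * len(word))


def beam_segment(
    text: str,
    score: Callable[[str], float] = toy_word_score,
    beam_width: int = 3,
    max_word_len: int = 12,
) -> List[str]:
    """Segment `text` into words with a beam search over split points."""
    text = text.lstrip("#").lower()
    n = len(text)
    if n == 0:
        return []

    # beams[i] collects (total_score, words) hypotheses that cover text[:i].
    beams: List[List[Tuple[float, List[str]]]] = [[] for _ in range(n + 1)]
    beams[0] = [(0.0, [])]

    for i in range(n):
        # Prune: keep only the best `beam_width` hypotheses ending at position i.
        beams[i] = heapq.nlargest(beam_width, beams[i], key=lambda h: h[0])
        for total, words in beams[i]:
            # Extend each surviving hypothesis by one more candidate word.
            for j in range(i + 1, min(n, i + max_word_len) + 1):
                word = text[i:j]
                beams[j].append((total + score(word), words + [word]))

    # Return the highest-scoring hypothesis that covers the whole string.
    _, best_words = max(beams[n], key=lambda h: h[0])
    return best_words


print(beam_segment("#weneedanationalpark"))
# → ['we', 'need', 'a', 'national', 'park']
```

The pruning step is what keeps the search tractable: only `beam_width` partial segmentations survive at each position. With a language-model scorer, whose scores depend on context rather than on single words, the same pruning makes the search approximate rather than exact.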

Available on PyPI.

Use this if you need to precisely segment text like hashtags or concatenated words at scale, especially when working with various languages or niche vocabularies where pre-built dictionaries are insufficient.

Not ideal if you need very low latency or throughput so high that even small language models are too slow, or if you only need to segment a handful of items.

social-media-analysis text-preprocessing natural-language-processing data-cleaning multilingual-text
Maintenance 6 / 25
Adoption 9 / 25
Maturity 25 / 25
Community 8 / 25

Stars: 77
Forks: 5
Language: Python
License: MIT
Last pushed: Jan 08, 2026
Commits (30d): 0
Dependencies: 3

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/ruanchaves/hashformers"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
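The same request can be made from Python with only the standard library. This is a rough sketch under stated assumptions: the helper names `build_quality_url` and `fetch_quality` are hypothetical, the `/{category}/{owner}/{repo}` path layout is inferred from the single curl example above, and the response format is not documented here, so the body is returned as raw text rather than parsed.

```python
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example; everything after it is an inferred layout.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, repo: str) -> str:
    """Build the endpoint URL for a repo given as 'owner/name' (assumed layout)."""
    # quote() leaves '/' unescaped by default, so 'owner/name' stays intact.
    return f"{BASE_URL}/{quote(category)}/{quote(repo)}"


def fetch_quality(category: str, repo: str, timeout: float = 10.0) -> str:
    """Fetch the endpoint and return the raw response body as text."""
    url = build_quality_url(category, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Example call (performs a network request, so it is left commented out):
# print(fetch_quality("transformers", "ruanchaves/hashformers"))
```

Since unauthenticated access is limited to 100 requests per day, it is worth caching responses locally instead of calling the endpoint repeatedly.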