bheinzerling/bpemb
Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
This project helps natural language processing (NLP) practitioners prepare text data for their machine learning models. It maps words or sentences in any of 275 languages to numerical representations (embeddings) of subword units. This is useful for researchers and developers building language-based AI systems, particularly for tasks involving less common languages or rare, out-of-vocabulary words.
1,221 stars. No commits in the last 6 months.
Use this if you need pre-trained subword embeddings in a wide range of languages for your neural network models, especially when full word embeddings are insufficient or unavailable.
Not ideal if your primary goal is basic text analysis without advanced machine learning models, or if you only work with a few major languages where robust word embeddings are already plentiful.
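The subword units behind these embeddings come from Byte-Pair Encoding: starting from characters, the most frequent adjacent symbol pair is repeatedly merged, so common words become single units while rare words decompose into smaller pieces. A minimal sketch of that merge-learning step (an illustrative helper, not part of the bpemb API):

```python
from collections import Counter

def learn_bpe_merges(word_freqs, num_merges):
    """Learn BPE merge rules from a dict of word -> frequency.

    Each word starts as a tuple of characters; on every step the most
    frequent adjacent symbol pair is fused into one symbol.
    """
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Re-segment every word with the new merge applied.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab
```

On a toy corpus such as `{"low": 5, "lower": 2, "lowest": 3}`, two merges fuse `l+o` and then `lo+w`, so the shared stem "low" becomes one symbol while the suffixes stay as smaller pieces. BPEmb trains this procedure on Wikipedia text per language and then learns an embedding vector for each resulting subword.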
Stars: 1,221
Forks: 102
Language: Python
License: MIT
Category:
Last pushed: Oct 01, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/bheinzerling/bpemb"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
embeddings-benchmark/mteb
MTEB: Massive Text Embedding Benchmark
harmonydata/harmony
The Harmony Python library: a research tool for psychologists to harmonise data and...
yannvgn/laserembeddings
LASER multilingual sentence embeddings as a pip package
embeddings-benchmark/results
Data for the MTEB leaderboard
Hironsan/awesome-embedding-models
A curated list of awesome embedding models tutorials, projects and communities.