bheinzerling/bpemb
Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
This project helps natural language processing (NLP) practitioners prepare text data for their machine learning models. It maps words or sentences in any of 275 languages to numerical representations (embeddings) of subword units. This is useful for researchers and developers building language-based AI systems, particularly for tasks involving less common languages or rare, out-of-vocabulary words.
1,221 stars. No commits in the last 6 months.
Use this if you need pre-trained subword embeddings in a wide range of languages for your neural network models, especially when full word embeddings are insufficient or unavailable.
Not ideal if your primary goal is basic text analysis without advanced machine learning models, or if you only work with a few major languages where robust word embeddings are already plentiful.
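The subword units behind these embeddings come from Byte-Pair Encoding: starting from characters, the most frequent adjacent symbol pair is repeatedly merged, so common words become single units while rare words decompose into smaller pieces. A minimal sketch of that merge-learning step (an illustrative helper, not part of the bpemb API):

```python
from collections import Counter

def learn_bpe_merges(word_freqs, num_merges):
    """Learn BPE merge rules from a dict of word -> frequency.

    Each word starts as a tuple of characters; on every step the most
    frequent adjacent symbol pair is fused into one symbol.
    """
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Re-segment every word with the new merge applied.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab
```

On a toy corpus such as `{"low": 5, "lower": 2, "lowest": 3}`, two merges fuse `l+o` and then `lo+w`, so the shared stem "low" becomes one symbol while the suffixes stay as smaller pieces. BPEmb trains this procedure on Wikipedia text per language and then learns an embedding vector for each resulting subword.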
Stars: 1,221
Forks: 102
Language: Python
License: MIT
Category:
Last pushed: Oct 01, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/bheinzerling/bpemb"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
embeddings-benchmark/mteb
MTEB: Massive Text Embedding Benchmark
harmonydata/harmony
The Harmony Python library: a research tool for psychologists to harmonise data and...
yannvgn/laserembeddings
LASER multilingual sentence embeddings as a pip package
embeddings-benchmark/results
Data for the MTEB leaderboard
Hironsan/awesome-embedding-models
A curated list of awesome embedding models tutorials, projects and communities.