facebookresearch/stopes

A library for preparing data for machine translation research (monolingual preprocessing, bitext mining, etc.) built by the FAIR NLLB team.

64 / 100 (Established)

This library helps machine translation researchers prepare vast amounts of text for training translation models. It takes raw web data and large collections of texts in different languages to produce clean, single-language datasets and aligned sentence pairs across languages. It's designed for researchers building and evaluating machine translation systems, especially those working with many languages.

297 stars. Available on PyPI.

Use this if you are a machine translation researcher who needs to preprocess web-scale monolingual data or mine parallel sentence pairs for model training.

Not ideal if you are looking for a pre-trained machine translation model or a ready-to-use translation API for end-user applications.

machine-translation natural-language-processing text-preprocessing bitext-mining linguistic-data-preparation
Maintenance 10 / 25
Adoption 10 / 25
Maturity 25 / 25
Community 19 / 25

Stars: 297
Forks: 45
Language: Python
License: MIT
Last pushed: Mar 12, 2026
Commits (30d): 0
Dependencies: 6

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/facebookresearch/stopes"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
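The same request can be made from Python. The sketch below is a minimal example using only the standard library. The helper names `quality_url` and `fetch_quality` are hypothetical. The category/owner/repo path layout is inferred from the single curl URL above. The JSON response body is an assumption, since this page does not document the response format.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL (hypothetical helper).

    The category/owner/repo layout is inferred from the one URL shown
    on this page, not from documented API rules.
    """
    parts = (category, owner, repo)
    return BASE_URL + "/" + "/".join(quote(p, safe="") for p in parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the quality record for a repository (hypothetical helper).

    Assumes the endpoint returns a JSON body; that is not confirmed here.
    """
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Print the URL only; call fetch_quality(...) to make the real request.
    print(quality_url("nlp", "facebookresearch", "stopes"))
```

Each call to `fetch_quality("nlp", "facebookresearch", "stopes")` would count against the 100 requests/day keyless limit, so caching the result locally is worth considering for repeated lookups.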