facebookresearch/stopes
A library for preparing data for machine translation research (monolingual preprocessing, bitext mining, etc.) built by the FAIR NLLB team.
This library helps machine translation researchers prepare vast amounts of text for training translation models. It turns raw web data and large multilingual text collections into clean monolingual datasets and aligned sentence pairs (bitext) across languages. It's designed for researchers building and evaluating machine translation systems, especially those working with many languages.
297 stars. Available on PyPI.
Use this if you are a machine translation researcher needing to preprocess web-scale monolingual data or mine parallel sentence pairs to train your models efficiently.
Not ideal if you are looking for a pre-trained machine translation model or a ready-to-use translation API for end-user applications.
Stars: 297
Forks: 45
Language: Python
License: MIT
Category:
Last pushed: Mar 12, 2026
Commits (30d): 0
Dependencies: 6
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/facebookresearch/stopes"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
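The same endpoint can be called from Python. A minimal sketch using only the standard library; only the URL shape is taken from the curl example above, and the response schema is not documented here, so the JSON is returned as-is:

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def build_url(category: str, owner: str, repo: str) -> str:
    """Construct the endpoint URL for one repository's quality data."""
    return f"{BASE}/{category}/{owner}/{repo}"

def fetch_quality(category: str, owner: str, repo: str) -> dict:
    """Fetch and decode the JSON payload (schema undocumented here)."""
    with urllib.request.urlopen(build_url(category, owner, repo)) as resp:
        return json.load(resp)

# The stopes entry shown on this page:
url = build_url("nlp", "facebookresearch", "stopes")
# data = fetch_quality("nlp", "facebookresearch", "stopes")  # requires network access
```

Unauthenticated calls count against the 100/day limit; with a key you would typically pass it as a header or query parameter, but the exact mechanism is not documented on this page.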
Related tools
Droidtown/ArticutAPI
API of Articut, a Chinese word segmenter with semantic part-of-speech tagging. Word segmentation is the foundation of Chinese text processing; Articut uses no machine learning and no data models, relying only on modern written-Chinese grammar rules to achieve...
rkcosmos/deepcut
A Thai word tokenization library using Deep Neural Network
fukuball/jieba-php
"Jieba" (Chinese for "to stutter") Chinese text segmentation: aiming to be the best PHP Chinese word segmentation component...
pytorch/text
Models, data loaders and abstractions for language processing, powered by PyTorch
jiesutd/NCRFpp
NCRF++, a Neural Sequence Labeling Toolkit. Easy use to any sequence labeling tasks (e.g. NER,...