Helsinki-NLP/OpusFilter
OpusFilter - Parallel corpus processing toolkit
OpusFilter helps natural language processing researchers and developers prepare large collections of text for machine translation and other multilingual tasks. It takes raw parallel text corpora (texts aligned sentence-by-sentence in two or more languages) and outputs cleaned, filtered, and combined versions, ready for training. This is ideal for those building or evaluating multilingual AI models.
115 stars. Available on PyPI.
Use this if you need to clean and combine massive, noisy parallel text datasets efficiently to improve the quality of your machine translation or cross-lingual models.
Not ideal if you're working with single-language texts or smaller, carefully curated datasets that don't require extensive automated filtering and large-scale combination.
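To make the idea of cleaning noisy parallel text concrete, here is a minimal plain-Python sketch of the kind of filtering such a toolkit automates. It does not use OpusFilter's own API or configuration format; the function names, the thresholds (`max_ratio`, `max_tokens`), and the sample sentence pairs are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]  # (source segment, target segment)


def length_ratio(src: str, tgt: str) -> float:
    """Ratio of the longer to the shorter segment, counted in whitespace tokens."""
    a, b = len(src.split()), len(tgt.split())
    if min(a, b) == 0:
        return float("inf")
    return max(a, b) / min(a, b)


def filter_pairs(
    pairs: Iterable[Pair],
    max_ratio: float = 3.0,   # hypothetical threshold
    max_tokens: int = 100,    # hypothetical threshold
) -> Iterator[Pair]:
    """Yield pairs that are non-empty, not too long, length-balanced, and unseen."""
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # drop pairs with an empty side
        if max(len(src.split()), len(tgt.split())) > max_tokens:
            continue  # drop overly long segments
        if length_ratio(src, tgt) > max_ratio:
            continue  # drop pairs whose lengths differ too much
        if (src, tgt) in seen:
            continue  # drop exact duplicates
        seen.add((src, tgt))
        yield src, tgt


# Made-up English-Finnish sample data, for illustration only.
noisy = [
    ("Hello world .", "Hei maailma ."),
    ("Hello world .", "Hei maailma ."),  # exact duplicate
    ("", "Tyhjä lähde"),                 # empty source side
    ("Yes", "Tämä on aivan liian pitkä käännös yhdelle sanalle"),  # unbalanced
    ("Good morning", "Hyvää huomenta"),
]
clean = list(filter_pairs(noisy))
```

On this sample, only the first "Hello world ." pair and the "Good morning" pair survive. A real pipeline would typically add further checks, such as language identification or character-set tests, on top of simple heuristics like these.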
Stars: 115
Forks: 26
Language: Python
License: MIT
Category:
Last pushed: Feb 11, 2026
Commits (30d): 0
Dependencies: 20
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/Helsinki-NLP/OpusFilter"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
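The same request can be sketched in Python using only the standard library. The URL layout (`category/owner/repo`) is inferred from the single example above, and the assumption that the endpoint returns JSON is not confirmed by this page; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Base path taken from the curl example; everything after it is inferred.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL, percent-encoding each path segment."""
    parts = [urllib.parse.quote(p, safe="") for p in (category, owner, repo)]
    return f"{BASE_URL}/{'/'.join(parts)}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch and decode the response, ASSUMING the body is JSON."""
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the URL needs no network access.
url = build_quality_url("nlp", "Helsinki-NLP", "OpusFilter")
```

Calling `fetch_quality("nlp", "Helsinki-NLP", "OpusFilter")` would perform the actual request and count against the daily request limit mentioned above.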
Related tools
natasha/corus
Links to Russian corpora + Python functions for loading and parsing
darija-open-dataset/dataset
Darija <-> English dataset
omicsNLP/Auto-CORPus
Auto-CORPus pipeline developed by a University of Nottingham and Imperial College London...
SergeyShk/ruTS
A library for extracting statistics from Russian-language texts.
texttechnologylab/GerParCor
German Parliamentary Corpus (GerParCor)