PhilipMay/stsb-multi-mt

Machine translated multilingual STS benchmark dataset.

Emerging

This dataset provides pairs of sentences in multiple languages (such as English, German, and Spanish), each with a score indicating how semantically similar the two sentences are. It helps you train systems to recognize when two sentences mean the same thing, even if they are worded differently or are in different languages. It is aimed at anyone developing or evaluating multilingual natural language understanding models, especially for tasks like semantic search or question answering.

No commits in the last 6 months.

Use this if you need diverse, scored sentence pairs in various languages to train or benchmark models that measure sentence similarity.

Not ideal if you need data for tasks other than sentence similarity, or if you require perfect grammatical accuracy in the machine-translated non-English sentences.
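As a rough illustration of the benchmarking use case, the sketch below scores a few sentence pairs with a naive token-overlap baseline and compares the predictions against gold similarity scores using Spearman rank correlation, a common way to report STS results. The three pairs and their gold scores are made-up toy values, not rows from the dataset, and the 0 to 5 score range and the helper names (`jaccard_similarity`, `average_ranks`, `spearman`) are assumptions for this example only.

```python
from math import sqrt


def jaccard_similarity(a: str, b: str) -> float:
    """Naive baseline: overlap of lowercased whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy data (hypothetical): (sentence1, sentence2, gold score on an assumed 0-5 scale).
pairs = [
    ("A man is playing a guitar.", "A man is playing a guitar.", 5.0),
    ("A woman is slicing an onion.", "A woman is cutting an onion.", 4.2),
    ("A dog runs in the park.", "A cat sleeps on the sofa.", 0.5),
]

predicted = [jaccard_similarity(s1, s2) for s1, s2, _ in pairs]
gold = [score for _, _, score in pairs]
print(f"Spearman correlation: {spearman(predicted, gold):.3f}")
```

Rank correlation only checks that the model orders the pairs the same way the gold scores do, so the baseline's 0 to 1 outputs do not need rescaling to the gold range. In practice you would replace the toy list with rows from the dataset and the token-overlap baseline with your own model.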

natural-language-processing multilingual-AI semantic-similarity machine-learning-training-data language-AI-evaluation

Stale 6m No Package No Dependents

Maintenance 0 / 25

Adoption 7 / 25

Maturity 16 / 25

Community 17 / 25

Language

Python

License

—

Higher-rated alternatives

luheng/deep_srl

Code and pre-trained model for: Deep Semantic Role Labeling: What Works and What's Next

sileod/tasksource

Dataset collection and preprocessing framework for NLP extreme multitask learning

loomchild/maligna

Bilingual sentence aligner

CK-Explorer/DuoSubs

Semantic subtitle aligner and merger for bilingual subtitle syncing.

coastalcph/lex-glue

LexGLUE: A Benchmark Dataset for Legal Language Understanding in English
