Babelscape/ID10M
Data and code for the paper "ID10M: Idiom Identification in 10 Languages" (NAACL 2022).
This project helps natural language processing (NLP) researchers and linguists automatically identify idioms in text across 10 languages, including English, Spanish, and Chinese. It provides pre-trained models and datasets; the models take raw text as input and output annotations that mark the idiomatic expressions in each sentence. It is aimed at work on multilingual text analysis and figurative language.
No commits in the last 6 months.
Use this if you need to build or evaluate systems that can automatically detect idiomatic expressions in multiple languages for research or application development.
Not ideal if you are looking for an off-the-shelf application to explain idiom meanings or to translate idiomatic phrases in real-time.
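To make the "annotations that highlight idiomatic expressions" concrete, here is a minimal sketch of turning per-token tags into idiom spans. It assumes a BIO-style scheme with the labels B-IDIOM, I-IDIOM, and O; those label names, the function name extract_idiom_spans, and the example sentence are illustrative assumptions, not taken from the repository.

```python
from typing import List, Tuple


def extract_idiom_spans(tokens: List[str], tags: List[str]) -> List[Tuple[int, int, str]]:
    """Collect (start, end, text) idiom spans from BIO tags; end is exclusive.

    The label names "B-IDIOM" / "I-IDIOM" / "O" are assumed for illustration.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    spans: List[Tuple[int, int, str]] = []
    start = None  # index where the currently open span began, if any

    for i, tag in enumerate(tags):
        if tag == "B-IDIOM":
            # A new idiom begins; close any span that is still open.
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif tag == "I-IDIOM":
            # Continue the open span; tolerate an I- tag with no preceding B-.
            if start is None:
                start = i
        else:
            # Any other tag (e.g. "O") closes the open span.
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
                start = None

    # Close a span that runs to the end of the sentence.
    if start is not None:
        spans.append((start, len(tokens), " ".join(tokens[start:])))
    return spans


tokens = ["He", "finally", "kicked", "the", "bucket", "."]
tags = ["O", "O", "B-IDIOM", "I-IDIOM", "I-IDIOM", "O"]
print(extract_idiom_spans(tokens, tags))  # [(2, 5, 'kicked the bucket')]
```

The same function works for any language whose text has already been split into tokens, since it only looks at the tag sequence.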
Stars: 8
Forks: 4
Language: Python
License: —
Category:
Last pushed: Feb 01, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/Babelscape/ID10M"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
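The same request can be sketched in Python with only the standard library. The helper names build_quality_url and fetch_quality are hypothetical, and the path layout (category, then owner, then repository) is inferred from the single URL shown above. The response format is not documented here, so the sketch returns the raw body as text (decoded as UTF-8, which is also an assumption) rather than parsing it.

```python
import urllib.parse
import urllib.request

# Base path copied from the curl example; everything after it is assumed
# to follow the pattern <category>/<owner>/<repo>.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL, percent-encoding each path segment."""
    parts = [urllib.parse.quote(p, safe="") for p in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0) -> str:
    """Fetch the endpoint and return the raw response body as text.

    No API key is sent; the format of the body is not assumed.
    """
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling fetch_quality("nlp", "Babelscape", "ID10M") would issue the same GET request as the curl command and count against the daily request limit.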
Higher-rated alternatives
luheng/deep_srl
Code and pre-trained model for: Deep Semantic Role Labeling: What Works and What's Next
sileod/tasksource
Datasets collection and preprocessings framework for NLP extreme multitask learning
loomchild/maligna
Bilingual sentence aligner
CK-Explorer/DuoSubs
Semantic subtitle aligner and merger for bilingual subtitle syncing.
coastalcph/lex-glue
LexGLUE: A Benchmark Dataset for Legal Language Understanding in English