tushar117/XAlign
Cross-lingual Fact-to-Text Alignment and Generation for Low-Resource Languages
This project helps natural language processing practitioners create datasets that link structured facts (like "Boris Pasternak" won "Nobel Prize in Literature") to sentences in low-resource languages (like Hindi or Tamil). It takes English Wikidata facts and aligns them with the corresponding Wikipedia sentences written in those languages. The output is a structured dataset in which each native-language sentence is paired with its relevant facts and a language identifier, suitable for training fact-to-text generation models.
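To make the shape of the output concrete, here is a minimal sketch of one aligned record. The class and field names (`Fact`, `AlignedExample`, `lang`, `sentence`, `facts`) are hypothetical and chosen only for illustration; they are not taken from the XAlign codebase, and the sentence is a placeholder rather than real data.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Fact:
    """One English fact as a (subject, predicate, object) triple."""
    subject: str
    predicate: str
    obj: str


@dataclass
class AlignedExample:
    """A native-language sentence paired with the facts it expresses."""
    lang: str       # language identifier, e.g. "hi" for Hindi (assumed code style)
    sentence: str   # sentence in the low-resource language
    facts: List[Fact] = field(default_factory=list)


def group_by_language(examples: List[AlignedExample]) -> Dict[str, List[AlignedExample]]:
    """Bucket aligned examples by their language identifier."""
    grouped: Dict[str, List[AlignedExample]] = {}
    for ex in examples:
        grouped.setdefault(ex.lang, []).append(ex)
    return grouped


# Placeholder record: the sentence text is illustrative, not real data.
example = AlignedExample(
    lang="hi",
    sentence="<Hindi Wikipedia sentence>",
    facts=[Fact("Boris Pasternak", "won", "Nobel Prize in Literature")],
)
```

Grouping by language, as `group_by_language` does, is one plausible way to split such a dataset into per-language training sets.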
No commits in the last 6 months.
Use this if you need to build or expand knowledge graphs and natural language generation models for languages not well-represented in existing datasets.
Not ideal if you are working with high-resource languages or only need monolingual fact-to-text alignment.
Stars: 11
Forks: 1
Language: Python
License: MIT
Category:
Last pushed: Jan 01, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/tushar117/XAlign"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
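The same request can be made from Python using only the standard library. This is a sketch under assumptions: the helper names `build_quality_url` and `fetch_quality` are made up for illustration, and the response is assumed to be JSON, which the page does not state.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL, percent-encoding each path segment."""
    parts = [urllib.parse.quote(p, safe="") for p in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the quality record for a repository.

    Assumes the endpoint returns a JSON body; raises urllib.error.URLError
    on network failure and json.JSONDecodeError if the body is not JSON.
    """
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example call (performs a real network request, so it is left commented out):
# data = fetch_quality("nlp", "tushar117", "XAlign")
```

Because unauthenticated access is limited to 100 requests/day, it is worth caching the result rather than calling `fetch_quality` in a loop.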
Higher-rated alternatives
DerwenAI/pytextrank: Python implementation of TextRank algorithms ("textgraphs") for phrase extraction
Tiiiger/bert_score: BERT score for text generation
BrikerMan/Kashgari: Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for...
asyml/texar: Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. ...
yohasebe/wp2txt: A command-line tool to extract plain text from Wikipedia dumps with category and section filtering