Text Alignment Systems NLP Tools

Tools for aligning texts across languages, documents, or modalities (word-level, sentence-level, or document-level). Includes cross-lingual alignment, monolingual alignment, and narrative/script synchronization. Does NOT include general translation, similarity matching without explicit alignment output, or semantic parsing.

97 text alignment tools are tracked. The highest-rated is luheng/deep_srl at 49/100 with 334 stars.

Get all 97 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=text-alignment-systems&limit=20"

Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
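The same request can be made from a script. The sketch below uses only the Python standard library and rebuilds the URL shown in the curl command. The helper names `build_url` and `fetch_projects` are illustrative and not part of the API. The response schema is not documented here, so the parsed JSON is returned as-is. The command above passes `limit=20`; whether a larger `limit` returns all 97 projects in one call is an assumption to verify against the API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="nlp", subcategory="text-alignment-systems", limit=20):
    """Build the query URL; the defaults mirror the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(limit=20, timeout=30):
    """Fetch the listing and return the parsed JSON without assuming its shape."""
    with urlopen(build_url(limit=limit), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (performs a network request, so it is left commented out):
# data = fetch_projects()
# print(type(data))
```

Keeping URL construction separate from the network call makes it possible to check the query string without spending any of the daily request quota.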

| # | Tool | Description | Score | Tier |
|---|---|---|---|---|
| 1 | luheng/deep_srl | Code and pre-trained model for: Deep Semantic Role Labeling: What Works and... | 49 | Emerging |
| 2 | sileod/tasksource | Datasets collection and preprocessings framework for NLP extreme multitask learning | 48 | Emerging |
| 3 | loomchild/maligna | Bilingual sentence aligner | 46 | Emerging |
| 4 | CK-Explorer/DuoSubs | Semantic subtitle aligner and merger for bilingual subtitle syncing. | 41 | Emerging |
| 5 | coastalcph/lex-glue | LexGLUE: A Benchmark Dataset for Legal Language Understanding in English | 40 | Emerging |
| 6 | ChineseGLUE/ChineseGLUE | Language Understanding Evaluation benchmark for Chinese: datasets,... | 40 | Emerging |
| 7 | gkiril/benchie | Comprehensive evaluation framework for Open Information Extraction. | 40 | Emerging |
| 8 | PhilipMay/stsb-multi-mt | Machine translated multilingual STS benchmark dataset. | 40 | Emerging |
| 9 | naver-ai/korean-safety-benchmarks | Official datasets and pytorch implementation repository of SQuARe and KoSBi... | 39 | Emerging |
| 10 | scofield7419/HeSyFu | Code for the ACL2021 paper: Better Combine Them Together! Integrating... | 38 | Emerging |
| 11 | IINemo/isanlp_srl_framebank | SRL parser for Russian based on FrameBank corpus | 37 | Emerging |
| 12 | vecto-ai/word-benchmarks | Benchmarks for intrinsic word embeddings evaluation. | 36 | Emerging |
| 13 | TalSchuster/CrossLingualContextualEmb | Cross-Lingual Alignment of Contextual Word Embeddings | 36 | Emerging |
| 14 | ardoco/benchmark | A benchmark repository for TLR between (textual) Software Architecture... | 36 | Emerging |
| 15 | ubisoft/ubisoft-laforge-binaryalign | BinaryAlign: Word Alignment as Binary Sequence Labeling | 35 | Emerging |
| 16 | UKPLab/eacl2026-abcd-link | Repository for reproducing results from ABCD-Link | 35 | Emerging |
| 17 | Babelscape/ID10M | Data and code for the paper "ID10M: Idiom Identification in 10 Languages"... | 35 | Emerging |
| 18 | cdli-gh/Semantic-Role-Labeler | A semantic role labeling system for the Sumerian language. A Google Summer... | 35 | Emerging |
| 19 | SapienzaNLP/gsrl | GSRL is a seq2seq model for end-to-end dependency- and span-based SRL (IJCAI2021). | 34 | Emerging |
| 20 | GuillaumeDD/dialign | Automatic and generic measures of verbal alignment in dyadic dialogue based... | 34 | Emerging |
| 21 | Babelscape/CroCoAlign | A Cross-Lingual, Context-Aware and Fully-Neural Sentence Alignment System... | 34 | Emerging |
| 22 | ku-nlp/JKUSea | Utilitary tool aligning sentences of texts written in 2 different languages. | 33 | Emerging |
| 23 | thunlp/DictSKB | Code and data of the paper "Automatic Construction of Sememe Knowledge Bases... | 33 | Emerging |
| 24 | qiyuw/WSPAlign | WSPAlign: Word Alignment Pre-training via Large-Scale Weakly Supervised Span... | 32 | Emerging |
| 25 | doc-analysis/XFUND | XFUND: A Multilingual Form Understanding Benchmark | 32 | Emerging |
| 26 | LaVi-Lab/CLEVA | [EMNLP 2023 Demo] "CLEVA: Chinese Language Models EVAluation Platform" | 32 | Emerging |
| 27 | tschomacker/aligned-narrative-documents | A collection of scripts to create a Document-aligned corpus of German... | 31 | Emerging |
| 28 | scofield7419/LAGCN-SRL | Codes for the AAAI 2021 paper: Encoder-Decoder Based Unified Semantic Role... | 31 | Emerging |
| 29 | tyjiangU/fido | Code for the paper "Exploiting Definitions for Frame Identification" | 31 | Emerging |
| 30 | amazon-science/real-world-noisy-benchmarks-for-natural-language-understanding | Benchmark test sets for real-world noise phenomena in goal-directed... | 31 | Emerging |
| 31 | thespectrewithin/joint_align | Cross-lingual Alignment vs Joint Training: A Comparative Study and A Simple... | 31 | Emerging |
| 32 | orzhan/rusimscore | Code for paper "RuSimScore: unsupervised scoring function for Russian... | 31 | Emerging |
| 33 | UKPLab/acl2024-ircoder | Data creation, training and eval scripts for the IRCoder paper | 30 | Emerging |
| 34 | strubell/preprocess-conll05 | Scripts for preprocessing the CoNLL-2005 SRL dataset. | 30 | Emerging |
| 35 | luciusssss/MiLiC-Eval | [ACL'25 Findings] MiLiC-Eval: Benchmarking Multilingual LLMs for China's... | 30 | Emerging |
| 36 | p-lambda/swords | The Stanford Word Substitution (Swords) Benchmark | 30 | Emerging |
| 37 | SapienzaNLP/dsrl | Code for "Semantic Role Labeling meets Definition Modeling: using natural... | 29 | Experimental |
| 38 | rggdmonk/hadal | A simple and efficient tool for mining and aligning sentences with pre-trained models. | 29 | Experimental |
| 39 | google/BEGIN-dataset | A benchmark dataset for evaluating dialog system and natural language... | 29 | Experimental |
| 40 | allenai/multicite | MultiCite code and data. Models are available on Huggingface. | 28 | Experimental |
| 41 | Tixierae/WECD | Code and data for the paper: 'Word Embeddings for the Construction Domain' | 28 | Experimental |
| 42 | v-hirak/explaining-MT-difficulty | Dataset of diverse typological language properties as part of "Assessing the... | 27 | Experimental |
| 43 | ryokamoi/wice | This repository contains the dataset and code for "WiCE: Real-World... | 27 | Experimental |
| 44 | longxudou/multispider | MultiSpider: Towards Benchmarking Multilingual Text-to-SQL Semantic Parsing | 27 | Experimental |
| 45 | lyutyuh/structured-span-selector | A Structured Span Selector (NAACL 2022). A structured span selector with a... | 26 | Experimental |
| 46 | liutianlin0121/decoding-time-realignment | Implementation of "Decoding-time Realignment of Language Models", ICML 2024. | 25 | Experimental |
| 47 | jacklxc/CORWA | CORWA: A Citation-Oriented Related Work Annotation Dataset, NAACL 2022 | 25 | Experimental |
| 48 | ShiZhengyan/IngredientParsing | Dataset and pytorch codes for the paper titled "Attention-based Ingredient... | 25 | Experimental |
| 49 | cvjena/chiasmus-detector | Code for paper "Data-Driven Detection of General Chiasmi Using Lexical and... | 24 | Experimental |
| 50 | Sam120204/Pluralistic-Alignment-for-Healthcare | Code of our paper - "Pluralistic Alignment for Healthcare: A Role-Driven... | 24 | Experimental |
| 51 | guilhermevarela/deep_srlbr | SRL task using PropBank 1.1 | 23 | Experimental |
| 52 | garfieldpigljy/CrowdWSA2019 | Crowdsourced Word Sequence Aggregation 2019 | 23 | Experimental |
| 53 | yumoxu/detnet | Code and dataset for TACL 19: Weakly Supervised Domain Detection. | 22 | Experimental |
| 54 | Botfuel/benchmark-nlp | NLP benchmark test sentences and full results | 21 | Experimental |
| 55 | samchengcs/IKEA-Dataset | A dataset for multimodal machine translation | 21 | Experimental |
| 56 | tsar-workshop/tsar-2025-shared-task | Code and data for TSAR 2025 Shared Task | 21 | Experimental |
| 57 | ZurichNLP/ConLoan | A Contrastive Multilingual Dataset for Evaluating Loanwords - ACL2025 | 20 | Experimental |
| 58 | nikolayVv/MultiParaphrase | Comparing and evaluating monolingual paraphrasing of English, German, Czech,... | 20 | Experimental |
| 59 | pranav-ust/cognates | ACL SRW paper: Alignment Analysis of Sequential Segmentation of Lexicons to... | 20 | Experimental |
| 60 | DominiqueMercier/ImpactCite | ImpactCite: A XLNet-based Solution Enabling Qualitative CitationImpact... | 20 | Experimental |
| 61 | SapienzaNLP/conception | Code and experiments for the COLING2020 paper "Conception:... | 20 | Experimental |
| 62 | kukas/word-alignment-visualization | Word Alignment Visualization is a Python package for visualizing word... | 20 | Experimental |
| 63 | sileod/metaeval | Collection of tasks for meta-learning and extreme multitask learning | 20 | Experimental |
| 64 | SapienzaNLP/srl-pas-probing | Probing for Predicate Argument Structures in Pretrained Language Models (ACL 2022). | 20 | Experimental |
| 65 | gling07/Text2DRS | System Text2Drs takes English narrative as an input and outputs a discourse... | 20 | Experimental |
| 66 | maxkagamine/word-alignment-demo | Demonstration of AI/neural word alignment of English & Japanese text using... | 19 | Experimental |
| 67 | SapienzaNLP/united-srl | A unified dataset for span- and dependency-based multilingual and... | 19 | Experimental |
| 68 | qiyuw/WSPAlign.InferEval | Inference library and evaluation script for WSPAlign... | 19 | Experimental |
| 69 | ghomasHudson/muld | The Multitask Long Document Benchmark | 19 | Experimental |
| 70 | SapienzaNLP/usea | Universal Semantic Annotator (LREC 2022) | 19 | Experimental |
| 71 | mbanon/benchmarks | Several benchmarks on sentence splitting and language identification | 19 | Experimental |
| 72 | SapienzaNLP/exploring-srl | Repository for the paper "Exploring Non-Verbal Predicates in Semantic Role... | 19 | Experimental |
| 73 | hexuandeng/HExp4UDS | Implementation of the paper “Holistic Exploration on Universal... | 19 | Experimental |
| 74 | SapienzaNLP/unify-srl | Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic... | 19 | Experimental |
| 75 | okalai-ai/moimoe | Typology-Guided Adaption in Multilingual Models | 19 | Experimental |
| 76 | joshstephenson/SEAS | Tools for extracting and aligning sentences from subtitle language pairs... | 19 | Experimental |
| 77 | DorinK/Principal-Parts-Detection | Multilingual dataset for principal parts detection in inflectional... | 18 | Experimental |
| 78 | hmosousa/professor_heideltime | Create a multilingual corpus weakly labeled with HeidelTime. | 17 | Experimental |
| 79 | agneknie/com4520DarwinProject | Adjacent code related to the paper prepared for Joint Workshop on Multiword... | 17 | Experimental |
| 80 | bMagicLAB/human-alignment-pl-en-codeswitch | Human-in-the-Loop alignment dataset for Polish-English code-switching... | 15 | Experimental |
| 81 | Toavinarandrianarivo/Scene2Chapter-NLP-Aligner | 📖 Align movie scripts with novel chapters seamlessly using advanced NLP... | 14 | Experimental |
| 82 | Youggls/ACROSS-ACL23 | Official code repo for paper: ACROSS: An Alignment-based Framework for... | 13 | Experimental |
| 83 | multilingual-dataset-survey/multilingual-dataset-survey.github.io | The website implementation of Findings of EMNLP 2022, "Beyond Counting... | 13 | Experimental |
| 84 | xiaomeng-zhu/LIEDER | Repository for the ACL 2024 paper "LIEDER: Linguistically-Informed... | 12 | Experimental |
| 85 | heyjoonkim/APA | Pytorch implementation of "Aligning Language Models to Explicitly Handle... | 12 | Experimental |
| 86 | kinit-sk/multiclaim | MultiClaim dataset repository | 12 | Experimental |
| 87 | seinecle/umibench | Testbench for sentiment and factuality in texts. | 11 | Experimental |
| 88 | INTERACT-LLM/alignment-drift-llms | Dataset and analysis code for BEA2025 paper @ ACL: "Alignment Drift in... | 11 | Experimental |
| 89 | squirridge/omod | orthographic mapping ondemand dataset | 11 | Experimental |
| 90 | NUS-IDS/CW-CURE | This is the official data repository for the following CIKM 2022 paper from... | 11 | Experimental |
| 91 | MrShininnnnn/CECW | This repository is for the Colorful Extended Cleanup World (CECW) dataset, a... | 11 | Experimental |
| 92 | da03/Epanadiplosis_Benchmark | Benchmarking the performance of various language models in generating... | 11 | Experimental |
| 93 | zahra-parvizian/PersianLexicalSimplifier | Persian text simplification using lexical simplification | 11 | Experimental |
| 94 | BasRizk/DatasetAligner | Generating variant of TV-shows based labelled data-set in language B from... | 10 | Experimental |
| 95 | oooranz/MonoAlign | Unsupervised monolingual word aligner | 10 | Experimental |
| 96 | minnesotanlp/taddex | Code and dataset for Martin et al's paper "Complex Mathematical Symbol... | 10 | Experimental |
| 97 | ocramz/nlp-data-superglue | Dataset parsers from the SuperGLUE benchmark https://super.gluebenchmark.com/tasks/ | 10 | Experimental |