symanto-research/merge-tokenizers
Package to align tokens from different tokenizations.
When working with text data, you often need to compare or combine information from different text analysis models, but those models may split the same text into words or sub-words in slightly different ways. This tool accurately maps and aligns such differing token lists to each other, even when one model splits a word into multiple pieces and another keeps it whole. It's aimed at anyone building natural language processing (NLP) systems, especially systems that integrate outputs from multiple language models.
No commits in the last 6 months.
Use this if you need to precisely connect corresponding words or word fragments (tokens) from the output of two different text processing systems or language models.
Not ideal if you only ever use a single text processing system, or if approximate, rather than precise, alignment between tokenizations is acceptable for your task.
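The core idea behind this kind of alignment can be illustrated with a small sketch: map each token to its character span in the original text, then treat two tokens as aligned when their spans overlap. Note this is a hypothetical illustration, not merge-tokenizers' actual API, and it assumes every token appears verbatim in the text (real libraries also handle sub-word markers like `##`):

```python
# Hypothetical sketch of span-based token alignment; not merge-tokenizers' API.

def token_spans(text, tokens):
    """Map each token to its (start, end) character span in the text."""
    result, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)  # assumes tokens appear verbatim, in order
        end = start + len(tok)
        result.append((start, end))
        pos = end
    return result

def align(text, tokens_a, tokens_b):
    """For each token in tokens_a, list the indices of overlapping tokens in tokens_b."""
    spans_a = token_spans(text, tokens_a)
    spans_b = token_spans(text, tokens_b)
    alignment = []
    for sa, ea in spans_a:
        overlaps = [j for j, (sb, eb) in enumerate(spans_b)
                    if max(sa, sb) < min(ea, eb)]  # character ranges intersect
        alignment.append(overlaps)
    return alignment

text = "unbelievable"
a = ["unbelievable"]          # one tokenizer keeps the word whole
b = ["un", "believ", "able"]  # another splits it into sub-words
print(align(text, a, b))      # → [[0, 1, 2]]
```

The one whole-word token in `a` maps to all three sub-word pieces in `b`, which is exactly the one-to-many case the package description mentions.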
Stars
16
Forks
—
Language
Python
License
—
Category
Last pushed
Mar 25, 2024
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/symanto-research/merge-tokenizers"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
huggingface/tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
megagonlabs/ginza-transformers
Use custom tokenizers in spacy-transformers
Kaleidophon/token2index
A lightweight but powerful library to build token indices for NLP tasks, compatible with major...
Hugging-Face-Supporter/tftokenizers
Use Huggingface Transformer and Tokenizers as Tensorflow Reusable SavedModels
NVIDIA/Cosmos-Tokenizer
A suite of image and video neural tokenizers