technion-cs-nlp/BiologicalTokenizers
Effect of tokenization on transformers for biological sequences
This project helps bioinformaticians and computational biologists improve the accuracy and efficiency of deep learning models that work with long biological sequences such as DNA or protein data. It takes raw biological sequences as input and outputs tokenized versions that substantially shorten each sequence while improving model performance. It is aimed at researchers and scientists who use transformer models for tasks like protein function prediction or sequence alignment.
Use this if you are building or training deep learning models on biological sequence data and need to optimize input representation for better accuracy and faster processing.
Not ideal if you are not working with deep learning models, particularly transformer architectures, or if your biological sequence analysis doesn't involve complex prediction or classification tasks.
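For a sense of what sub-word tokenization of a protein sequence looks like in practice, here is a minimal sketch using the huggingface tokenizers library (listed under alternatives below). It is not this repository's own code, and the sequences and vocabulary size are illustrative only:

# Minimal sketch (not this repository's code): train a BPE tokenizer on a few
# illustrative protein sequences and compare token count to character count.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Made-up amino-acid sequences, standing in for a real training corpus.
sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERS",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(vocab_size=100, special_tokens=["[UNK]", "[CLS]", "[SEP]"])
tokenizer.train_from_iterator(sequences, trainer)

encoding = tokenizer.encode(sequences[0])
print(len(sequences[0]), "characters ->", len(encoding.tokens), "tokens")

The shorter token sequence is what lets a transformer cover longer stretches of DNA or protein within a fixed context window.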
Stars: 22
Forks: —
Language: Python
License: MIT
Category: —
Last pushed: Dec 31, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/technion-cs-nlp/BiologicalTokenizers"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
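If you prefer Python over curl, here is a minimal sketch using the requests package (the URL is copied verbatim from the example above; the response schema is whatever the API returns):

# Minimal sketch: fetch the same quality data from Python using `requests`.
import requests

url = "https://pt-edge.onrender.com/api/v1/quality/transformers/technion-cs-nlp/BiologicalTokenizers"
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail loudly on HTTP errors
print(response.json())       # print whatever JSON the endpoint returns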
Higher-rated alternatives
huggingface/tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
megagonlabs/ginza-transformers
Use custom tokenizers in spacy-transformers
Kaleidophon/token2index
A lightweight but powerful library to build token indices for NLP tasks, compatible with major...
Hugging-Face-Supporter/tftokenizers
Use Huggingface Transformer and Tokenizers as Tensorflow Reusable SavedModels
NVIDIA/Cosmos-Tokenizer
A suite of image and video neural tokenizers