megagonlabs/ginza-transformers
Use custom tokenizers in spacy-transformers
This project helps natural language processing (NLP) developers integrate specialized or custom tokenizers with spaCy v3 transformer models. It lets you use tokenizers that are not provided by Hugging Face Transformers, so your models process text with the exact word and subword divisions a specific language or domain requires. It is aimed at developers building custom NLP pipelines for text with unusual structure.
No commits in the last 6 months. Available on PyPI.
Use this if your spaCy v3 transformer pipeline needs a custom tokenizer that isn't available through Hugging Face's default library.
Not ideal if your NLP workflow relies solely on standard tokenizers already supported by Hugging Face Transformers and spaCy.
Stars: 16
Forks: 5
Language: Python
License: MIT
Category:
Last pushed: Aug 09, 2022
Commits (30d): 0
Dependencies: 1
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/megagonlabs/ginza-transformers"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
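For readers who prefer a script to curl, the same request can be sketched in Python with only the standard library. This is a minimal sketch under stated assumptions: the `category/owner/repo` path layout is inferred from the single URL shown above, the response is assumed to be JSON (the page does not document its fields), and the helper names `build_quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL.

    Assumption: the path layout is category/owner/repo, inferred from
    the one example URL on this page.
    """
    parts = [urllib.parse.quote(p, safe="") for p in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch and decode the response.

    Assumption: the endpoint returns a JSON body; its schema is not
    documented here, so the decoded value is returned unchanged.
    """
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("transformers", "megagonlabs", "ginza-transformers")` would request the same URL as the curl command. With the keyless limit of 100 requests/day, it is worth caching the result rather than polling.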
Higher-rated alternatives
huggingface/tokenizers: 💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
Kaleidophon/token2index: A lightweight but powerful library to build token indices for NLP tasks, compatible with major...
Hugging-Face-Supporter/tftokenizers: Use Huggingface Transformer and Tokenizers as Tensorflow Reusable SavedModels
NVIDIA/Cosmos-Tokenizer: A suite of image and video neural tokenizers
wangcongcong123/ttt: A package for fine-tuning Transformers with TPUs, written in Tensorflow2.0+