dariush-bahrami/character-tokenizer
A character tokenizer for Hugging Face Transformers
This tool splits text into individual characters and maps each character to a numerical ID, producing input in a format suitable for Hugging Face Transformer models. It is aimed at machine learning engineers and NLP researchers who need character-level control over text processing.
No commits in the last 6 months.
Use this if you are an NLP developer who needs to process text character by character for a Hugging Face Transformer model.
Not ideal if you are an end-user looking for a ready-to-use application, rather than a developer tool.
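The character-to-ID mapping described above can be sketched in a few lines of plain Python. This is a minimal illustration only, not the repository's actual code: the class name `SimpleCharTokenizer`, its methods, and the `[PAD]` and `[UNK]` special tokens are hypothetical names chosen for the example. A real Hugging Face integration would typically build on the `transformers` tokenizer base classes instead.

```python
from typing import Dict, List


class SimpleCharTokenizer:
    """Illustrative character-level tokenizer (hypothetical, not the repo's class)."""

    def __init__(self, characters: str, pad_token: str = "[PAD]", unk_token: str = "[UNK]") -> None:
        self.pad_token = pad_token
        self.unk_token = unk_token
        # Reserve the first two IDs for special tokens, then one ID per character.
        self.vocab: Dict[str, int] = {pad_token: 0, unk_token: 1}
        for ch in characters:
            if ch not in self.vocab:
                self.vocab[ch] = len(self.vocab)
        self.inverse_vocab: Dict[int, str] = {i: tok for tok, i in self.vocab.items()}

    def tokenize(self, text: str) -> List[str]:
        # Character-level tokenization: every character is its own token.
        return list(text)

    def encode(self, text: str) -> List[int]:
        # Characters outside the vocabulary fall back to the unknown-token ID.
        unk_id = self.vocab[self.unk_token]
        return [self.vocab.get(ch, unk_id) for ch in self.tokenize(text)]

    def decode(self, ids: List[int]) -> str:
        # Special tokens are dropped when converting IDs back to text.
        specials = {self.pad_token, self.unk_token}
        tokens = [self.inverse_vocab[i] for i in ids]
        return "".join(tok for tok in tokens if tok not in specials)


tokenizer = SimpleCharTokenizer("abcdefghijklmnopqrstuvwxyz ")
ids = tokenizer.encode("hello world")
print(ids)                    # one integer ID per character
print(tokenizer.decode(ids))  # round-trips back to the input text
```

Because every character is its own token, the vocabulary stays tiny, at the cost of much longer input sequences than subword tokenizers produce.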
Stars: 32
Forks: 13
Language: Python
License: MIT
Category:
Last pushed: Jun 21, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/dariush-bahrami/character-tokenizer"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
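The same request can be made from Python using only the standard library. The sketch below rests on two assumptions that the listing does not confirm: that the URL follows a `category/owner/repo` pattern (inferred from the single example above) and that the endpoint returns JSON. The helper names `build_quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request
from typing import Any

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    # Assumed URL pattern, inferred from the one curl example in the listing.
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0) -> Any:
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Assumption: the response body is JSON; adjust if the API returns another format.
        return json.loads(response.read().decode("utf-8"))


url = build_quality_url("nlp", "dariush-bahrami", "character-tokenizer")
print(url)
```

Calling `fetch_quality("nlp", "dariush-bahrami", "character-tokenizer")` performs the actual network request, and each call counts against the daily request limit.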
Higher-rated alternatives
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
daac-tools/vaporetto
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer