georg-jung/FastBertTokenizer
Fast, memory-efficient library for WordPiece tokenization as used by BERT.
This tool helps AI developers working with .NET efficiently preprocess large amounts of text for BERT models. It takes raw text as input and converts it into the numerical token IDs, attention masks, and token type IDs that machine learning models expect. The ideal user is a developer building AI applications or services in a .NET environment that rely on BERT's text processing.
Use this if you need to prepare text for BERT models quickly and memory-efficiently within a .NET application, especially when processing large datasets.
Not ideal if your AI application is not built on .NET, or if you need to tokenize two separate text segments joined by a separator token (as in BERT's sentence-pair inputs).
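To make the underlying algorithm concrete, here is a minimal Python sketch of WordPiece tokenization, the greedy longest-match-first scheme BERT tokenizers implement. This is an illustration, not the library's code: the function name, the toy vocabulary, and the `[UNK]` fallback behavior are assumptions for the example.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking until a match.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the ## prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]  # no prefix matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary for demonstration only.
vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", vocab))    # ['play', '##ing']
```

In practice the library maps each resulting piece to its integer ID from the vocabulary file and emits the matching attention mask and token type IDs alongside.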
Stars
53
Forks
11
Language
C#
License
MIT
Category
Last pushed
Nov 16, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/georg-jung/FastBertTokenizer"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
Related tools
ml-rust/splintr
A high-performance tokenizer (BPE + SentencePiece) built with Rust with Python bindings, focused...
sanderland/script_tok
Code for the paper "BPE stays on SCRIPT"
ash-01xor/bpe.c
A simple byte-pair encoding mechanism for tokenization, written purely in C.
U4RASD/r-bpe
R-BPE: Improving BPE-Tokenizers with Token Reuse
jmaczan/bpe-tokenizer
Byte-Pair Encoding tokenizer for training large language models on huge datasets