georg-jung/FastBertTokenizer
Fast, memory-efficient library for WordPiece tokenization as used by BERT.
This tool helps AI developers working with .NET efficiently preprocess large amounts of text for BERT models. It takes raw text as input and converts it into the numerical token IDs, attention masks, and token type IDs that machine learning models expect. The ideal user is a developer building AI applications or services in a .NET environment that rely on BERT's text processing.
Use this if you need to prepare text for BERT models quickly and memory-efficiently within a .NET application, especially when processing large datasets.
Not ideal if your AI application is not built on .NET, or if you need to tokenize two separate text segments joined by a separator token (as in BERT's sentence-pair inputs).
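To make the underlying algorithm concrete, here is a minimal Python sketch of WordPiece tokenization, the greedy longest-match-first scheme BERT tokenizers implement. This is an illustration, not the library's code: the function name, the toy vocabulary, and the `[UNK]` fallback behavior are assumptions for the example.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking until a match.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the ## prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]  # no prefix matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary for demonstration only.
vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", vocab))    # ['play', '##ing']
```

In practice the library maps each resulting piece to its integer ID from the vocabulary file and emits the matching attention mask and token type IDs alongside.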
Stars
53
Forks
11
Language
C#
License
MIT
Category
Last pushed
Nov 16, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/georg-jung/FastBertTokenizer"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
Related tools
ml-rust/splintr
A high-performance tokenizer (BPE + SentencePiece) built with Rust with Python bindings, focused...
sanderland/script_tok
Code for the paper "BPE stays on SCRIPT"
ash-01xor/bpe.c
A simple byte-pair encoding mechanism for tokenization, written purely in C.
U4RASD/r-bpe
R-BPE: Improving BPE-Tokenizers with Token Reuse
jmaczan/bpe-tokenizer
Byte-Pair Encoding tokenizer for training large language models on huge datasets