sentencepiece and Tokenizer

SentencePiece is a standalone subword tokenization library, while OpenNMT/Tokenizer is a broader tokenization toolkit that wraps SentencePiece and integrates it alongside BPE as one of several supported segmentation backends for machine translation pipelines.

                 sentencepiece       Tokenizer
Overall score    78 (Verified)       55 (Established)
Maintenance      17/25               6/25
Adoption         15/25               10/25
Maturity         25/25               16/25
Community        21/25               23/25
Stars            11,697              330
Forks            1,333               80
Downloads        —                   —
Commits (30d)    12                  0
Language         C++                 C++
License          Apache-2.0          MIT
Risk flags       None                No Package, No Dependents

About sentencepiece

google/sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

This tool helps machine learning engineers prepare raw text data for training neural network-based text generation models. It takes your raw text (like sentences or documents) and breaks it down into smaller, consistent pieces (subword units) suitable for fixed-vocabulary models. You can then feed these standardized units into your neural network, streamlining the text preparation pipeline for natural language processing tasks.

natural-language-processing machine-translation text-generation text-preparation neural-networks

About Tokenizer

OpenNMT/Tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support

This is a versatile tool for language processing engineers, machine learning scientists, and data scientists who need to prepare raw text for analysis or model training. It takes raw text as input and breaks it down into individual words or subword units (tokens), which are the building blocks for natural language processing tasks. This allows you to precisely control how text is segmented and processed before it’s fed into your algorithms.

natural-language-processing machine-translation text-preprocessing language-modeling data-preparation

Scores updated daily from GitHub, PyPI, and npm data.