helpmefindaname/transformer-smaller-training-vocab

Temporarily remove unused tokens during training to save RAM and speed up training.

Quality score: 48 / 100 (Emerging)

This tool helps machine learning engineers or researchers who are fine-tuning large language models to save memory and speed up training. It takes your pre-trained transformer model and training dataset, and temporarily reduces the model's vocabulary to only include tokens present in your data. This results in faster training and lower GPU memory usage, while still allowing you to save the full model afterward.

Used by 1 other package. No commits in the last 6 months. Available on PyPI.

Use this if you are training a transformer model and notice that many tokens in the model's full vocabulary are not actually used in your specific training data, causing unnecessary memory consumption and slower training.
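The core idea can be sketched in plain Python, independent of this library's actual API: collect the token ids that actually appear in the training data, keep only those embedding rows during training, and write the fine-tuned rows back into the full matrix afterward. All names below are illustrative, not taken from the package.

```python
def reduce_vocab(embeddings, tokenized_corpus):
    """Keep only embedding rows for token ids seen in the training data.

    Returns the reduced rows, a map old_id -> new_id for re-encoding the
    corpus, and the kept ids (needed to restore the full matrix later).
    """
    used_ids = sorted({tid for sent in tokenized_corpus for tid in sent})
    old_to_new = {old: new for new, old in enumerate(used_ids)}
    reduced = [embeddings[i] for i in used_ids]
    return reduced, old_to_new, used_ids


def restore_vocab(full_embeddings, trained_reduced, kept_ids):
    """Write the fine-tuned rows back into the full embedding matrix,
    leaving rows for unseen tokens untouched."""
    restored = list(full_embeddings)
    for new_id, old_id in enumerate(kept_ids):
        restored[old_id] = trained_reduced[new_id]
    return restored


# Toy example: a 10-row "embedding matrix", but the corpus uses only 4 ids.
full = [[float(i)] * 3 for i in range(10)]
corpus = [[2, 5, 2], [7, 1]]

reduced, old_to_new, kept = reduce_vocab(full, corpus)
print(len(reduced))   # 4 rows instead of 10
print(old_to_new)     # {1: 0, 2: 1, 5: 2, 7: 3}
```

The saving scales with how little of the vocabulary your data touches; multilingual models with 250k-token vocabularies fine-tuned on a single language are the extreme case.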

Not ideal if you are using a 'slow' tokenizer other than XLMRobertaTokenizer, RobertaTokenizer, or BertTokenizer; slow-tokenizer support is limited to those three.

natural-language-processing machine-learning-engineering deep-learning-optimization transformer-training large-language-models
Status: Stale (6 months)
Maintenance: 2 / 25
Adoption: 7 / 25
Maturity: 25 / 25
Community: 14 / 25


Stars: 23
Forks: 4
Language: Python
License: MIT
Last pushed: Jun 15, 2025
Commits (30d): 0
Dependencies: 2
Reverse dependents: 1

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/helpmefindaname/transformer-smaller-training-vocab"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.