sayakpaul/count-tokens-hf-datasets

This project shows how to compute the total number of training tokens in a large text dataset from 🤗 Datasets using Apache Beam and Dataflow.

Experimental

This tool helps machine learning engineers and researchers determine the total number of training tokens in very large text datasets from Hugging Face. You provide a dataset and a tokenizer, and it outputs an exact count of the tokens. Knowing that count matters when planning how to train a large language model on the data.
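The dataset-plus-tokenizer flow described above can be illustrated with a minimal sketch of the map-and-sum pattern. This is not the project's actual Apache Beam pipeline: the whitespace tokenizer stands in for a Hugging Face tokenizer, a thread pool stands in for Dataflow workers, and all function names (`whitespace_tokenize`, `shard`, `count_shard_tokens`, `count_tokens`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption); a real run would use a Hugging Face tokenizer."""
    return text.split()


def shard(examples: List[str], num_shards: int) -> List[List[str]]:
    """Split the examples into num_shards roughly equal groups."""
    return [examples[i::num_shards] for i in range(num_shards)]


def count_shard_tokens(examples: List[str], tokenize: Callable[[str], List[str]]) -> int:
    """Map step: tokenize every example in one shard and add up the lengths."""
    return sum(len(tokenize(text)) for text in examples)


def count_tokens(
    examples: List[str],
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
    num_shards: int = 4,
) -> int:
    """Count tokens per shard in parallel, then sum the partial counts."""
    shards = shard(examples, num_shards)
    # The thread pool only mimics distributed workers (assumption).
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        partial_counts = list(
            pool.map(lambda s: count_shard_tokens(s, tokenize), shards)
        )
    # Combine step: a global sum over the per-shard counts.
    return sum(partial_counts)


corpus = [
    "Apache Beam runs the same pipeline locally or on Dataflow.",
    "Token counts guide training budgets.",
    "",
]
total_tokens = count_tokens(corpus, num_shards=2)
print(total_tokens)
```

Because each shard is counted independently and the results are combined with a plain sum, the total does not depend on how the data is split. That property is what makes this kind of count suitable for distributed processing.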

No commits in the last 6 months.

Use this if you need to reliably count tokens across massive text datasets for training large language models.

Not ideal if you are working with small datasets or don't need exact token counts computed with distributed processing.

Natural Language Processing, Large Language Models, Dataset Preparation, Machine Learning Engineering, Cloud Computing

No License, Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 7 / 25

Maturity 8 / 25

Community 4 / 25

Language: Python

License: none listed

Higher-rated alternatives

azukds/tubular

Python package implementing ML feature engineering and pre-processing for polars or pandas dataframes.

huggingface/course

The Hugging Face course on Transformers

huggingface/audio-transformers-course

The Hugging Face Course on Transformers for Audio

rickiepark/nlp-with-transformers

Repository for the example code of <Natural Language Processing with Transformers>.

NielsRogge/Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.
