softmax1/Flash-Attention-Softmax-N

CUDA and Triton implementations of Flash Attention with SoftmaxN.

Quality score: 42 / 100 (Emerging)

This project provides efficient, numerically stable implementations of the `softmax_n` attention mechanism for transformer models. Given your existing transformer code or a pre-trained model, it replaces the standard softmax with `softmax_n`, producing a modified model that may have fewer activation and weight outliers. It is aimed at machine learning engineers and researchers working on large language models and other transformer-based architectures.
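To make the idea concrete, `softmax_n` adds a constant `n` to the softmax denominator, so attention weights can sum to less than 1 and a head can effectively attend to nothing. A minimal sketch of that formula (this is an illustration of the concept, not this repo's CUDA/Triton API):

```python
import math

def softmax_n(scores, n=1.0):
    # softmax_n(x)_i = exp(x_i) / (n + sum_j exp(x_j)).
    # With n > 0 the outputs can sum to less than 1, letting an
    # attention head emit (near) zero weight everywhere when no
    # token is relevant; n = 0 recovers the standard softmax.
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    # Scale n by exp(-m) so the result equals the exact formula above.
    denom = n * math.exp(-m) + sum(exps)
    return [e / denom for e in exps]

print(sum(softmax_n([1.0, 2.0, 3.0], n=0.0)))   # ~1.0 (standard softmax)
print(sum(softmax_n([-10.0, -10.0], n=1.0)))    # near 0: the head "abstains"
```

The max-subtraction keeps `exp` from overflowing on large scores, which is the same stabilization trick used in standard softmax implementations.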

No commits in the last 6 months. Available on PyPI.

Use this if you are developing or fine-tuning transformer models and want to experiment with `softmax_n` to improve numerical stability or reduce outliers in model activations and weights.

Not ideal if you are not working with transformer models, or if you need GPU features the Triton implementation does not support, such as certain dropout or attention-mask configurations with real-valued `n`.

transformer-models deep-learning model-optimization neural-networks machine-learning-research
Stale (6 months)
Maintenance: 0 / 25
Adoption: 9 / 25
Maturity: 25 / 25
Community: 8 / 25


Stars: 73
Forks: 5
Language: Python
License: GPL-3.0
Last pushed: May 26, 2024
Commits (30d): 0
Dependencies: 2

Get this data via API

```shell
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/softmax1/Flash-Attention-Softmax-N"
```

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
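The same endpoint can be called from a script. A small stdlib sketch that builds the request URL shown in the curl example (the `quality_url` helper is our own illustration, not part of the API; the actual fetch requires network access):

```python
from urllib.parse import quote
from urllib.request import urlopen  # for the actual fetch, if desired

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(ecosystem, repo):
    # Hypothetical helper mirroring the curl example above;
    # path segments are percent-encoded defensively, keeping the
    # owner/name slash intact.
    return f"{BASE}/{quote(ecosystem)}/{quote(repo, safe='/')}"

url = quality_url("transformers", "softmax1/Flash-Attention-Softmax-N")
print(url)
# To fetch the data (network required):
#   body = urlopen(url).read()
```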