XunhaoLai/ring-sliding-window-attention
Ring sliding window attention implementation with flash attention
A specialized tool for machine learning engineers training large language models on very long text sequences. It distributes the attention computation across multiple GPUs so that long contexts can be trained efficiently: you pass in the model's query, key, and value tensors, and it returns the attention output.
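To make the core idea concrete, here is a minimal NumPy sketch of causal sliding-window attention on a single device. This is not the repo's API (which builds on flash attention kernels and multi-GPU communication); the function name and shapes are illustrative assumptions.

```python
import numpy as np

def sliding_window_attention(q, k, v, window):
    """Causal sliding-window attention: query i attends to keys j
    with i - window < j <= i. q, k, v have shape (seq_len, head_dim).
    Illustrative sketch only, not the repo's actual interface."""
    seq_len, head_dim = q.shape
    scores = q @ k.T / np.sqrt(head_dim)

    # Mask out keys outside the causal window.
    idx = np.arange(seq_len)
    in_window = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    scores = np.where(in_window, scores, -np.inf)

    # Numerically stable softmax over the allowed keys.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With `window=1` each query attends only to itself, so the output equals `v`; larger windows interpolate toward full causal attention while keeping cost linear in sequence length for fixed window size.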
No commits in the last 6 months.
Use this if you are a machine learning engineer or researcher training large language models with very long input sequences and need to leverage multiple GPUs for efficient computation.
Not ideal if you are working with shorter text sequences, or if you are not using a distributed training setup with multiple GPUs.
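The "ring" part of the name refers to how key/value blocks circulate among GPUs: each rank keeps its query block and receives K/V blocks from its neighbor each step, merging partial results with an online softmax. Below is a single-process NumPy simulation of that pattern (no actual GPUs or communication); all names are hypothetical and the sketch computes full rather than windowed attention for simplicity.

```python
import numpy as np

def ring_attention_sim(q_blocks, k_blocks, v_blocks):
    """Simulate ring attention on one process. Each 'rank' r owns
    q_blocks[r]; K/V blocks rotate around the ring, and partial
    attention is merged with an online (streaming) softmax.
    Illustrative sketch, not the repo's implementation."""
    num_ranks = len(q_blocks)
    head_dim = q_blocks[0].shape[-1]
    outputs = []
    for r in range(num_ranks):
        q = q_blocks[r]
        row_max = np.full(q.shape[0], -np.inf)  # running softmax max
        denom = np.zeros(q.shape[0])            # running softmax denominator
        acc = np.zeros_like(q)                  # running weighted-value sum
        for step in range(num_ranks):
            src = (r - step) % num_ranks        # block arriving this step
            k, v = k_blocks[src], v_blocks[src]
            scores = q @ k.T / np.sqrt(head_dim)
            new_max = np.maximum(row_max, scores.max(axis=-1))
            rescale = np.exp(row_max - new_max)  # re-normalize old partials
            p = np.exp(scores - new_max[:, None])
            denom = denom * rescale + p.sum(axis=-1)
            acc = acc * rescale[:, None] + p @ v
            row_max = new_max
        outputs.append(acc / denom[:, None])
    return np.vstack(outputs)
```

After all ring steps, every rank has seen every K/V block exactly once, so the result matches attention over the full concatenated sequence while each device only ever held one block at a time.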
Stars: 9
Forks: —
Language: Python
License: MIT
Category: —
Last pushed: Jul 25, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/XunhaoLai/ring-sliding-window-attention"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
fla-org/flash-linear-attention
🚀 Efficient implementations of state-of-the-art linear attention models
thu-ml/SageAttention
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x...
thu-ml/SpargeAttn
[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.
fla-org/flame
🔥 A minimal training framework for scaling FLA models
foundation-model-stack/fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for...