camail-official/LinearAttentionPruning
This is the official repository for the preprint "The Key to State Reduction in Linear Attention: A Rank-based Perspective".
This tool helps machine learning engineers and researchers make large language models built on linear attention (specifically the DeltaNet and Gated DeltaNet architectures) more efficient. It takes an existing linear attention model, reduces the dimensionality of its query/key (Q/K) projections, and outputs a smaller, faster model that aims to match the original's performance at significantly lower computational cost.
Use this if you need to deploy large language models with linear attention layers more efficiently, reducing their memory footprint and increasing inference speed while minimizing performance degradation.
Not ideal if you are working with transformer architectures that do not use linear attention, or if your primary goal is to improve model accuracy rather than efficiency.
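The repository's actual pruning procedure is not reproduced here, but the core idea of a rank-based reduction of the Q/K dimension can be sketched with a truncated SVD: in linear attention, the score map depends on the projection weights only through the product W_q W_k^T, so factoring its best rank-r approximation yields smaller Q/K projections. A minimal NumPy illustration (all sizes and variable names are hypothetical, not taken from the repo):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, r = 64, 32, 8  # hypothetical sizes; r is the reduced Q/K dimension

# Original Q/K projection weights of one linear-attention head.
W_q = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
W_k = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)

# Attention scores depend on the weights only through W_q @ W_k.T,
# whose rank is at most d_k.
M = W_q @ W_k.T

# A truncated SVD gives the best rank-r factorization of M; splitting the
# singular values between the two factors yields new, narrower projections.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
W_q_small = U[:, :r] * np.sqrt(s[:r])   # (d_model, r)
W_k_small = Vt[:r].T * np.sqrt(s[:r])   # (d_model, r)

# The reduced projections approximate the original score map, so the
# recurrent state K^T V shrinks from d_k rows to r rows.
approx = W_q_small @ W_k_small.T
rel_err = np.linalg.norm(M - approx) / np.linalg.norm(M)
print(f"reduced Q/K dim: {d_k} -> {r}, relative score-map error: {rel_err:.3f}")
```

With the Q/K dimension cut from d_k to r, the per-head recurrent state shrinks proportionally, which is the source of the memory and speed gains the tool targets; how well the approximation holds depends on how quickly the singular values of W_q W_k^T decay.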
Stars: 9
Forks: —
Language: Python
License: —
Category: —
Last pushed: Feb 10, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/camail-official/LinearAttentionPruning"
Open to everyone: 100 requests/day with no API key. Register a free key for 1,000 requests/day.
Higher-rated alternatives
open-mmlab/mmengine
OpenMMLab Foundational Library for Training Deep Learning Models
Xilinx/brevitas
Brevitas: neural network quantization in PyTorch
google/qkeras
QKeras: a quantization deep learning library for Tensorflow Keras
fastmachinelearning/qonnx
QONNX: Arbitrary-Precision Quantized Neural Networks in ONNX
tensorflow/model-optimization
A toolkit to optimize ML models for deployment for Keras and TensorFlow, including quantization...