harleyszhang/lite_llama
A lightweight Llama-like LLM inference framework built on Triton kernels.
This project helps machine learning engineers and researchers quickly run large language models (LLMs) such as Llama 3 and Qwen2.5 on their own hardware. It takes trained LLM weights as input and serves text generation, image-conditioned generation, and chat responses faster and with less memory than standard frameworks. The target users are those who deploy or experiment with LLMs on GPUs.
174 stars.
Use this if you need to run Llama-like large language models more quickly and with less GPU memory than standard frameworks, especially for generating text or interacting with models in a chat-like interface.
Not ideal if you are looking for a tool to train LLMs from scratch, or if you need to deploy models without access to NVIDIA or AMD GPUs.
Stars
174
Forks
27
Language
Python
License
—
Category
Last pushed
Jan 05, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/harleyszhang/lite_llama"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
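A minimal sketch of calling the same endpoint from Python instead of curl. The URL pattern is taken from the command above; the response schema is not documented here, so the fields you get back are an assumption and should be inspected before use.

```python
import json
import urllib.request

# Base path from the curl example above; the "transformers" segment is the
# catalog name used by this endpoint.
BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"

def quality_url(owner: str, repo: str) -> str:
    """Build the quality-API URL for a given GitHub owner/repo pair."""
    return f"{BASE}/{owner}/{repo}"

def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch and decode the JSON payload (requires network access).
    The payload's field names are not specified here, so treat the
    returned dict as opaque until inspected."""
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)

# Example (performs a real request, subject to the 100 requests/day limit):
# data = fetch_quality("harleyszhang", "lite_llama")
# print(json.dumps(data, indent=2))
```

With an API key, you would typically pass it in a request header, though the exact header name is not documented on this page.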
Higher-rated alternatives
hkproj/pytorch-llama
LLaMA 2 implemented from scratch in PyTorch
4AI/LS-LLaMA
A Simple but Powerful SOTA NER Model | Official Code For Label Supervised LLaMA Finetuning
luchangli03/export_llama_to_onnx
export llama to onnx
ayaka14732/llama-2-jax
JAX implementation of the Llama 2 model
liangyuwang/zo2
ZO2 (Zeroth-Order Offloading): Full Parameter Fine-Tuning 175B LLMs with 18GB GPU Memory [COLM2025]