harleyszhang/lite_llama

A lightweight, Llama-like LLM inference framework built on Triton kernels.

Quality score: 42 / 100 (Emerging)

This project helps machine learning engineers and researchers quickly run large language models (LLMs) such as Llama3 and Qwen2.5 on their own hardware. It takes trained LLM weights as input and provides a significantly faster, more memory-efficient path to text generation, image generation, and chat responses. The target users are those who deploy or experiment with LLMs on GPUs.
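For readers unfamiliar with Triton, the kernels such a framework is built on look roughly like the sketch below: a generic vector-add kernel in the style of the official Triton tutorials. This is an illustration only, not code from the lite_llama repository.

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds lanes
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # x and y must be CUDA tensors of the same shape.
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

Real inference kernels (attention, normalization, rotary embeddings) follow the same pattern but fuse more work per kernel launch, which is typically where the speed and memory savings come from.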


Use this if you need to run Llama-like large language models more quickly and with less GPU memory than standard frameworks, especially for generating text or interacting with models in a chat-like interface.

Not ideal if you are looking for a tool to train LLMs from scratch, or if you need to deploy models without access to NVIDIA or AMD GPUs.

Tags: Large Language Models · MLOps · GPU Optimization · AI Inference · Model Deployment
No License · No Package · No Dependents
Score breakdown:
Maintenance: 6 / 25
Adoption: 10 / 25
Maturity: 8 / 25
Community: 18 / 25


Stars: 174
Forks: 27
Language: Python
License: none
Last pushed: Jan 05, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/harleyszhang/lite_llama"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
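To consume the same endpoint from Python rather than curl, a minimal sketch follows. It assumes the endpoint returns a JSON body; the schema is not documented here, so inspect the response before depending on specific fields.

import requests

URL = (
    "https://pt-edge.onrender.com/api/v1/quality/"
    "transformers/harleyszhang/lite_llama"
)

resp = requests.get(URL, timeout=10)
resp.raise_for_status()  # fail loudly on HTTP errors
data = resp.json()       # assumption: the endpoint returns JSON
print(data)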