harleyszhang/lite_llama
A lightweight Llama-like LLM inference framework built on Triton kernels.
This project helps machine learning engineers and researchers quickly run large language models (LLMs) such as Llama 3 and Qwen2.5 on their own hardware. It takes trained LLM weights as input and serves text generation, image-conditioned generation, and chat responses faster and with less memory than standard frameworks. The target users are those who deploy or experiment with LLMs on GPUs.
174 stars.
Use this if you need to run Llama-like large language models more quickly and with less GPU memory than standard frameworks, especially for generating text or interacting with models in a chat-like interface.
Not ideal if you are looking for a tool to train LLMs from scratch, or if you need to deploy models without access to NVIDIA or AMD GPUs.
Stars
174
Forks
27
Language
Python
License
—
Category
Last pushed
Jan 05, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/harleyszhang/lite_llama"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
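A minimal sketch of calling the same endpoint from Python instead of curl. The URL pattern is taken from the command above; the response schema is not documented here, so the fields you get back are an assumption and should be inspected before use.

```python
import json
import urllib.request

# Base path from the curl example above; the "transformers" segment is the
# catalog name used by this endpoint.
BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"

def quality_url(owner: str, repo: str) -> str:
    """Build the quality-API URL for a given GitHub owner/repo pair."""
    return f"{BASE}/{owner}/{repo}"

def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch and decode the JSON payload (requires network access).
    The payload's field names are not specified here, so treat the
    returned dict as opaque until inspected."""
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)

# Example (performs a real request, subject to the 100 requests/day limit):
# data = fetch_quality("harleyszhang", "lite_llama")
# print(json.dumps(data, indent=2))
```

With an API key, you would typically pass it in a request header, though the exact header name is not documented on this page.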
Higher-rated alternatives
hkproj/pytorch-llama
LLaMA 2 implemented from scratch in PyTorch
4AI/LS-LLaMA
A Simple but Powerful SOTA NER Model | Official Code For Label Supervised LLaMA Finetuning
luchangli03/export_llama_to_onnx
export llama to onnx
ayaka14732/llama-2-jax
JAX implementation of the Llama 2 model
liangyuwang/zo2
ZO2 (Zeroth-Order Offloading): Full Parameter Fine-Tuning 175B LLMs with 18GB GPU Memory [COLM2025]