AI-Hypercomputer/JetStream
JetStream is a throughput- and memory-optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in the future -- PRs welcome).
This project helps machine learning engineers and researchers efficiently run large language models (LLMs) on specialized hardware such as Google's TPUs. It takes a trained LLM (built with frameworks like JAX or PyTorch) and serves predictions or generated text faster and with less memory, even under heavy demand. It's designed for teams that need to deploy and serve LLMs to end users at scale.
Use this if you are a machine learning engineer or researcher looking to optimize the performance, speed, and memory usage of your large language models when running them on XLA devices such as TPUs, especially for high-throughput serving.
Not ideal if you are a business user or data analyst without a technical background in machine learning deployment and infrastructure, or if you are primarily working with standard CPU or GPU environments without XLA-specific optimization needs.
Stars: 415
Forks: 58
Language: Python
License: Apache-2.0
Category:
Last pushed: Jan 05, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/AI-Hypercomputer/JetStream"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
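For scripted access, the curl example above can be reproduced in Python. This is a minimal sketch: the endpoint path is taken from the example, but the response format and the mechanism for supplying an API key are assumptions not documented here, so the snippet only builds the URL and issues a plain GET.

```python
from urllib.parse import quote
from urllib.request import urlopen

BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"

def quality_url(owner: str, repo: str) -> str:
    # Build the per-repo endpoint URL; the path shape mirrors the curl example.
    return f"{BASE}/{quote(owner)}/{quote(repo)}"

def fetch_quality(owner: str, repo: str) -> bytes:
    # Plain GET with no key (the free 100 requests/day tier).
    # How a paid key is attached (header vs. query param) is not documented
    # here, so no auth is included in this sketch.
    with urlopen(quality_url(owner, repo)) as resp:
        return resp.read()

if __name__ == "__main__":
    print(quality_url("AI-Hypercomputer", "JetStream"))
```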
Related projects
vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
sgl-project/sglang
SGLang is a high-performance serving framework for large language models and multimodal models.
alibaba/MNN
MNN: A blazing-fast, lightweight inference engine battle-tested by Alibaba, powering...
xorbitsai/inference
Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source,...
tensorzero/tensorzero
TensorZero is an open-source stack for industrial-grade LLM applications. It unifies an LLM...