AI-Hypercomputer/JetStream
JetStream is a throughput- and memory-optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in the future -- PRs welcome).
This project helps machine learning engineers and researchers efficiently run large language models (LLMs) on specialized hardware such as Google's TPUs. It takes a trained LLM (built with frameworks like JAX or PyTorch) and serves predictions or generated text faster and with less memory, even under heavy demand. It's designed for teams that need to deploy and serve LLMs to end users at scale.
Use this if you are a machine learning engineer or researcher looking to optimize the performance, speed, and memory usage of your large language models when running them on XLA devices such as TPUs, especially for high-throughput serving.
Not ideal if you are a business user or data analyst without a technical background in machine learning deployment and infrastructure, or if you are primarily working with standard CPU or GPU environments without XLA-specific optimization needs.
Stars: 415
Forks: 58
Language: Python
License: Apache-2.0
Category:
Last pushed: Jan 05, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/AI-Hypercomputer/JetStream"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
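For scripted access, the curl example above can be reproduced in Python. This is a minimal sketch: the endpoint path is taken from the example, but the response format and the mechanism for supplying an API key are assumptions not documented here, so the snippet only builds the URL and issues a plain GET.

```python
from urllib.parse import quote
from urllib.request import urlopen

BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"

def quality_url(owner: str, repo: str) -> str:
    # Build the per-repo endpoint URL; the path shape mirrors the curl example.
    return f"{BASE}/{quote(owner)}/{quote(repo)}"

def fetch_quality(owner: str, repo: str) -> bytes:
    # Plain GET with no key (the free 100 requests/day tier).
    # How a paid key is attached (header vs. query param) is not documented
    # here, so no auth is included in this sketch.
    with urlopen(quality_url(owner, repo)) as resp:
        return resp.read()

if __name__ == "__main__":
    print(quality_url("AI-Hypercomputer", "JetStream"))
```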
Related projects
vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
sgl-project/sglang
SGLang is a high-performance serving framework for large language models and multimodal models.
alibaba/MNN
MNN: A blazing-fast, lightweight inference engine battle-tested by Alibaba, powering...
xorbitsai/inference
Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source,...
tensorzero/tensorzero
TensorZero is an open-source stack for industrial-grade LLM applications. It unifies an LLM...