microsoft/batch-inference

Dynamic batching library for Deep Learning inference. Tutorials for LLM, GPT scenarios.

/ 100

Emerging

This toolkit helps Python developers serve Deep Learning models efficiently, especially on cloud GPUs, by automatically grouping individual requests into larger batches. Developers provide a model that processes a batch of inputs; the toolkit combines incoming requests into a batch and splits the results back to each original request. Batching raises the number of simultaneous requests a server can handle for tasks such as text embeddings or GPT completions.
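The grouping and splitting described above can be illustrated with a minimal sketch that uses only the Python standard library. This is not the batch-inference API: `DynamicBatcher`, `predict`, `max_batch_size`, and `max_wait_s` are hypothetical names chosen for illustration, and `double_batch` stands in for a real model.

```python
import queue
import threading
import time
from concurrent.futures import Future


class DynamicBatcher:
    """Hypothetical helper: groups single requests into batches for batch_fn."""

    def __init__(self, batch_fn, max_batch_size=8, max_wait_s=0.01):
        self.batch_fn = batch_fn  # takes a list of inputs, returns a list of outputs
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self._requests = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def predict(self, item):
        """Submit one input and block until its own result is ready."""
        future = Future()
        self._requests.put((item, future))
        return future.result()

    def _collect_batch(self):
        # Block for the first request, then wait briefly for more to arrive.
        batch = [self._requests.get()]
        deadline = time.monotonic() + self.max_wait_s
        while len(batch) < self.max_batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(self._requests.get(timeout=remaining))
            except queue.Empty:
                break
        return batch

    def _run(self):
        while True:
            batch = self._collect_batch()
            inputs = [item for item, _ in batch]
            try:
                outputs = self.batch_fn(inputs)
                if len(outputs) != len(inputs):
                    raise ValueError("batch_fn must return one output per input")
            except Exception as exc:
                # Every caller in the failed batch sees the same error.
                for _, future in batch:
                    future.set_exception(exc)
            else:
                # Split the batched outputs back to the original callers.
                for (_, future), output in zip(batch, outputs):
                    future.set_result(output)


def double_batch(values):
    """Stand-in for a model that processes a whole batch at once."""
    return [value * 2 for value in values]
```

Callers on separate threads each invoke `DynamicBatcher(double_batch).predict(x)` on a shared instance as if the model handled one input at a time. The `max_wait_s` setting trades a small amount of latency for larger batches.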

106 stars. No commits in the last 6 months. Available on PyPI.

Use this if you are a Python developer hosting Deep Learning models on cloud servers and want to increase the number of inference requests your server can handle per second.

Not ideal if you work outside Python or do not need high-throughput inference for Deep Learning models.

deep-learning-deployment model-serving cloud-inference large-language-models machine-learning-engineering

Stale 6m

Maintenance 0 / 25

Adoption 9 / 25

Maturity 25 / 25

Community 7 / 25

Stars

106

Forks

Language

Python

License

MIT

Higher-rated alternatives

Blaizzy/mlx-vlm

MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac...

b4rtaz/distributed-llama

Distributed LLM inference. Connect home devices into a powerful cluster to accelerate LLM...

armbues/SiLLM

SiLLM simplifies the process of training and running Large Language Models (LLMs) on Apple...

armbues/SiLLM-examples

Examples for using the SiLLM framework for training and running Large Language Models (LLMs) on...

kolinko/effort

An implementation of bucketMul LLM inference
