kubeflow/trainer
Distributed AI Model Training and LLM Fine-Tuning on Kubernetes
This platform helps AI practitioners efficiently train and fine-tune large AI models, including Large Language Models (LLMs), on powerful computing clusters. You provide your model architecture and training data, and the platform manages the complexity of distributing the workload across multiple GPUs and machines to produce an optimized, trained model. It's designed for AI engineers, data scientists, and ML researchers working on large-scale AI projects.
2,050 stars. Actively maintained with 29 commits in the last 30 days.
Use this if you need to train or fine-tune very large AI models, especially LLMs, and require a scalable, distributed system to manage your multi-GPU and multi-node computing resources efficiently.
Not ideal if you are working with smaller AI models that can be trained on a single machine, or if you prefer not to manage Kubernetes-based infrastructure for your AI workloads.
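Under the hood, training runs are declared as Kubernetes custom resources. The fragment below is a minimal sketch of a TrainJob, assuming the `trainer.kubeflow.org/v1alpha1` API group and field names (`runtimeRef`, `numNodes`, `resourcesPerNode`) used in the project's examples; verify against the repo's own manifests before use:

```yaml
# Hypothetical TrainJob sketch: 2 nodes, 1 GPU each,
# referencing a PyTorch distributed runtime by name.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: example-train-job
spec:
  runtimeRef:
    name: torch-distributed
  trainer:
    numNodes: 2
    resourcesPerNode:
      requests:
        nvidia.com/gpu: 1
```

Applying a resource like this (`kubectl apply -f trainjob.yaml`) hands scheduling and multi-node coordination to the controller rather than your own scripts.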
Stars: 2,050
Forks: 925
Language: Go
License: Apache-2.0
Category:
Last pushed: Mar 13, 2026
Commits (30d): 29
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/mlops/kubeflow/trainer"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
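The same endpoint can be called from code. A small Python sketch, assuming the URL pattern `/api/v1/quality/<category>/<owner>/<repo>` generalizes from the single example above (the path segments beyond that example are an inference, not documented here):

```python
# Query the quality API shown above using only the standard library.
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the per-repository quality endpoint URL."""
    return f"{BASE}/{category}/{owner}/{repo}"

def fetch_quality(category: str, owner: str, repo: str) -> dict:
    """Fetch and decode the JSON payload (100 requests/day without a key)."""
    with urllib.request.urlopen(quality_url(category, owner, repo)) as resp:
        return json.load(resp)

print(quality_url("mlops", "kubeflow", "trainer"))
```

With an API key, you would presumably pass it as a header or query parameter; the listing above does not specify the mechanism, so check the service's docs.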
Related tools
nndeploy/nndeploy
An Easy-to-Use and High-Performance AI Deployment Framework
bentoml/BentoML
The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps,...
cncf/llm-in-action
🤖 Discover how to apply your LLM app skills on Kubernetes!
llmcloud24/de.KCD-Summer-School-2024
Learn how to deploy your own LLM in the de.NBI cloud via a step-by-step guided journey...
ray-project/llms-in-prod-workshop-2023
Deploy and Scale LLM-based applications