higgsfield-ai/higgsfield

Fault-tolerant, highly scalable GPU orchestration and a machine learning framework designed for training models with billions to trillions of parameters

Score: 49 / 100 (Emerging)

This project helps machine learning engineers and researchers efficiently train extremely large AI models, such as large language models (LLMs), across multiple GPUs and servers. It takes your Python training code, allocates computational resources, monitors training progress, and handles fault tolerance. The output is a fully trained model, at the scale of billions to trillions of parameters, ready for deployment or further experimentation. A sketch of the kind of boilerplate such a framework automates follows below.
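
For context, the sketch below shows the generic PyTorch DistributedDataParallel boilerplate that orchestration frameworks like this one aim to automate. This is not higgsfield's own API; the model, hyperparameters, and training loop are placeholders.

    # Generic PyTorch distributed-training boilerplate; not higgsfield's API.
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # placeholder model
        model = DDP(model, device_ids=[local_rank])
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

        for step in range(10):  # placeholder training loop
            x = torch.randn(32, 1024, device=local_rank)
            loss = model(x).pow(2).mean()
            optimizer.zero_grad()
            loss.backward()  # DDP all-reduces gradients across ranks here
            optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()  # launch with: torchrun --nproc_per_node=8 train.py

Frameworks in this space take on the parts the sketch leaves out: process launching across machines, resource allocation, progress monitoring, and recovery from failed workers.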

3,558 stars. No commits in the last 6 months.

Use this if you are a machine learning engineer or researcher struggling with the complexity and resource management challenges of training massive deep learning models on distributed GPU infrastructure.

Not ideal if you are working with smaller models that can be trained on a single GPU or if you prefer manual orchestration of your distributed training jobs.

Topics: large-language-models, distributed-training, GPU-orchestration, deep-learning-research, ML-infrastructure
Status: Stale (no commits in 6 months), no package published, no known dependents
Maintenance: 0 / 25
Adoption: 10 / 25
Maturity: 16 / 25
Community: 23 / 25

The four component scores sum to the overall 49 / 100.


Stars: 3,558
Forks: 590
Language: Jupyter Notebook
License: Apache-2.0
Last pushed: May 25, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/higgsfield-ai/higgsfield"

Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
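
The same endpoint can be queried from Python. A minimal sketch using the requests library; the shape of the JSON payload is an assumption, not a documented schema.

    # Fetch the quality data shown above via the API.
    import requests

    url = "https://pt-edge.onrender.com/api/v1/quality/transformers/higgsfield-ai/higgsfield"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    data = resp.json()
    print(data)  # inspect the payload; exact field names are not documented here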