Key features
- High-speed networking from 1600 to 3200 Gbps within a single data center.
- On-demand clusters are available with 2-8 nodes (16-64 GPUs).
- Contact our sales team for larger clusters (up to 512 GPUs).
- Supports H200, B200, H100, and A100 GPUs.
- Automatic cluster configuration with static IP and environment variables.
- Multiple deployment options for different frameworks and use cases.
Networking performance
Instant Clusters feature high-speed local networking for efficient data movement between nodes:
- Most clusters include 3200 Gbps networking.
- A100 clusters offer up to 1600 Gbps networking.
Zero configuration
Runpod automates cluster setup so you can focus on your workloads:
- Clusters are pre-configured with static IP address management.
- All necessary environment variables for distributed training are pre-configured.
- Supports popular frameworks like PyTorch, TensorFlow, and Slurm.
Get started
Choose the tutorial that matches your preferred framework and use case:
- Deploy a Slurm cluster: Set up a managed Slurm cluster for high-performance computing workloads. Slurm provides job scheduling, resource allocation, and queue management for research environments and batch processing workflows.
- Deploy a PyTorch distributed training cluster: Set up multi-node PyTorch training for deep learning models. This tutorial covers distributed data parallel training, gradient synchronization, and performance optimization techniques.
- Deploy an Axolotl fine-tuning cluster: Use Axolotl's framework for fine-tuning large language models across multiple GPUs. This approach simplifies customizing pre-trained models like Llama or Mistral with built-in training optimizations.
- Deploy an unmanaged Slurm cluster: For advanced users who need full control over Slurm configuration. This option provides a basic Slurm installation that you can customize for specialized workloads.
You can also follow this video tutorial to learn how to deploy Kimi K2 using Instant Clusters.
All accounts have a default spending limit. To deploy a larger cluster, submit a support ticket at help@runpod.io.
Network interfaces
High-bandwidth interfaces (ens1, ens2, etc.) handle communication between nodes, while the management interface (eth0) handles external traffic. The NCCL environment variable NCCL_SOCKET_IFNAME uses all available interfaces by default. PRIMARY_ADDR corresponds to ens1, so you can use it to launch and bootstrap distributed processes.
Instant Clusters support up to 8 interfaces per node. Each interface (ens1 - ens8) provides a private network connection for inter-node communication, made available to distributed backends such as NCCL or GLOO.
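To see which of these interfaces a node actually exposes, you can list them from inside a Pod. The following is a minimal Python sketch (assuming a Linux environment); the names eth0 and ens1 - ens8 come from the description above.

```python
import socket

# List the network interfaces visible to this node (Linux only).
interfaces = [name for _, name in socket.if_nameindex()]

# High-bandwidth cluster links appear as ens1, ens2, ..., ens8;
# eth0 is the management interface for external traffic.
cluster_ifaces = sorted(name for name in interfaces if name.startswith("ens"))

print("Management interface present:", "eth0" in interfaces)
print("Cluster interfaces:", ", ".join(cluster_ifaces))
```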
Environment variables
The following environment variables are present in all nodes in an Instant Cluster:

| Environment variable | Description |
|---|---|
| PRIMARY_ADDR / MASTER_ADDR | The address of the primary node. |
| PRIMARY_PORT / MASTER_PORT | The port of the primary node. All ports are available. |
| NODE_ADDR | The static IP of this node within the cluster network. |
| NODE_RANK | The cluster rank (i.e. global rank) assigned to this node. NODE_RANK = 0 for the primary node. |
| NUM_NODES | The number of nodes in the cluster. |
| NUM_TRAINERS | The number of GPUs per node. |
| HOST_NODE_ADDR | A convenience variable, defined as PRIMARY_ADDR:PRIMARY_PORT. |
| WORLD_SIZE | The total number of GPUs in the cluster (NUM_NODES * NUM_TRAINERS). |
Each node is assigned a static IP (NODE_ADDR) on the overlay network. When a cluster is deployed, the system designates one node as the primary node by setting the PRIMARY_ADDR and PRIMARY_PORT environment variables. This simplifies working with multiprocessing libraries that require a primary node.
The following variables are equivalent:
- MASTER_ADDR and PRIMARY_ADDR
- MASTER_PORT and PRIMARY_PORT
MASTER_* variables are available to provide compatibility with tools that expect these legacy names.
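As an illustration of how these variables fit together, here is a minimal Python sketch that bootstraps a torch.distributed process group by hand (a launcher such as torchrun can do this for you). LOCAL_RANK is an assumption of this sketch, supplied by whatever launches one process per GPU; it is not one of the cluster-provided variables.

```python
import os

import torch
import torch.distributed as dist

# Cluster-provided variables (see the table above).
node_rank = int(os.environ["NODE_RANK"])
num_trainers = int(os.environ["NUM_TRAINERS"])
world_size = int(os.environ["WORLD_SIZE"])
host_node_addr = os.environ["HOST_NODE_ADDR"]  # PRIMARY_ADDR:PRIMARY_PORT

# LOCAL_RANK (0..NUM_TRAINERS-1) is assumed to be set by your per-GPU launcher.
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# Global rank of this process across all nodes in the cluster.
global_rank = node_rank * num_trainers + local_rank

torch.cuda.set_device(local_rank)
dist.init_process_group(
    backend="nccl",
    init_method=f"tcp://{host_node_addr}",
    rank=global_rank,
    world_size=world_size,
)
```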
NCCL configuration for multi-node training
For distributed training frameworks like PyTorch, you must explicitly configure NCCL to use the internal network interface to ensure proper inter-node communication, as in the sketch below.
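The following minimal Python sketch pins NCCL (and Gloo, if used) to the first high-bandwidth interface before any process group is created. The choice of ens1 is an assumption; list additional interfaces (for example "ens1,ens2") if your nodes expose them.

```python
import os

# Set these before torch.distributed / NCCL initializes so collectives
# run over the cluster's high-bandwidth interfaces instead of eth0.
os.environ["NCCL_SOCKET_IFNAME"] = "ens1"  # assumed interface name
os.environ["GLOO_SOCKET_IFNAME"] = "ens1"  # only needed if you use the Gloo backend
```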
When to use Instant Clusters
Instant Clusters offer distributed computing power beyond the capabilities of single-machine setups. Consider using Instant Clusters for:
- Multi-GPU language model training: Accelerate training of models like Llama or GPT across multiple GPUs.
- Large-scale computer vision projects: Process massive imagery datasets for autonomous vehicles or medical analysis.
- Scientific simulations: Run climate, molecular dynamics, or physics simulations that require massive parallel processing.
- Real-time AI inference: Deploy production AI models that demand multiple GPUs for fast output.
- Batch processing pipelines: Create systems for large-scale data processing, including video rendering and genomics.