Need RoCE or InfiniBand? Submit a cluster request. Availability is currently limited to A100/H100/H200 machines.
Note: Private networking is currently only available on Docker-based templates; it is not available for VM-based templates.
Creating a Virtual Cluster
- Make sure to update to or install the newest version of the CLI first: go to our CLI docs, then copy and run the command starting with wget.
- View physical clusters with instances matching your requirements by running
./vast search offers --raw cluster_id!=None [YOUR_INSTANCE_SEARCH_FILTERS] | grep cluster_id
- This will print out cluster_ids for clusters with offers available for instances matching your search parameters.
- For a detailed view of the available offers within a specific cluster, run
./vast search offers cluster_id=CLUSTER_ID [YOUR_INSTANCE_SEARCH_FILTERS]
- Once you’ve chosen a physical cluster, create your overlay network inside the cluster:
./vast create overlay CLUSTER_ID NAME_FOR_NETWORK_TO_CREATE
- Search for instance offers in the physical cluster you created your overlay network in:
./vast search offers cluster_id=CLUSTER_ID [YOUR_INSTANCE_SEARCH_FILTERS]
- Create instances attached to your overlay by appending --env "-n YOUR_NETWORK_NAME" to your ./vast create instance command.
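As an alternative to piping the raw search output through grep, the cluster IDs can be tallied with a few lines of Python. This is a minimal sketch, not part of the CLI: the helper name offers_per_cluster is hypothetical, and it assumes that --raw prints a JSON array of offer objects, each of which may carry a cluster_id field.

```python
import json
from collections import Counter


def offers_per_cluster(raw_json):
    """Count matching offers per cluster_id.

    Assumption: `raw_json` is the text printed by `search offers --raw`,
    a JSON array of offer objects with an optional "cluster_id" field.
    """
    offers = json.loads(raw_json)
    return Counter(
        offer["cluster_id"]
        for offer in offers
        if offer.get("cluster_id") is not None  # skip offers outside any cluster
    )
```

For example, after redirecting the output of the search command into a file, passing that file's contents to offers_per_cluster gives a mapping from each cluster ID to the number of matching offers, which can help when choosing a cluster with enough capacity.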
TCP Initialization for NCCL + PyTorch
Depending on your setup, you will have one or more worker processes running on each node. NCCL expects each worker process to be assigned a unique integer rank from 0 to NUM_WORKERS - 1. During initialization, NCCL expects to be able to perform a TCP rendezvous at the local IP address of the node running the rank 0 worker process.
Finding the IPv4 address for TCP rendezvous
- On the node that will run the rank 0 worker, run ip a (apt install iproute2 if not already installed).
  - You should have three network interfaces: lo, eth0, and eth1.
  - Unless you added or removed networks after instance creation, eth0 should be the interface to the overlay network between your instances. (lo is the loopback interface; eth1 is a bridge to the host machine’s gateway to the external internet.)
    - Under the eth0 entry, there should be a line that starts with inet IPv4ADDRESS/MASK. This IPv4ADDRESS is the address you will want to use for TCP initialization.
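The lookup above can also be scripted. The following is a minimal sketch, assuming the iproute2 ip command is installed and that eth0 is the overlay interface as described; the helper names parse_inet and overlay_address are hypothetical.

```python
import re
import subprocess


def parse_inet(ip_output):
    """Return (address, prefix_length) from the first `inet` line, or None."""
    # Matches e.g. "inet 10.0.0.1/24"; "inet6" lines are not matched.
    match = re.search(r"\binet\s+(\d{1,3}(?:\.\d{1,3}){3})/(\d{1,2})", ip_output)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def overlay_address(ifname="eth0"):
    """Ask `ip` for the IPv4 address of one interface (eth0 by default)."""
    result = subprocess.run(
        ["ip", "-4", "addr", "show", "dev", ifname],
        capture_output=True, text=True, check=True,
    )
    parsed = parse_inet(result.stdout)
    if parsed is None:
        raise RuntimeError(f"no IPv4 address found on {ifname}")
    return parsed[0]
```

Calling overlay_address() on the rank 0 node should return the same IPv4ADDRESS that appears on the inet line of the eth0 entry.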
Running the training script
- In your training script, you’ll want to initialize your process group at the beginning of every worker process with the parameters backend='nccl' and init_method='tcp://IPv4ADDRESS:PORT', where IPv4ADDRESS is the IPv4 address of your eth0 device as found using the instructions above, and PORT is a free port number between 1000 and 65535 (all ports are exposed between instances on the same overlay network).
- You may need to set the NCCL_SOCKET_IFNAME=eth0 environment variable for the script, as NCCL is sometimes unable to detect that the eth1 devices on the different nodes are not directly connected to each other.
- Other debugging notes:
  - NCCL may not initialize all channels until the first communication function is called.
  - Setting the NCCL_DEBUG=INFO environment variable may be useful for getting additional debug info.
  - PyTorch sometimes does not block on communication methods finishing until the output tensors are actually used.
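The initialization described above might be wrapped as follows. This is a sketch under stated assumptions rather than a required pattern: the helper names rendezvous_kwargs and init_worker are hypothetical, world_size is assumed to equal the total number of worker processes, and the port range check simply mirrors the range given above.

```python
import os


def rendezvous_kwargs(address, port, rank, world_size):
    """Build keyword arguments for torch.distributed.init_process_group."""
    if not 1000 <= port <= 65535:
        raise ValueError(f"port {port} is outside 1000-65535")
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} must be in 0..{world_size - 1}")
    return {
        "backend": "nccl",
        "init_method": f"tcp://{address}:{port}",
        "rank": rank,
        "world_size": world_size,
    }


def init_worker(address, port, rank, world_size):
    """Join the process group; every worker calls this with its own rank."""
    # Set before NCCL initializes; setdefault keeps a value that is already exported.
    os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")
    # Imported lazily so rendezvous_kwargs can be used without PyTorch installed.
    import torch.distributed as dist
    dist.init_process_group(**rendezvous_kwargs(address, port, rank, world_size))
```

Every worker passes the same address and port (those of the rank 0 node) and its own rank; the call blocks until all world_size workers have joined.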
Example
Here we will use a Python script called nccl_speedtest.py with the following contents:
Python
On the first instance, which will run the rank 0 worker, we run apt update; apt install iproute2, then run ip a:
We should get output that looks like this:
Text
From this output, we take 10.0.0.1 as our rendezvous address; we can choose any available port above 1000 (e.g. 5000) for our rendezvous port.
Then, run NCCL_SOCKET_IFNAME=eth0 python3 nccl_speedtest.py 10.0.0.1:5000 10G 0
The script will start; once it reaches init_process_group, it will wait for the worker process on the other node to reach the same point and complete the rendezvous before proceeding.
On the second instance, we run NCCL_SOCKET_IFNAME=eth0 python3 nccl_speedtest.py 10.0.0.1:5000 10G 1
Once the script on the second instance reaches the TCP rendezvous, both processes will continue and start communicating over NCCL.
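For reference, a script with this command-line shape could look roughly like the sketch below. It is not the nccl_speedtest.py listing itself. The reading of the three arguments (rendezvous ADDRESS:PORT, payload size, rank), the binary size units, the two-worker world size, the single GPU per worker, and the choice of all_reduce as the timed operation are all assumptions inferred from the commands above.

```python
import sys
import time


def parse_size(text):
    """Convert a size such as '512M' or '10G' to bytes (binary units assumed)."""
    units = {"K": 1024, "M": 1024**2, "G": 1024**3}
    suffix = text[-1].upper()
    if suffix in units:
        return int(float(text[:-1]) * units[suffix])
    return int(text)


def main(argv):
    import torch
    import torch.distributed as dist

    rendezvous, size_text, rank_text = argv[1:4]
    rank = int(rank_text)
    nbytes = parse_size(size_text)

    torch.cuda.set_device(0)  # assumption: one GPU per worker
    # Blocks until both workers have reached this call.
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{rendezvous}",
        rank=rank,
        world_size=2,  # assumption: exactly two workers, one per instance
    )
    payload = torch.ones(nbytes // 4, dtype=torch.float32, device="cuda")

    dist.all_reduce(payload)  # warm-up: NCCL may set up channels on first use
    torch.cuda.synchronize()

    start = time.perf_counter()
    dist.all_reduce(payload)
    torch.cuda.synchronize()  # wait until the collective has actually finished
    elapsed = time.perf_counter() - start

    if rank == 0:
        print(f"all_reduce of {nbytes} bytes took {elapsed:.3f} s")
    dist.destroy_process_group()


if __name__ == "__main__" and len(sys.argv) == 4:
    main(sys.argv)
```

The warm-up call and the explicit torch.cuda.synchronize() calls reflect the debugging notes above: channels may not be initialized until the first communication call, and PyTorch may return from a collective before it has finished.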