Need RoCE or InfiniBand? Submit a cluster request. Availability is currently limited to A100/H100/H200 machines.
Note: Private networking is currently only available on Docker-based templates; it is not available for VM-based templates.
Creating a Virtual Cluster
- Make sure to update to/install the newest version of the CLI first: go to our CLI docs and copy and run the command starting with wget.
- View physical clusters with instances matching your requirements by running
./vast search offers --raw cluster_id!=None [YOUR_INSTANCE_SEARCH_FILTERS] | grep cluster_id
- This will print out cluster_ids for clusters with offers available for instances matching your search parameters.
- For a detailed view of the available offers within a specific cluster, run
./vast search offers cluster_id=CLUSTER_ID [YOUR_INSTANCE_SEARCH_FILTERS]
- Once you’ve chosen a physical cluster, create your overlay network inside the cluster:
./vast create overlay CLUSTER_ID NAME_FOR_NETWORK_TO_CREATE
- Search for instance offers in the physical cluster you created your overlay network in:
./vast search offers cluster_id=CLUSTER_ID [YOUR_INSTANCE_SEARCH_FILTERS]
- Create instances attached to your overlay by appending
--env "-n YOUR_NETWORK_NAME" to your ./vast create instance command.
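The cluster search in the steps above can also be summarized programmatically instead of with grep. The sketch below assumes that --raw prints a JSON array of offer objects and that each object carries a cluster_id field (the grep in the search command suggests the field name, but the exact layout is an assumption). The helper name offers_per_cluster and the sample data are hypothetical.

```python
import json
from collections import Counter


def offers_per_cluster(raw_json: str) -> Counter:
    """Count offers per cluster_id in JSON text shaped like `search offers --raw` output.

    Assumption: the text is a JSON array of objects with a `cluster_id` key.
    Offers without a cluster (missing key or null) are skipped.
    """
    offers = json.loads(raw_json)
    return Counter(
        offer["cluster_id"]
        for offer in offers
        if offer.get("cluster_id") is not None
    )


# Made-up sample standing in for real CLI output.
sample = (
    '[{"id": 1, "cluster_id": 7}, {"id": 2, "cluster_id": 7},'
    ' {"id": 3, "cluster_id": 9}, {"id": 4, "cluster_id": null}]'
)

for cluster_id, count in offers_per_cluster(sample).most_common():
    print(cluster_id, count)
```

Clusters with more matching offers are listed first, which can help when choosing a cluster large enough for every instance you plan to create.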
TCP Initialization for NCCL + PyTorch
Depending on your setup, you will have one or more worker processes running on each node. NCCL expects each worker process to be assigned a unique integer rank from 0 to NUM_WORKERS - 1. NCCL expects to be able to perform a TCP rendezvous during initialization at the local IP address of the node running the rank 0 worker process.
Finding the IPv4 address for TCP rendezvous
- On the node that will run the rank 0 worker, run ip a (apt install iproute2 if not already installed).
- You should have three network interfaces: lo, eth0, and eth1.
- Unless you added/removed networks after instance creation, eth0 should be the interface to the overlay network between your instances. (lo is the loopback interface; eth1 is a bridge to the host machine’s gateway to the external internet.)
- Under the eth0 entry, there should be a line that starts with inet IPv4ADDRESS/MASK. This IPv4ADDRESS is the address you will want to use for TCP initialization.
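If you would rather extract the address in a script than read it by eye, the lookup described above can be sketched as follows. The function name ipv4_of and the sample text are hypothetical; the sample only imitates the general layout of ip a output (numbered interface headers followed by indented inet lines) and uses made-up addresses.

```python
import re


def ipv4_of(interface: str, ip_a_output: str):
    """Return the first IPv4 address listed under `interface`, or None if absent."""
    current = None
    for line in ip_a_output.splitlines():
        # Interface headers look like "2: eth0@if10: <...>" or "1: lo: <...>".
        header = re.match(r"^\d+:\s+([^:@\s]+)", line)
        if header:
            current = header.group(1)
            continue
        if current == interface:
            # Matches "inet A.B.C.D/MASK" but not "inet6 ...".
            found = re.match(r"^\s+inet\s+(\d+\.\d+\.\d+\.\d+)/\d+", line)
            if found:
                return found.group(1)
    return None


# Made-up sample in the general shape of `ip a` output.
SAMPLE = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    inet 127.0.0.1/8 scope host lo
2: eth0@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:0a:00:00:01 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.1/24 brd 10.0.0.255 scope global eth0
3: eth1@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth1
"""

print(ipv4_of("eth0", SAMPLE))
```

On a real node, the text could come from subprocess.run(["ip", "a"], capture_output=True, text=True).stdout rather than from a sample string.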
Running the training script
- In your training script, you’ll want to initialize your process group at the beginning of every worker process with the parameters backend='nccl' and init_method='tcp://IPv4ADDRESS:PORT', where IPv4ADDRESS is the IPv4 address of the eth0 device on the rank 0 node, as found using the instructions above, and PORT is a free port number chosen between 1000 and 65535 (all ports are exposed between instances on the same overlay network).
- You may need to set the NCCL_SOCKET_IFNAME=eth0 environment variable for the script, as NCCL is sometimes unable to detect that the eth1 devices on the different nodes are not directly connected to each other.
- Other debugging notes:
- NCCL may not initialize all channels until the first communication function is called.
- Setting the NCCL_DEBUG=INFO environment variable may be useful for getting additional debug info.
- PyTorch sometimes does not block on communication methods finishing until the output tensors are actually used.
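The initialization described above can be sketched as follows. The helper names (global_rank, tcp_init_method, init_worker) are hypothetical, and the rank formula assumes every node runs the same number of workers; only torch.distributed.init_process_group and the two parameters named above come from PyTorch itself.

```python
import os


def global_rank(node_rank: int, workers_per_node: int, local_rank: int) -> int:
    """Map (node, worker-on-node) to a unique rank in 0..NUM_WORKERS-1.

    Assumption: every node runs exactly `workers_per_node` workers.
    """
    return node_rank * workers_per_node + local_rank


def tcp_init_method(address: str, port: int) -> str:
    """Build the init_method URL; `address` is the rank 0 node's eth0 IPv4 address."""
    if not 1000 <= port <= 65535:
        raise ValueError("port must be between 1000 and 65535")
    return f"tcp://{address}:{port}"


def init_worker(address: str, port: int, rank: int, world_size: int) -> None:
    """Join the process group; call this at the start of every worker process."""
    # Imported here so the helpers above can be used without PyTorch installed.
    import torch.distributed as dist

    # Keep NCCL on the overlay interface unless the caller already chose one.
    os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")
    dist.init_process_group(
        backend="nccl",
        init_method=tcp_init_method(address, port),
        rank=rank,
        world_size=world_size,
    )


# Two nodes with two workers each: ranks 0 and 1 on node 0, ranks 2 and 3 on node 1.
print([global_rank(n, 2, w) for n in range(2) for w in range(2)])
print(tcp_init_method("10.0.0.1", 5000))
```

Every worker passes the same address and port, since the rendezvous always happens on the node that hosts rank 0; only the rank differs between workers.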
Example
Here we will use a Python script called nccl_speedtest.py with the following contents:
Python
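The commands in this example pass three arguments to nccl_speedtest.py: ADDRESS:PORT, a size such as 10G, and the worker's rank. A minimal sketch of a script with that command-line shape is shown here as an illustration, not as the exact script used in this example. It assumes two workers in total, one GPU per instance, binary size units, and an all-reduce as the measured operation; all of those choices are assumptions.

```python
import sys
import time

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text: str) -> int:
    """Turn a size such as '10G' into a byte count (binary units assumed)."""
    text = text.strip().upper()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def main(argv):
    endpoint, size, rank = argv[1], parse_size(argv[2]), int(argv[3])

    # Imported here so parse_size can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    torch.cuda.set_device(0)  # assumption: one GPU per instance
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{endpoint}",
        rank=rank,
        world_size=2,  # assumption: exactly two workers
    )

    # Send the requested volume in chunks of at most 256 MiB.
    chunk_bytes = min(size, 256 * 1024**2)
    chunk = torch.zeros(max(1, chunk_bytes // 4), dtype=torch.float32, device="cuda")
    rounds = max(1, size // chunk_bytes)

    # Warm-up call: NCCL may not set up all channels until the first collective.
    dist.all_reduce(chunk)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(rounds):
        dist.all_reduce(chunk)
    # Wait for the GPU work to finish; the calls above may return early.
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    if rank == 0:
        total_gb = rounds * chunk_bytes / 1e9
        print(f"all_reduce of {total_gb:.2f} GB took {elapsed:.2f} s")
    dist.destroy_process_group()


if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("usage: nccl_speedtest.py ADDRESS:PORT SIZE RANK")
    else:
        main(sys.argv)
```

The warm-up call and the explicit synchronize before reading the clock reflect the debugging notes above: the first collective may include channel setup, and communication calls may return before the transfer has finished.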
On the first instance, we run apt update; apt install iproute2, then run ip a:
We should get output that looks like this:
Text
From this output, we take 10.0.0.1 as our rendezvous address; we can choose any available port above 1000 (e.g. 5000) for our rendezvous port.
Then, run NCCL_SOCKET_IFNAME=eth0 python3 nccl_speedtest.py 10.0.0.1:5000 10G 0
The script will start; once it reaches init_process_group, it will wait for the worker process on the other node to reach the same point and complete the rendezvous before proceeding.
On the second instance, we run NCCL_SOCKET_IFNAME=eth0 python3 nccl_speedtest.py 10.0.0.1:5000 10G 1
Once the script on the second instance reaches the TCP rendezvous, both processes will continue and start communicating over NCCL.