# Multi-Node Training using Torch + NCCL
- **Need RoCE or Infiniband?** Submit a Cluster Request.
- Availability is currently limited to A100/H100/H200 machines.
- Note: private networking is currently only available on Docker-based templates; it is not available for VM-based templates.

NCCL expects all nodes to be on the same network. By default, Vast instances on different physical machines are on separate bridge networks, isolated from the host's LAN, and must go through a NAT to reach the outside internet. Vast now supports creating overlay networks for instances, allowing client instances on different machines on the same physical LAN to share a private, virtual LAN separate from both the host network and the networks of other clients' instances.

Overlay networks can be created for instances located in the same physical cluster. These are groups of machines that support fast local networking to each other. This allows direct communication between the instances on all ports, which is what NCCL expects.

## Creating a Virtual Cluster

1. Make sure to update to/install the newest version of the CLI first: go to our CLI docs and copy+run the command starting with `wget`.
2. View physical clusters with instances matching your requirements by running `./vast search offers --raw 'cluster_id != None' [your instance search filters] | grep cluster_id`. This will print out cluster IDs for clusters with offers available for instances matching your search parameters.
3. For a detailed view of the available offers within a specific cluster, run `./vast search offers 'cluster_id = CLUSTER_ID' [your instance search filters]`.
4. Once you've chosen a physical cluster, create your overlay network inside the cluster: `./vast create overlay CLUSTER_ID NAME_FOR_NETWORK_TO_CREATE`.
5. Search for instance offers in the physical cluster you created your overlay network in: `./vast search offers 'cluster_id = CLUSTER_ID' [your instance search filters]`.
6. Create instances attached to your overlay by appending `--env "-n YOUR_NETWORK_NAME"` to your `./vast create instance` command.

## TCP Initialization for NCCL + PyTorch

Depending on your setup, you will have one or more worker processes running on each node. NCCL expects each worker process to be assigned a unique rank that is an integer from 0 to (num_workers - 1). NCCL expects to be able to perform a TCP rendezvous during initialization at the local IP address of the node running the rank 0 worker process.

### Finding the IPv4 Address for TCP Rendezvous

On the node that will run the rank 0 worker, run `ip a` (`apt install iproute2` if not already installed). You should have three network interfaces: `lo`, `eth0`, and `eth1`. Unless you added/removed networks after instance creation, `eth0` should be the interface to the overlay network between your instances (`lo` is the loopback interface; `eth1` is a bridge to the host machine's gateway to the external internet). Under the `eth0` entry, there should be a line that starts with `inet IPV4ADDRESS/MASK`; this IPV4ADDRESS is the address you will want to use for TCP initialization.

### Running the Training Script

In your training script, you'll want to initialize your process group at the beginning of every worker process with the parameters `backend='nccl'` and `init_method='tcp://IPV4ADDRESS:PORT'`, where IPV4ADDRESS is the IPv4 address of your `eth0` device as found using the instructions above, and PORT is a free port number chosen between 1000 and 65535 (all ports are exposed between instances on the same overlay network). You may need to set the `NCCL_SOCKET_IFNAME=eth0` environment variable for the script, as NCCL is sometimes unable to detect that the `eth1` devices on the different nodes are not directly connected to each other.
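As a minimal sketch of that initialization (the address, port, rank, world size, and the `all_reduce` sanity check below are illustrative placeholders, not values prescribed by this guide):

```python
import torch
import torch.distributed as dist

# Illustrative placeholders -- substitute the eth0 address found with `ip a`
# and any free port above 1000.
IPV4ADDRESS = "10.0.0.1"
PORT = 5000
RANK = 0         # unique integer in [0, num_workers - 1]; 0 on the rendezvous node
WORLD_SIZE = 2   # total number of worker processes across all nodes

dist.init_process_group(backend='nccl',
                        init_method=f'tcp://{IPV4ADDRESS}:{PORT}',
                        rank=RANK, world_size=WORLD_SIZE)

# Optional sanity check: every rank contributes 1.0, so the result should equal WORLD_SIZE.
x = torch.ones(1, device='cuda:0')
dist.all_reduce(x)
print(x.item())

dist.destroy_process_group()
```

If you launch a script like this with `NCCL_SOCKET_IFNAME=eth0` on each node (with a different `RANK` per worker), the printed value should match `WORLD_SIZE`.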
## Other Debugging Notes

- NCCL may not initialize all channels until the first communication function is called.
- Setting the `NCCL_DEBUG=INFO` environment variable may be useful for getting additional debug info.
- PyTorch sometimes does not block on communication methods finishing until the output tensors are actually used.

## Example

Here we will use a Python script called `nccl_speedtest.py` with the following contents:

```python
import torch as t
import torch.distributed as dist
import sys
import time
import string

# Tests NCCL bandwidth between two nodes.
# Run this script on both nodes, setting one as rank 0 and the other as rank 1.
# Invoke: python3 nccl_speedtest.py NODE_0_IP:PORT SIZE[k|m|g] RANK(0|1)

if __name__ == "__main__":
    handshake_ip = sys.argv[1]
    size_s = sys.argv[2]
    # index of the first letter in the size argument (e.g. "10g" -> 2)
    split_idx = len(size_s.rstrip(string.ascii_letters))
    sizes = {"k": 1024, "m": 1024**2, "g": 1024**3, "": 1}
    size = int(size_s[0:split_idx]) * sizes[size_s[split_idx:].lower()]
    rank = int(sys.argv[3])
    if len(sys.argv) >= 5:
        device = int(sys.argv[4])
    else:
        device = 0

    print("initializing tensors...")
    # number of fp32 to allocate is bytes >> 2
    v1 = t.rand(size >> 3, device=f'cuda:{device}')       # for bidirectional test
    warmup1 = t.rand(size >> 13, device=f'cuda:{device}')
    if rank:
        warmup = t.rand(size >> 12, device=f'cuda:{device}')
        v = t.rand(size >> 2, device=f'cuda:{device}')
    else:
        warmup = t.zeros(size >> 12, device=f'cuda:{device}')
        v = t.zeros(size >> 2, device=f'cuda:{device}')

    print("executing nccl tcp handshake...")
    dist.init_process_group(backend='nccl', init_method=f"tcp://{handshake_ip}",
                            rank=rank, world_size=2)

    print("nccl tcp handshake done, warming up connection...")
    if rank:
        dist.send(warmup, 0)
    else:
        dist.recv(warmup, 1)
    ignore = t.sum(warmup).to('cpu')  # force sync

    print("warmup done; starting uni-directional speedtest...")
    start = time.time()
    if rank:
        dist.send(v, 0)
    else:
        dist.recv(v, 1)
    # torch returns from dist.send/dist.recv as soon as the communication channels
    # initialize; it does not block on the full tensor being received.
    # t.sum(v) will block on communication operations on v completing though,
    # so we don't check end time until that is done.
    checksum = t.sum(v).to('cpu')
    end = time.time()
    print(f"checksum: {checksum}")
    print(f"elapsed: {end - start}")
    print(f"unidirectional bandwidth: {size / (end - start) / sizes['m']} MiB/s")

    print("warming up bidirectional speedtest...")
    dist.all_gather_into_tensor(warmup, warmup1)
    print("warmup done, starting bidirectional speedtest...")
    start = time.time()
    dist.all_gather_into_tensor(v, v1)
    checksum = t.sum(v).to('cpu')
    end = time.time()
    print(f"checksum: {checksum}")
    print(f"elapsed: {end - start}")
    print(f"bidirectional bandwidth: {size / (end - start) / sizes['m']} MiB/s")

    print("done, cleaning up!")
    dist.destroy_process_group()
```

We will have rented two instances on the same overlay network already. On the first instance, run `apt update; apt install iproute2`, then run `ip a`. We should get output that looks like this:

```
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
2: eth0@if23: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 62:82:b2:1b:38:a6 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.0.0.1/24 scope global eth0
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 94:04:a2:fb:a1:66 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth1
       valid_lft forever preferred_lft forever
```
From this we see that we will want to use `10.0.0.1` as our rendezvous address; we can choose any available port above 1000 (e.g. `5000`) for our rendezvous port. Then, run:

```
NCCL_SOCKET_IFNAME=eth0 python3 nccl_speedtest.py 10.0.0.1:5000 10g 0
```

The script will start; then, once it reaches `init_process_group`, it will wait for the worker process on the other node to reach the same point and complete the rendezvous before proceeding.

On the second instance, we run:

```
NCCL_SOCKET_IFNAME=eth0 python3 nccl_speedtest.py 10.0.0.1:5000 10g 1
```

Once the script on the second instance reaches the TCP rendezvous, both processes will continue and start communicating over NCCL.
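The example above runs a single worker per node with hard-coded ranks. As a sketch of how the same TCP rendezvous extends to several worker processes per node (e.g. one per GPU), assuming a `NODE_INDEX` environment variable, the constants below, and a `torch.multiprocessing` launcher, none of which are prescribed by this guide:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

# Illustrative placeholders -- substitute your own topology and eth0 address.
RENDEZVOUS = "tcp://10.0.0.1:5000"   # eth0 address of the node running global rank 0
NUM_NODES = 2                        # instances in the overlay network
GPUS_PER_NODE = torch.cuda.device_count()  # assumes every node has the same GPU count
NODE_INDEX = int(os.environ.get("NODE_INDEX", "0"))  # hypothetical: 0 on the rendezvous node


def worker(local_rank: int):
    # Every worker process gets a unique global rank in [0, world_size - 1], as NCCL expects.
    rank = NODE_INDEX * GPUS_PER_NODE + local_rank
    world_size = NUM_NODES * GPUS_PER_NODE
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method=RENDEZVOUS,
                            rank=rank, world_size=world_size)
    # ... training loop using rank / local_rank ...
    dist.destroy_process_group()


if __name__ == "__main__":
    # Launch one worker per GPU on this node; run with NCCL_SOCKET_IFNAME=eth0 as above.
    mp.spawn(worker, nprocs=GPUS_PER_NODE)
```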