Distributed Computing
Multi-Node Training Using Torch + NCCL
Need RoCE or InfiniBand? Submit a cluster request. Availability is currently limited to A100/H100/H200 machines and longer rentals.

NCCL expects all nodes to be on the same network. By default, Vast instances on different physical machines are on separate bridge networks isolated from the host's LAN, and must go through a NAT to reach the outside internet. Vast now supports creating overlay networks for instances. An overlay network lets client instances on different machines on the same physical LAN share a private, virtual LAN that is separate from both the host network and the networks of other clients' instances. This allows direct communication between instances.
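Once instances can reach each other directly, each training process needs the same rendezvous address and a unique rank. The following is a minimal sketch, not Vast's own tooling: the helper names `rendezvous_env` and `allreduce_check` are hypothetical, and the master address is assumed to be the overlay-network IP of the instance that hosts rank 0. The variable names (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`) are the ones PyTorch's `env://` initialization reads.

```python
import os


def rendezvous_env(master_addr, master_port, node_rank, nnodes,
                   gpus_per_node, local_rank):
    """Build the variables read by torch.distributed's env:// rendezvous.

    master_addr is assumed to be the overlay-network IP of the rank-0 node.
    """
    if not 0 <= node_rank < nnodes:
        raise ValueError("node_rank must be in [0, nnodes)")
    if not 0 <= local_rank < gpus_per_node:
        raise ValueError("local_rank must be in [0, gpus_per_node)")
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        # One process per GPU across all nodes.
        "WORLD_SIZE": str(nnodes * gpus_per_node),
        # Global rank: nodes are numbered first, then GPUs within a node.
        "RANK": str(node_rank * gpus_per_node + local_rank),
        "LOCAL_RANK": str(local_rank),
    }


def allreduce_check(env):
    """Join the NCCL process group and sum one tensor across all ranks."""
    # Imported lazily so rendezvous_env works on machines without PyTorch.
    import torch
    import torch.distributed as dist

    os.environ.update(env)
    torch.cuda.set_device(int(env["LOCAL_RANK"]))
    dist.init_process_group(backend="nccl", init_method="env://")
    try:
        x = torch.ones(1, device="cuda")
        dist.all_reduce(x)  # default op is SUM, so x ends up equal to WORLD_SIZE
        return x.item()
    finally:
        dist.destroy_process_group()
```

For example, `rendezvous_env("10.0.0.1", 29500, node_rank=1, nnodes=2, gpus_per_node=8, local_rank=3)` describes the process for GPU 3 on the second of two 8-GPU nodes; the address and port here are placeholders. Calling `allreduce_check` with that dictionary in every process is one way to confirm that NCCL can communicate across the nodes before starting a real training job.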