Data Movement
Vast.ai currently supports several built-in mechanisms to copy data to/from instance storage:
- Instance<->Instance and Instance<->Local copy using the CLI copy command
- Instance<->Instance copy in the GUI instance control panel
- Instance<->Cloud copy using the GUI instance control panel
These are in addition to standard SSH-based copy tools such as scp or sftp, the built-in Jupyter HTTP upload/download, and any other Linux tools you can run inside the instance yourself (rclone, rsync, BitTorrent, etc.).
The 3 built-in methods discussed here are unique in that they offer ways to copy data to/from a stopped instance, with some constraints.
You can use the CLI copy command to copy directories between a remote instance and your local machine, or between two remote instances. You can use the copy buttons in the GUI to copy data between two remote instances. The copy command uses rsync and is generally fast and efficient, subject to the upload/download bandwidth of the single network link involved.
Example:
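A minimal sketch, assuming the CLI is installed as `vastai` (older installs invoke the script as `./vast.py`):

```bash
# copy the local ~/workspace folder into /workspace on instance 4330147
vastai copy ~/workspace 4330147:/workspace
```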
That will copy the local ~/workspace (the workspace folder in the current user's home directory) into the absolute path /workspace on instance 4330147.
Currently, one endpoint of the copy must involve a Vast instance on a machine with open ports. For a remote->local or local->remote copy, the remote instance must be on a machine with open ports (although the instance itself does not need open ports), and the remote instance can be stopped/inactive. For instances on machines without open ports, copy to/from local is not available, but you can still copy to a second Vast instance with open ports.
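For example, pulling data back from such a remote instance to your local machine might look like this (the local destination path is a placeholder; the instance ID is reused from the example above):

```bash
# remote -> local: pull /workspace from instance 4330147 into a local folder
vastai copy 4330147:/workspace ~/workspace
```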
For a remote->remote copy (a copy between two instances), the src can be stopped and does not need open ports, but the dst must be a running instance with open ports. It is not sufficient for the instance to be on a machine with open ports; the instance itself must have been created with open port mappings. If the instance was created with the direct connect option (for the jupyter or ssh launch modes), it will have at least one open port. Otherwise, for proxy or entrypoint instance types, you can get open ports by using the -p option to reserve a port in the instance configuration under run options (and you must then also pick a machine with open ports). A sketch of such a copy follows.
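A remote->remote copy uses an instance ID on both sides. Both instance IDs below are hypothetical; the destination must be a running instance with open ports:

```bash
# remote -> remote: src instance may be stopped, dst must be running with open ports
vastai copy 1234567:/workspace 7654321:/workspace
```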
You should not copy to /root or / as a destination directory, as this can mess up the permissions on your instance's ssh folder, breaking future copy operations (which rely on ssh authentication).
If your data is already stored in the cloud (S3, Google Drive, etc.), then you should naturally use the appropriate Linux CLI commands to download and upload data directly, or you could use the cloud sync feature. This will generally be one of the fastest methods for moving large quantities of data, as it can fully saturate a large number of download links. If you are using multiple instances with significant data movement requirements, you will want to use high-bandwidth cloud storage to avoid any single-machine bottlenecks.
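As an illustration, pulling a dataset directly from S3 inside the instance might look like the following (the bucket and paths are hypothetical, and you must supply your own credentials):

```bash
# install and configure the AWS CLI inside the instance, then pull the data directly
pip install awscli
aws configure                                   # enter your access key, secret, and region
aws s3 sync s3://my-bucket/dataset /workspace/dataset
```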
If you launched a Jupyter notebook instance, you can use its upload feature, but this has a file size limit and can be slow.
You can also use standard Linux tools like scp, ftp, rclone, or rsync to move data. For moving code and smaller files, scp is fast enough and convenient. However, be warned that the default ssh connection uses a proxy and can be slow for large transfers (a direct ssh connection is recommended).
Instance-to-instance copy is generally as fast as the other methods, and can be much faster (and cheaper) for moving data between instances in the same datacenter.
If you launched an ssh instance, you can copy files using scp. The default ssh connection uses a proxy and thus can be slow (in terms of both latency and bandwidth), so we recommend using scp over the default ssh connection only for smaller transfers (less than 1 GB). For larger inbound transfers, a direct connection is recommended. Downloading from a cloud data store using wget or curl can have much higher performance.
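For instance, a direct download from a public URL might look like this (the URL and paths are placeholders):

```bash
# download a large archive straight into the instance, bypassing the ssh proxy entirely
wget -O /workspace/dataset.tar.gz https://example.com/dataset.tar.gz
tar -xzf /workspace/dataset.tar.gz -C /workspace
```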
The relevant scp command syntax is:
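(The uppercase fields are placeholders; note that scp takes the port with a capital -P, unlike ssh.)

```bash
scp -P PORT LOCAL_FILE root@IPADDR:/REMOTE_DIR
```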
The PORT and IPADDR fields must match those from the ssh command. The "Connect" button on the instance will give you these fields in the form:
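It typically takes this shape (the trailing port-forwarding flag depends on your launch mode):

```bash
ssh -p PORT root@IPADDR -L 8080:localhost:8080
```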
For example, if Connect gives you this:
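(The port and IP address below are hypothetical placeholders; use the values your Connect button actually shows.)

```bash
ssh -p 7417 root@203.0.113.7 -L 8080:localhost:8080
```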
You could use scp to upload a local file called "myfile.tar.gz" to a remote folder called "mydir" like so:
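Using the hypothetical port and address from above:

```bash
scp -P 7417 myfile.tar.gz root@203.0.113.7:/mydir
```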
You can set up a 2nd direct ssh connection on instances with open ports by running sshd on an open port:
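A sketch, assuming 33333 is one of the open ports mapped to your instance (check the instance card for your actual port mapping):

```bash
# inside the instance: start a second sshd listening on one of the open ports
/usr/sbin/sshd -p 33333
```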
Instances with open ports show their IP address and port range at the top of the instance card. Once you have sshd running on an open port, you can use ssh/scp/rsync normally with the listed IP address and the open port you chose. This direct connection can be faster than the default proxy ssh connection.
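With that in place, a direct transfer might look like the following (the IP address, port, and paths are placeholders read off the instance card):

```bash
# scp over the direct connection
scp -P 33333 myfile.tar.gz root@203.0.113.7:/workspace/
# or rsync over the same direct connection
rsync -avP -e "ssh -p 33333" ./mydata/ root@203.0.113.7:/workspace/mydata/
```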