Data Movement
You can use the CLI copy command to copy directories between a remote instance and your local machine, or between two remote instances. The copy buttons in the GUI can also copy data between two remote instances. The copy command uses rsync and is generally fast and efficient, subject to the upload/download bandwidth of the single link involved.
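For example, assuming the vastai CLI is installed and you are logged in (the instance ID 1234567 and the paths below are placeholders; substitute your own):

```bash
# Copy a local directory up to a remote instance
vastai copy ./training_data 1234567:/workspace/training_data

# Copy a directory from a remote instance back down to the local machine
vastai copy 1234567:/workspace/results ./results
```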
Currently, one endpoint of the copy must involve a vast instance with open ports. For a remote->local copy or local->remote copy, the remote instance must be on a machine with open ports (although the instance itself does not need open ports), and the remote instance can be stopped/inactive. For instances on machines WITHOUT open ports, copy to/from local is not available, but you can still copy to a 2nd vast instance with open ports.
For a remote->remote copy (a copy between two instances), the source can be stopped and does not need open ports, but the destination must be a running instance with open ports. It is not sufficient that the instance is on a machine with open ports; the instance itself must have been created with open port mappings. If the instance is created with the direct connect option (for the jupyter or ssh launch modes), it will have at least one open port. Otherwise, for proxy or entrypoint instance types, you can get open ports by using the -p option to reserve a port in the instance configuration under run options (and you must also then pick a machine with open ports).
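As a sketch (both instance IDs and the paths below are placeholders), a copy between two instances looks like this:

```bash
# Copy a directory from a (possibly stopped) source instance to a running
# destination instance that was created with open ports; IDs are placeholders
vastai copy 1234567:/workspace/checkpoints 7654321:/workspace/checkpoints
```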
If your data is already stored in the cloud (S3, Google Drive, etc.), then you should naturally use the appropriate Linux CLI tools to download and upload the data directly. This is generally one of the fastest methods for moving large quantities of data, as it can fully saturate a large number of download links. If you are using multiple instances with significant data movement requirements, you will want to use high-bandwidth cloud storage and avoid any single-machine bottlenecks.
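For example, data in an S3 bucket can be pulled straight onto the instance with the AWS CLI (a sketch; the bucket name and paths are placeholders, and the aws CLI must be installed and configured with credentials on the instance):

```bash
# Download a dataset directly from S3 onto the instance and unpack it
aws s3 cp s3://my-bucket/datasets/train.tar.gz /workspace/data/
tar -xzf /workspace/data/train.tar.gz -C /workspace/data/
```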
If you launched a Jupyter notebook instance, you can use its upload feature, but this has a file size limit and can be slow.
You can also use standard Linux tools like scp, ftp, rclone, or rsync to move data. For moving code and smaller files, scp is fast enough and convenient. However, be warned that the default ssh connection uses a proxy and can be slow for large transfers.
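For instance, rsync can push a local directory over the instance's ssh connection (a sketch; PORT and IPADDR are placeholders for the values from your instance's Connect details, and the paths are examples):

```bash
# Sync a local code directory to the instance over ssh
# (rsync must also be installed on the instance)
rsync -avz -e "ssh -p PORT" ./my_project/ root@IPADDR:/workspace/my_project/
```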
If you launched an ssh instance, you can copy files using scp. The default ssh connection goes through a proxy and can be slow in terms of both latency and bandwidth, so we recommend using scp over the default connection only for smaller transfers (less than 1 GB). For larger inbound transfers, a direct connection is recommended. Downloading from a cloud data store using wget or curl can have much higher performance.
The relevant scp command syntax is:
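```bash
# PORT, IPADDR, LOCAL_FILE, and REMOTE_DIR are placeholders
scp -P PORT LOCAL_FILE root@IPADDR:/REMOTE_DIR
```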
The PORT and IPADDR fields must match those from the ssh command. The "Connect" button on the instance will give you these fields in a command of roughly this form:
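```bash
# the exact command may differ slightly; PORT and IPADDR are the fields you need
ssh -p PORT root@IPADDR -L 8080:localhost:8080
```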
For example, if Connect gives you something like this (the port and address below are purely illustrative):
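```bash
# example output only; your port and IP address will differ
ssh -p 7417 root@52.204.230.7 -L 8080:localhost:8080
```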
You could use scp to upload a local file called "myfile.tar.gz" to a remote folder called "mydir" like so:
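```bash
# -P 7417 and 52.204.230.7 are the illustrative port and address from above;
# use the values from your own Connect command
scp -P 7417 myfile.tar.gz root@52.204.230.7:/mydir
```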
Downloads that go through Python libraries can sometimes hang or fail; this seems to be due to bugs in the urllib3 and/or requests libraries used by many Python packages. We recommend using wget to download large files - it is quite robust and recovers from errors gracefully.
To download a Kaggle dataset this way, first get the raw https link. In the Chrome browser, go to the relevant dataset or competition page on the Kaggle website and start downloading the file you want. Then cancel the download and press Ctrl+J to bring up the Chrome Downloads page. At the top is the most recent download, with a name and a link under it. Right-click the link and choose "Copy link address". Then you can use wget with that URL as follows:
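```bash
# replace the placeholder with the copied link, keeping the single quotes
wget 'PASTE_THE_COPIED_KAGGLE_URL_HERE'
```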
Note that the URL needs to be wrapped in single quotes, since Kaggle download links usually contain characters (such as ? and &) that the shell would otherwise interpret.