vLLM (LLM inference and serving)
Below is a guide for running the vLLM template on Vast. The template contains everything you need to get started, so you only need to specify the model you want to serve and the corresponding vLLM configuration.
For simplicity, we have set the default template model to DeepSeek-R1-Distill-Llama-8B with a limited context window, because it can run on a single GPU with only 21 GB of VRAM; vLLM can also scale easily across multiple GPUs to handle much larger models.
- Set up your Vast account and add credit: review the quickstart guide to get familiar with the service if you do not yet have an account with credits loaded.
The template launches vllm serve automatically, using the configuration defined in the environment variables VLLM_MODEL and VLLM_ARGS. Here's how to set it up:
- Click the pencil button to open up the template editor.
- If you would like to run a model other than the default, edit the VLLM_MODEL environment variable. The default value is deepseek-ai/DeepSeek-R1-Distill-Llama-8B, which is a Hugging Face repository ID.
- You can also set the arguments passed to vllm serve by modifying the VLLM_ARGS environment variable. vLLM is highly configurable, so check the official vLLM documentation, which lists all available startup arguments, before changing anything here; the sketch after this list shows roughly how these two variables are used at startup.
- Save the template. You will find the version you have just modified in the 'My Templates' section of the templates page.
- Select the template you just saved from the 'My Templates' section of the templates page.
- Click the Play icon on this template to view the available offers.
- Use the search filters to select a suitable GPU, ensuring you have sufficient VRAM to load all of the model's layers onto the GPU.
- From the search menu, ensure you have sufficient disk space for the model you plan to run. The disk slider is located under the template icon in the left-hand column. Large models (e.g., 70B parameters) can require dozens of gigabytes of storage. For DeepSeek-R1-Distill-Llama-8B, make sure to allocate over 17 GB of disk space using the slider.
- Click Rent on a suitable instance and wait for it to load.
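To make the roles of VLLM_MODEL and VLLM_ARGS concrete, here is a rough sketch of what the template effectively does at startup. This is not the actual entrypoint script, and the --max-model-len fallback shown is only an illustration of limiting the context window:

```python
import os
import shlex
import subprocess

# Sketch only: the real template entrypoint differs, and the VLLM_ARGS
# fallback below is illustrative, not the template's actual default.
model = os.environ.get("VLLM_MODEL", "deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
extra_args = shlex.split(os.environ.get("VLLM_ARGS", "--max-model-len 8192"))

# Equivalent to running: vllm serve <model> <args...> (serving on port 8000)
subprocess.run(["vllm", "serve", model, *extra_args], check=True)
```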
Once the instance has loaded, you'll be able to click the Open button to access the Instance Portal, where you'll see links to the interactive vLLM API documentation and the Ray control panel.
As vLLM must download your model on first run, it may take some time before the API becomes available. You can follow the startup progress in the instance logs.
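If you would rather check readiness programmatically, a minimal sketch like the one below polls the server's OpenAI-compatible /v1/models endpoint until it responds. The URL placeholders and the token are explained in the notes that follow; the Bearer scheme for the Authorization header is an assumption, so check the template README if it differs:

```python
import time
import requests

URL = "https://INSTANCE_IP:EXTERNAL_PORT/v1/models"      # replace the placeholders
HEADERS = {"Authorization": "Bearer OPEN_BUTTON_TOKEN"}   # Bearer scheme assumed

while True:
    try:
        # verify=False because the instance uses a self-signed certificate
        resp = requests.get(URL, headers=HEADERS, verify=False, timeout=5)
        if resp.ok:
            print("vLLM is ready:", resp.json())
            break
    except requests.RequestException:
        pass
    time.sleep(15)
```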
The vLLM API can be accessed programmatically over HTTPS via the instance's external port mapped to 8000. Keep the following in mind when making requests:
- You must include an Authorization header set to the value of OPEN_BUTTON_TOKEN.
- If you use curl, pass the -k flag to allow insecure SSL connections and transfers, as Vast.ai uses a self-signed certificate.
- Replace INSTANCE_IP and EXTERNAL_PORT with the instance's IP address and the external port mapped to 8000, both shown under the IP button on the instance.
- Update the Authorization header value to match your OPEN_BUTTON_TOKEN. You can get that from any of the links in the Instance Portal or from the Open button on the instance card.
- Modify the prompt, model, and other fields (max_tokens, temperature, etc.) as needed.
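Putting these notes together, here is a minimal Python sketch of such a request using the requests library. It assumes vLLM's OpenAI-compatible /v1/completions route and the Bearer scheme for the Authorization header; verify=False plays the role of curl's -k flag:

```python
import requests

url = "https://INSTANCE_IP:EXTERNAL_PORT/v1/completions"   # replace the placeholders
headers = {"Authorization": "Bearer OPEN_BUTTON_TOKEN"}     # Bearer scheme assumed
payload = {
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "prompt": "Explain what vLLM is in one sentence.",
    "max_tokens": 128,
    "temperature": 0.7,
}

# verify=False corresponds to curl's -k flag (self-signed certificate)
resp = requests.post(url, headers=headers, json=payload, verify=False)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```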
Although the instance starts vllm serve to provide an inference API, the template is also configured with Jupyter and SSH access, so you can interact with vLLM directly in code on your instance. To do this, simply import the vllm modules at the top of your Python script:
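For example, here is a minimal sketch using vLLM's Python API with the template's default model (the max_model_len value is illustrative, and you may need to stop the running vllm serve process first so the GPU is free):

```python
from vllm import LLM, SamplingParams

# Load the default model with a limited context window (illustrative value)
llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B", max_model_len=8192)
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Write a haiku about GPUs."], params)
print(outputs[0].outputs[0].text)
```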
Please see the README on our recommended vLLM template for advanced template configuration and other methods of connecting to and interacting with your instance.