vLLM (LLM inference and serving)
Below is a guide for running the vLLM template on Vast. The template contains everything you need to get started, so you will only need to specify the model you want to serve and the corresponding vLLM configuration. For simplicity, we have set the default template model to DeepSeek-R1-Distill-Llama-8B with a limited context window, because it can run on a single GPU with only 21 GB of VRAM, but vLLM can scale easily over multiple GPUs to handle much larger models.

Set up your account

Set up your Vast account and add credit. If you do not yet have an account with credits loaded, review the quickstart guide to get familiar with the service.

Configure the vLLM template

vllm serve is launched automatically by the template and uses the configuration defined in the environment variables VLLM_MODEL and VLLM_ARGS. Here's how to set it up:

1. Visit the templates page and find the recommended vLLM template.
2. Click the pencil button to open the template editor.
3. If you would like to run a model other than the default, edit the VLLM_MODEL environment variable. The default value is deepseek-ai/DeepSeek-R1-Distill-Llama-8B, which is a Hugging Face repository.
4. You can also set the arguments passed to vllm serve by modifying the VLLM_ARGS environment variable. vLLM is highly configurable, so it's a good idea to check the official documentation before changing anything here; all available startup arguments are listed there.
5. Save the template. You will find the version you have just modified on the templates page in the 'My Templates' section.

Launch your instance

1. Select the template you just saved from the 'My Templates' section of the templates page.
2. Click the play icon on the template to view the available offers.
3. Use the search filters to select a suitable GPU, ensuring that you have sufficient VRAM to load all of the model's layers onto the GPU.
4. Ensure you have sufficient disk space for the model you plan to run. The disk slider is located under the template icon in the left-hand column. Large models (e.g., 70B parameters) can require dozens of gigabytes of storage. For DeepSeek R1 Distill Llama 8B, make sure to allocate over 17 GB of disk space using the slider.
5. Click Rent on a suitable instance and wait for it to load.
6. Once the instance has loaded, you'll be able to click the Open button to access the Instance Portal, where you'll see links to the interactive vLLM API documentation and the Ray control panel.
7. Because vLLM must download your model on first run, it may take some time before the API is available. You can follow the startup progress in the instance logs.

vLLM API usage

The vLLM API can be accessed programmatically at https://INSTANCE_IP:EXTERNAL_PORT, where EXTERNAL_PORT is the external port mapped to container port 8000.

Authentication token: when making requests, you must include an Authorization header with the token value of OPEN_BUTTON_TOKEN.

Sample curl command:

```
curl -k https://INSTANCE_IP:EXTERNAL_PORT/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 7b040f8d37017016a336a804a8039068d7c744850f3a441db48d6da559379058" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "prompt": "San Francisco is a",
    "max_tokens": 128,
    "temperature": 0.6
  }'
```

- -k allows curl to perform insecure SSL connections and transfers, as Vast.ai uses a self-signed certificate.
- Replace INSTANCE_IP and EXTERNAL_PORT with the externally mapped port for 8000, shown under the IP button on the instance.
- Update the Authorization header value to match your OPEN_BUTTON_TOKEN. You can get it from any of the links in the Instance Portal or from the Open button on the instance card.
- Modify the prompt, model, and other fields (max_tokens, temperature, etc.) as needed.
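If you prefer to make the same request from Python rather than curl, the minimal sketch below uses the requests library against the completions endpoint. The placeholder values (INSTANCE_IP, EXTERNAL_PORT, and the token) are assumptions; replace them with the values from your own instance exactly as described for the curl example.

```
# Minimal sketch: the same completions request as the curl example,
# made from Python with the requests library.
import requests

INSTANCE_IP = "INSTANCE_IP"        # replace with your instance's IP address
EXTERNAL_PORT = "EXTERNAL_PORT"    # the externally mapped port for 8000
TOKEN = "OPEN_BUTTON_TOKEN"        # your instance's authentication token

response = requests.post(
    f"https://{INSTANCE_IP}:{EXTERNAL_PORT}/v1/completions",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {TOKEN}",
    },
    json={
        "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "prompt": "San Francisco is a",
        "max_tokens": 128,
        "temperature": 0.6,
    },
    verify=False,  # like curl -k: the instance uses a self-signed certificate
)
print(response.json()["choices"][0]["text"])
```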
vLLM with Python

Although the instance starts vllm serve to provide an inference API, the template has also been configured with Jupyter and SSH access, so you can interact with vLLM in code from your instance. To do this, simply include the vLLM modules at the top of your Python script:

```
from vllm import LLM, SamplingParams
```

A short offline-inference sketch using these imports is shown at the end of this page.

Further reading

Please see the template README file on our recommended vLLM template for advanced template configuration and other methods of connecting to and interacting with your instance.
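As referenced in the vLLM with Python section above, here is a minimal sketch of offline inference run directly on the instance (for example from Jupyter or over SSH). It assumes the default template model and enough free VRAM; loading a second copy of the model while vllm serve is still running may exhaust GPU memory.

```
# Minimal sketch: offline inference with vLLM from a script or notebook
# on the instance. Assumes the default template model and that enough
# VRAM is free (e.g. vllm serve has been stopped first).
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
sampling_params = SamplingParams(temperature=0.6, max_tokens=128)

outputs = llm.generate(["San Francisco is a"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```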