Llama.cpp
Llama.cpp is the recommended method for loading these models, as it can directly load a split model file of many parts without first merging them. While it's easy to build llama.cpp inside one of our instances, we will focus on running this model in the Open WebUI template, which contains pre-compiled, CUDA-compatible versions of llama-server and llama-cli.
Open WebUI Template
Open WebUI + Ollama is one of our recommended templates. While its default setup uses Ollama as a backend, it can also access an OpenAI-compatible API, and it has been pre-configured to find one running on http://localhost:20000.
A full guide to getting started with the Open WebUI template is available here.
Ensure you have enough disk space and a suitable GPU configuration. For DeepSeek-R1 Q8_0 you'll need the following (a quick way to check is shown after the list):
- At least 800GB VRAM
- 700GB storage space
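You can confirm both from a terminal on the instance before pulling anything. The commands below are standard tools; /workspace is assumed to be the volume that will hold the model files.

```bash
# Total VRAM per GPU -- the figures should sum to roughly 800GB or more
nvidia-smi --query-gpu=name,memory.total --format=csv

# Free space on the workspace volume that will hold the model files
df -h /workspace
```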
Pulling the model
You will want to download the model files from the DeepSeek-R1 Q8_0 Hugging Face repo to the /workspace/llama.cpp/models directory on your instance. We have included a script with the Ollama + Open WebUI template that you can use to easily download the models.
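The script's name and options can differ between template versions, so as a fallback here is a sketch of an equivalent manual download using huggingface-cli. The repository name unsloth/DeepSeek-R1-GGUF and the Q8_0 include pattern are assumptions; substitute the GGUF repo and quantization you actually intend to serve.

```bash
# Install the Hugging Face CLI if the template does not already provide it
pip install -U "huggingface_hub[cli]"

# Pull only the Q8_0 split files into the directory llama.cpp expects.
# Repo name and include pattern are illustrative -- adjust as needed.
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
  --include "DeepSeek-R1-Q8_0/*" \
  --local-dir /workspace/llama.cpp/models
```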
Serving the model
Once the download has completed, it's time to serve the model using the pre-built llama-server application.
Again, from the terminal, type the following:
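A typical invocation is sketched below. The binary location and the split filename are assumptions (llama-server only needs the first part of a split GGUF and locates the remaining parts itself), and the context size and GPU layer count are starting values to tune for your instance. The port matches the OpenAI-compatible endpoint the Open WebUI template is pre-configured to look for.

```bash
# Paths and filenames below are illustrative; point --model at the first
# split file of the Q8_0 download (llama-server loads the remaining parts).
/workspace/llama.cpp/llama-server \
  --model /workspace/llama.cpp/models/DeepSeek-R1-Q8_0/DeepSeek-R1-Q8_0-00001-of-00015.gguf \
  --host 127.0.0.1 \
  --port 20000 \
  --n-gpu-layers 999 \
  --ctx-size 8192
```

Once llama-server reports that it is listening, Open WebUI should discover the model through its pre-configured endpoint; from the terminal you can also sanity-check it with curl http://localhost:20000/v1/models.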
Building Llama.cpp
If you prefer to build llama.cpp yourself, you can run the following from any Vast-built template. The Recommended Nvidia CUDA template would be an ideal start.
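A minimal CUDA build is sketched below; the clone location under /workspace is an assumption, and the CUDA toolkit is expected to already be present in the template.

```bash
# Build prerequisites (already present in most templates)
apt-get update && apt-get install -y git cmake build-essential

cd /workspace
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Configure with CUDA enabled, then build the release binaries into build/bin
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j"$(nproc)"
```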
This builds the llama-quantize, llama-cli, llama-server, and llama-gguf-split tools.
For advanced build instructions, see the official documentation on GitHub.