Serving Infinity Embeddings with Vast.ai
Background:
Infinity Embeddings is a serving framework for embedding models. It is particularly good at enabling embedding, re-ranking, and classification out of the box, and it supports multiple runtime backends so you can deploy on different types of GPUs while still achieving great speed. Infinity Embeddings also supports dynamic batching, which allows it to process requests faster under significant load. One of its best features is that you can deploy multiple models on the same GPU at the same time, which is particularly helpful because embedding models are often much smaller than GPU RAM. We also love that it complies with the OpenAI embeddings spec, which lets developers quickly integrate it into their applications for RAG, clustering, classification, and re-ranking tasks. This guide will show you how to set up Infinity Embeddings to serve embedding models on Vast. We also reference a notebook that you can use here.
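The setup commands themselves are not reproduced here. As a minimal sketch, assuming you are working from the Vast CLI, installing it, authenticating, and searching for a suitable GPU offer looks like this (the search filters are illustrative, not prescriptive):

```bash
# Install the Vast.ai CLI and authenticate with your API key
pip install --upgrade vastai
vastai set api-key <your-api-key>

# Search for offers. Embedding models are small, so a modest GPU is fine.
# These filters are illustrative -- adjust them to your needs.
vastai search offers 'compute_cap >= 750 gpu_ram >= 8 num_gpus = 1 static_ip = true direct_port_count >= 1'
```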
Deploying the Image:
Hosting a Single Embedding Model:
For now, we’ll host just one embedding model. The easiest way to deploy a single model on this instance is to use the command line. Copy and paste a specific instance ID you choose from the list above into `<instance-id>` below.
In particular, we need `v2` so that we use the correct version of the API, `--port 8000` so the server listens on the correct port, and `--model-id michaelfeil/bge-small-en-v1.5` to serve the correct model.
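A sketch of the deployment command, assuming Infinity’s official Docker image (`michaelf34/infinity`) and a port mapping for 8000; the image tag and disk size are assumptions you may need to adjust:

```bash
vastai create instance <instance-id> \
  --image michaelf34/infinity:latest \
  --env '-p 8000:8000' \
  --disk 40 \
  --args v2 --model-id michaelfeil/bge-small-en-v1.5 --port 8000
```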
Connecting and Testing:
Once your instance is done setting up, you should see something like this:
[Screenshots: the IP address view and the instance view in the Vast console, showing the instance’s public IP and forwarded port.]
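Grab the public IP address and the external port mapped to 8000 from the instance view, then check that the server responds. A quick smoke test, assuming those placeholder values (Infinity exposes a `/models` route listing what it is serving):

```bash
# Replace with your instance's public IP and the external port mapped to 8000
export INFINITY_URL=http://<instance-ip>:<external-port>

# List the models the server is currently hosting
curl $INFINITY_URL/models
```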
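Because the server complies with the OpenAI embeddings spec, you can also test it with the OpenAI Python client. A sketch, assuming the placeholder address above and Infinity’s default URL prefix:

```python
from openai import OpenAI

# Infinity does not require an API key by default, so any placeholder works
client = OpenAI(base_url="http://<instance-ip>:<external-port>", api_key="not-needed")

response = client.embeddings.create(
    model="michaelfeil/bge-small-en-v1.5",
    input=["Vast.ai makes GPU compute easy to rent."],
)
print(response.data[0].embedding[:8])  # first few dimensions of the embedding vector
```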
Advanced Usage: Rerankers, Classifiers, and Multiple Models at the Same Time
The following steps will show you how to use rerankers and classifiers, and how to deploy them at the same time. First, we’ll deploy two models on the same GPU and container: the first is a reranker and the second is a classifier. Note that all we’ve done is change the value of `--model-id` and add a second `--model-id` with its own value; these represent the two different models that we’re running.
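A sketch of the two-model deployment, assuming the same image and port as before; the reranker and classifier model IDs below are example choices, not the only options:

```bash
vastai create instance <instance-id> \
  --image michaelf34/infinity:latest \
  --env '-p 8000:8000' \
  --disk 40 \
  --args v2 --model-id mixedbread-ai/mxbai-rerank-xsmall-v1 --model-id SamLowe/roberta-base-go_emotions --port 8000
```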
Reranking and classification requests follow Infinity’s API spec. Add your new IP address and port here:
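First, point a base URL at your instance (the values below are placeholders for your own IP and port):

```python
# Placeholder values -- substitute your instance's public IP and external port
IP_ADDRESS = "<instance-ip>"
PORT = "<external-port>"
base_url = f"http://{IP_ADDRESS}:{PORT}"
```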
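To rerank, POST a query and candidate documents to the `/rerank` route. A sketch using `requests`, assuming the example reranker deployed above:

```python
import requests

response = requests.post(
    f"{base_url}/rerank",
    json={
        "model": "mixedbread-ai/mxbai-rerank-xsmall-v1",
        "query": "Where is Munich?",
        "documents": ["Munich is in Germany.", "The sky is blue."],
    },
)
print(response.json())  # relevance scores; highest for the Germany document
```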
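Classification works the same way via the `/classify` route. A sketch, assuming the example classifier deployed above:

```python
import requests

response = requests.post(
    f"{base_url}/classify",
    json={
        "model": "SamLowe/roberta-base-go_emotions",
        "input": ["I am so happy this deployed on the first try!"],
    },
)
print(response.json())  # predicted labels with scores, e.g. "joy"
```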