Serving Infinity Embeddings with Vast.ai

Background:

Infinity Embeddings is a helpful serving framework for embedding models. It is particularly good at enabling embedding, re-ranking, and classification out of the box. It supports multiple runtime frameworks so it can deploy to different types of GPUs while still achieving great speed. Infinity Embeddings also supports dynamic batching, which allows it to process requests faster under significant load.

One of its best features is that you can deploy multiple models on the same GPU at the same time, which is particularly helpful because embedding models are often much smaller than GPU RAM. We also love that it complies with the OpenAI embeddings spec, which enables developers to quickly integrate it into their applications for RAG, clustering, classification, and re-ranking tasks.

This guide will show you how to set up Infinity Embeddings to serve embedding models on Vast. We reference a notebook that you can use here.

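First, install the Vast CLI in the notebook environment (or your local terminal). A minimal sketch of that step, assuming the CLI is installed from PyPI under the package name vastai:

```bash
# Install (or upgrade) the Vast.ai command-line client
pip install --upgrade vastai
```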


Once you create your account, you can go here to set your API Key.

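A sketch of that step, with a placeholder where your real key goes:

```bash
# Register your Vast.ai API key so later CLI calls are authenticated
vastai set api-key <YOUR-API-KEY>
```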


For serving embedding models, we're looking for a machine that has a static IP address, ports available to host on, and a single modern GPU with decent RAM, since these embedding models are small. We will query the Vast API to get a list of machines of this type.

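A sketch of such a search; the exact filter fields and thresholds used here (single GPU, 12 GB+ of GPU RAM, a static IP, at least one directly mapped port) are assumptions chosen to match the requirements above:

```bash
# List rentable offers that satisfy our hosting requirements
vastai search offers 'num_gpus = 1 gpu_ram >= 12 static_ip = true direct_port_count >= 1 rentable = true'
```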


Deploying the Image:

Hosting a Single Embedding Model:

For now, we'll host just one embedding model. The easiest way to deploy a single model on this instance is to use the command line. Copy and paste a specific instance id you choose from the list above into instance-id below.

We particularly need v2 so that we use the correct version of the API, --port 8000 so that the server listens on the correct port, and --model-id michaelfeil/bge-small-en-v1.5 to serve the correct model.

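A sketch of that command; the Infinity Docker image name (michaelf34/infinity:latest), the disk size, and the port mapping in --env are assumptions, while the arguments after --args are the ones described above:

```bash
# Replace <instance-id> with an offer id from the search results.
# Everything after --args is passed through to the Infinity container.
vastai create instance <instance-id> \
  --image michaelf34/infinity:latest \
  --env '-p 8000:8000' \
  --disk 40 \
  --args v2 --model-id michaelfeil/bge-small-en-v1.5 --port 8000
```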


Connecting and Testing:

Once your instance is done setting up, you should see something like this:

IP_address_view


To connect to your instance, we first need its IP address and port number. Click on the highlighted button to see the IP address and the correct port for our requests.

Instance_view


Now we'll call this with the OpenAI SDK:

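A sketch of the dependency install, assuming the OpenAI Python client is the only package needed:

```bash
# Install the OpenAI Python SDK, which works against Infinity's OpenAI-compatible embeddings route
pip install openai
```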


We will copy over the IP address and the port into the cell below.

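A minimal sketch of that cell; the placeholder address, the dummy API key, and pointing base_url at the instance root (so the client calls /embeddings on the Infinity server) are assumptions:

```python
from openai import OpenAI

# Fill in the IP address and external port shown on your instance card
client = OpenAI(
    base_url="http://<Instance-IP-Address>:<Port>",
    api_key="not-needed",  # Infinity does not require a real key by default
)

# Request embeddings from the model we deployed
response = client.embeddings.create(
    model="michaelfeil/bge-small-en-v1.5",
    input=["Vast.ai makes renting GPUs simple and affordable."],
)

# Print the first few dimensions of the first embedding vector
print(response.data[0].embedding[:8])
```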


Here we can see the embeddings returned by our model. Feel free to delete this instance, as we'll redeploy a different configuration now.

Advanced Usage: Rerankers, Classifiers, and Multiple Models at the Same Time

The following steps will show you how to use rerankers and classifiers, and how to deploy multiple models at the same time. First, we'll deploy two models in the same container on the same GPU: the first is a reranker and the second is a classifier. Note that all we've done is change the value for --model-id and add a second --model-id with its own value. These represent the two different models that we're running.

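A sketch of the two-model deployment; the specific reranker and classifier checkpoints used here (mixedbread-ai/mxbai-rerank-xsmall-v1 and SamLowe/roberta-base-go_emotions) are illustrative assumptions, and any reranker/classifier pair supported by Infinity works the same way:

```bash
# Same create command as before, but now with two --model-id arguments:
# one reranking model and one sequence-classification model on the same GPU
vastai create instance <instance-id> \
  --image michaelf34/infinity:latest \
  --env '-p 8000:8000' \
  --disk 40 \
  --args v2 \
    --model-id mixedbread-ai/mxbai-rerank-xsmall-v1 \
    --model-id SamLowe/roberta-base-go_emotions \
    --port 8000
```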


Now, we'll call these models with the requests library and follow Infinity's API spec. Add your new IP address and Port here:

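A sketch of that setup cell; the variable names are hypothetical placeholders for your instance's address:

```python
import requests

# Fill in the IP address and external port from the instance card
IP_ADDRESS = "<Instance-IP-Address>"
PORT = "<Port>"
BASE_URL = f"http://{IP_ADDRESS}:{PORT}"
```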

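Then, continuing from the cell above, a sketch of a rerank request against Infinity's /rerank route; the query and documents are made-up examples, and the model field must match the reranker you deployed:

```python
rerank_payload = {
    "model": "mixedbread-ai/mxbai-rerank-xsmall-v1",
    "query": "Where is Munich?",
    "documents": [
        "Munich is in southern Germany.",
        "The sky is blue today.",
        "Bavaria's capital hosts Oktoberfest every year.",
    ],
}

# Scores come back ordered from most to least relevant document
response = requests.post(f"{BASE_URL}/rerank", json=rerank_payload)
print(response.json())
```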


We can see from the output of the cell that it returns a list of JSON objects, one per document, ordered from most to least relevant. In this case, the first entry in the list had a relevance score of 0.74, meaning that it "won" the ranking of samples for this query.

And we'll now query the classification model:

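A sketch of the classification call against Infinity's /classify route, reusing the same base URL; the input sentence is an arbitrary example:

```python
classify_payload = {
    "model": "SamLowe/roberta-base-go_emotions",
    "input": ["I just got my first GPU cluster running and I couldn't be happier!"],
}

# Returns label/score pairs for each input string
response = requests.post(f"{BASE_URL}/classify", json=classify_payload)
print(response.json())
```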


We can see from this that the most likely emotion from this model's choices was "joy".

So there you have it: with Vast and Infinity, you can serve embedding, re-ranking, and classification models, all from just one GPU, on the most affordable compute on the market.