
# Infinity Embeddings

## Serving Infinity Embeddings with Vast.ai

### Background

Infinity Embeddings is a helpful serving framework for embedding models. It is particularly great at enabling embedding, re-ranking, and classification out of the box. It supports multiple runtime frameworks to deploy on different types of GPUs while still achieving great speed. Infinity Embeddings also supports dynamic batching, which allows it to process requests faster under significant load. One of its best features is that you can deploy multiple models on the same GPU at the same time, which is particularly helpful because embedding models are often much smaller than GPU RAM. We also love that it complies with the OpenAI embeddings spec, which enables developers to quickly integrate it into their applications for RAG, clustering, classification, and re-ranking tasks.

This guide will show you how to set up Infinity Embeddings to serve an embedding model on Vast. We reference a notebook that you can use [here](https://nbviewer.org/urls/bitbucket.org/%21api/2.0/snippets/jsbcannell/ne66og/f86a1c070ddc362abc6572eb300926a0b7190ad3/files/serve_infinity_on_vast.ipynb).

### Setup

```bash
pip install --upgrade vastai
```

Once you create your account, you can go [here](https://cloud.vast.ai/cli/) to set your API key.

```bash
vastai set api-key <your-api-key-here>
```

For serving an embedding model, we're looking for a machine that has a static IP address, ports available to host on, plus a single modern GPU with decent RAM, since these embedding models will be small. We will query the Vast API to get a list of these types of machines.

```bash
vastai search offers 'compute_cap > 800 gpu_ram > 20 num_gpus = 1 static_ip=true direct_port_count > 1'
```

### Deploying the Image: Hosting a Single Embedding Model

For now, we'll host just one embedding model. The easiest way to deploy a single model on this instance is to use the command line. Copy and paste a specific instance ID you choose from the list above into `<instance-id>` below. We particularly need `v2` so that we use the correct version of the API, `--port 8000` so it serves on the correct port, and `--model-id michaelfeil/bge-small-en-v1.5` to serve the correct model.

```bash
vastai create instance <instance-id> --image michaelf34/infinity:latest --env '-p 8000:8000' --disk 40 --args v2 --model-id michaelfeil/bge-small-en-v1.5 --port 8000
```

### Connecting and Testing

Once your instance is done setting up, you should see something like this:

*(Instance view)*

To connect to your instance, we'll first need to get the IP address and port number. Click on the highlighted button to see the IP address and correct port for our requests:

*(IP address view)*

Now we'll call this with the OpenAI SDK.

```bash
pip install openai
```

Copy the IP address and the port into the cell below.

```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use Infinity's OpenAI-compatible server.
openai_api_key = "EMPTY"
openai_api_base = "http://<instance-ip-address>:<port>/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "michaelfeil/bge-small-en-v1.5"

embeddings = client.embeddings.create(model=model, input="What is Deep Learning?").data[0].embedding

print("Embeddings:")
print(embeddings)
```

Here we can see the embeddings returned by our model.
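Because the server implements the OpenAI embeddings spec, the vectors it returns can be used directly for small RAG-style experiments. Below is a minimal sketch, assuming the instance above is still running and `numpy` is installed, that embeds a few documents plus a query and ranks the documents by cosine similarity. The documents, query, and ranking logic here are illustrative additions, not part of the original guide.

```python
import numpy as np

# Reuses `client` and `model` from the cell above
# (assumption: the instance from this guide is still running).
documents = [
    "Munich is a city in Germany.",
    "Deep learning uses neural networks with many layers.",
    "The sky is blue.",
]
query = "What is deep learning?"

# Embed the documents (sent as one batch) and the query.
doc_vecs = np.array(
    [d.embedding for d in client.embeddings.create(model=model, input=documents).data]
)
query_vec = np.array(client.embeddings.create(model=model, input=query).data[0].embedding)

# Cosine similarity = dot product of L2-normalized vectors.
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query_vec /= np.linalg.norm(query_vec)
scores = doc_vecs @ query_vec

# Print documents from most to least similar to the query.
for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```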
Feel free to delete this instance, as we'll redeploy a different configuration now.

### Advanced Usage: Rerankers, Classifiers, and Multiple Models at the Same Time

The following steps will show you how to use rerankers and classifiers, and deploy them at the same time. First, we'll deploy two models on the same GPU and container: the first is a reranker and the second is a classifier. Note that, compared to the previous command, all we've done is change the value for `--model-id` and add a second `--model-id` with its own value; these represent the two different models that we're running.

```bash
vastai create instance <instance-id> --image michaelf34/infinity:latest --env '-p 8000:8000' --disk 40 --args v2 --model-id mixedbread-ai/mxbai-rerank-xsmall-v1 --model-id SamLowe/roberta-base-go_emotions --port 8000
```

Now, we'll call these models with the `requests` library, following Infinity's API spec. Add your new IP address and port here:

```python
import requests

base_url = "http://<instance-ip-address>:<port>"
rerank_url = base_url + "/rerank"

model1 = "mixedbread-ai/mxbai-rerank-xsmall-v1"

input_json = {
    "query": "Where is Munich?",
    "documents": ["Munich is in Germany.", "The sky is blue."],
    "return_documents": "false",
    "model": "mixedbread-ai/mxbai-rerank-xsmall-v1",
}

headers = {
    "accept": "application/json",
    "Content-Type": "application/json",
}

payload = {
    "query": input_json["query"],
    "documents": input_json["documents"],
    "return_documents": input_json["return_documents"],
    "model": model1,
}

response = requests.post(rerank_url, json=payload, headers=headers)

if response.status_code == 200:
    resp_json = response.json()
    print(resp_json)
else:
    print(response.status_code)
    print(response.text)
```

We can see from the output of the cell that it gives us a list of JSON objects, one per score, ordered from highest relevance. In this case, the first entry in the list had a relevancy score of 0.74, meaning that it "won" the ranking of samples for this query.

And we'll now query the classification model:

```python
classify_url = base_url + "/classify"

model2 = "SamLowe/roberta-base-go_emotions"

headers = {
    "accept": "application/json",
    "Content-Type": "application/json",
}

payload = {
    "input": ["I am feeling really happy today"],
    "model": model2,
}

response = requests.post(classify_url, json=payload, headers=headers)

if response.status_code == 200:
    resp_json = response.json()
    print(resp_json)
else:
    print(response.status_code)
    print(response.text)
```

We can see from this that the most likely emotion among this model's choices was "joy".

So there you have it! Now you can see how, with Vast and Infinity, you can serve embedding, reranking, and classifier models, all from just one GPU, on the most affordable compute on the market.
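To close the loop on the reranker: because we sent `"return_documents": "false"`, the response carries only scores and indexes. Here's a small, hedged sketch of how you might map those results back onto the documents you submitted. It assumes the rerank response JSON has a `results` list whose entries contain an `index` into the submitted documents and a `relevance_score` (matching the output shown earlier); `rank_documents` is a hypothetical helper of ours, not part of Infinity's API, and `resp_json` must hold the `/rerank` response (re-run that cell first, since the classify cell reassigned it).

```python
# Hypothetical helper (not part of Infinity's API): pair each rerank result
# with its source document. Assumes the rerank response JSON has a "results"
# list whose entries contain an "index" into the submitted documents and a
# "relevance_score", matching the output shown earlier.
def rank_documents(resp_json, documents):
    return [
        (result["relevance_score"], documents[result["index"]])
        for result in resp_json["results"]
    ]

# Results arrive sorted by relevance, so this prints the best match first.
for score, doc in rank_documents(resp_json, input_json["documents"]):
    print(f"{score:.2f}  {doc}")
```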