The Text Generation Inference (TGI) serverless template can be used to run LLM inference on Vast GPU instances. This page documents the required environment variables and the endpoints needed to get started. A full PyWorker and Client implementation can be found here.

Environment Variables

  • HF_TOKEN (string): HuggingFace API token with read permissions, used to download gated models. Read more about HuggingFace tokens here.
  • MODEL_ID (string): ID of the model to be used for inference. Supported HuggingFace models are shown here.
Some models on HuggingFace require the user to accept terms and conditions on their HuggingFace account before use. For such models, this must be done before they can be used with a Vast template.

Endpoints

/generate/

Generates the LLM’s response to a given prompt in a single request.

Inputs

Auth_data:
  • signature(string): A cryptographic string that authenticates the url, cost, and reqnum fields in the response, proving they originated from the server. Clients can use this signature, along with the server’s public key, to verify that these specific details have not been tampered with.
  • cost(float): The estimated compute resources for the request. The units of this cost are defined by the PyWorker.
  • endpoint(string): Name of the Endpoint.
  • reqnum(int): The request number corresponding to this worker instance. Note that workers expect to receive requests in approximately the same order as these reqnums, but some flexibility is allowed due to potential out-of-order requests caused by concurrency or small delays on the proxy server.
  • url(string): The address of the worker instance to send the request to.
Payload:
  • inputs(string): The prompt message to be used as the input for the LLM.
  • parameters:
    • max_new_tokens(int): The maximum number of tokens the model will generate for the response to the input.
JSON
{
  "auth_data": {
    "signature": "a_base64_encoded_signature_string_from_route_endpoint",
    "cost": 256,
    "endpoint": "Your-TGI-Endpoint-Name",
    "reqnum": 1234567890,
    "url": "http://worker-ip-address:port"
  },
  "payload": {
    "inputs": "What is the answer to the universe?",
    "parameters": {
      "max_new_tokens": 256
    }
  }
}
Depending on the model being used, additional parameters such as temperature or top_p may be supported. Any values included in parameters are forwarded to the model, but they are not required.
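For illustration, here is a minimal client sketch for calling /generate/ with Python's requests library. It assumes auth_data has already been obtained from the route endpoint (see the PyWorker and Client implementation linked above); the URL, endpoint name, and signature values below are placeholders, not real credentials.
Python
import requests

# Placeholder auth_data; in practice these values come from the route endpoint
# (see the linked PyWorker and Client implementation).
auth_data = {
    "signature": "a_base64_encoded_signature_string_from_route_endpoint",
    "cost": 256,
    "endpoint": "Your-TGI-Endpoint-Name",
    "reqnum": 1234567890,
    "url": "http://worker-ip-address:port",
}

payload = {
    "inputs": "What is the answer to the universe?",
    "parameters": {"max_new_tokens": 256},
}

# POST the combined body to the worker's /generate/ endpoint.
response = requests.post(
    f"{auth_data['url']}/generate/",
    json={"auth_data": auth_data, "payload": payload},
    timeout=60,
)
response.raise_for_status()

# The response is a list with a single item containing the generated text.
print(response.json()[0]["generated_text"])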

Outputs

  • generated_text(string): The model response to the input prompt.
JSON
[
  {
    "generated_text": "The model's response..."
  }
]

/generate_stream/

Generates and streams the LLM’s response token by token.

Inputs

/generate_stream/ takes the same inputs as /generate/.
Auth_data:
  • signature(string): A cryptographic string that authenticates the url, cost, and reqnum fields in the response, proving they originated from the server. Clients can use this signature, along with the server’s public key, to verify that these specific details have not been tampered with.
  • cost(float): The estimated compute resources for the request. The units of this cost are defined by the PyWorker.
  • endpoint(string): Name of the Endpoint.
  • reqnum(int): The request number corresponding to this worker instance. Note that workers expect to receive requests in approximately the same order as these reqnums, but some flexibility is allowed due to potential out-of-order requests caused by concurrency or small delays on the proxy server.
  • url(string): The address of the worker instance to send the request to.
Payload:
  • inputs(string): The prompt message to be used as the input for the LLM.
  • parameters:
    • max_new_tokens(int): The maximum number of tokens the model will generate for the response to the input.
The max_new_tokens parameter, rather than the prompt size, impacts performance. For example, if an instance is benchmarked to process 100 tokens per second, a request with max_new_tokens = 200 will take approximately 2 seconds to complete.
JSON
{
  "auth_data": {
    "signature": "a_base64_encoded_signature_string_from_route_endpoint",
    "cost": 256,
    "endpoint": "Your-TGI-Endpoint-Name",
    "reqnum": 1234567890,
    "url": "http://worker-ip-address:port"
  },
  "payload": {
    "inputs": "What is the answer to the universe?",
    "parameters": {
      "max_new_tokens": 256
    }
  }
}

Outputs

/generate_stream/ outputs a stream of Server-Sent Events (SSE), with each event looking like:
JSON
{
  "token": {
    "id": 123,           // Token ID
    "text": "Hello",      // The actual text of the token
    "logprob": -0.12345,  // Log probability of the token
    "special": false      // Whether it's a special token (e.g., EOS)
  }
}
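As a rough sketch, a client can consume this stream with Python's requests library by reading the SSE lines, parsing each data: payload, and accumulating token text. As in the /generate/ example, the auth_data values are placeholders obtained from the route endpoint; check the linked client implementation for the exact flow.
Python
import json
import requests

# Placeholder auth_data; in practice these values come from the route endpoint.
auth_data = {
    "signature": "a_base64_encoded_signature_string_from_route_endpoint",
    "cost": 256,
    "endpoint": "Your-TGI-Endpoint-Name",
    "reqnum": 1234567890,
    "url": "http://worker-ip-address:port",
}

payload = {
    "inputs": "What is the answer to the universe?",
    "parameters": {"max_new_tokens": 256},
}

# Stream the response and print tokens as they arrive.
with requests.post(
    f"{auth_data['url']}/generate_stream/",
    json={"auth_data": auth_data, "payload": payload},
    stream=True,
    timeout=60,
) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line:
            continue  # skip SSE keep-alive blank lines
        decoded = line.decode("utf-8")
        if decoded.startswith("data:"):
            event = json.loads(decoded[len("data:"):].strip())
            token = event["token"]
            if not token["special"]:
                print(token["text"], end="", flush=True)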