Environment Variables
HF_TOKEN
(string): HuggingFace API token with read permissions, used to download gated models. Read more about HuggingFace tokens here.
MODEL_ID
(string): ID of the model to be used for inference. Supported HuggingFace models are shown here.
Some models on HuggingFace require the user to accept their terms and conditions before use. For such models, this must be done on your HuggingFace account before the model can be used with a Vast template.
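For illustration, a minimal sketch of the two variables as a key-value mapping (the token value is a placeholder, and the model ID is just one example of a valid HuggingFace model identifier):

```json
{
  "HF_TOKEN": "hf_xxxxxxxxxxxxxxxxxxxx",
  "MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.2"
}
```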
Endpoints
/generate/
Generates the LLM’s response to a given prompt in a single request.
Inputs
auth_data:
signature
(string): A cryptographic string that authenticates the url, cost, and reqnum fields in the response, proving they originated from the server. Clients can use this signature, along with the server’s public key, to verify that these specific details have not been tampered with.
cost
(float): The estimated compute resources for the request. The units of this cost are defined by the PyWorker.
endpoint
(string): Name of the endpoint.
reqnum
(int): The request number corresponding to this worker instance. Workers expect to receive requests in approximately the same order as these reqnums, but some flexibility is allowed, since concurrency or small delays on the proxy server can cause requests to arrive out of order.
url
(string): The address of the worker instance to send the request to.
inputs
(string): The prompt message to be used as the input for the LLM.
parameters:
max_new_tokens
(int): The maximum number of tokens the model will generate for the response to the input.
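A sketch of a full request body, assuming the worker expects the auth_data fields above alongside a payload object (all values are placeholders):

```json
{
  "auth_data": {
    "signature": "a1b2c3d4...",
    "cost": 256.0,
    "endpoint": "/generate/",
    "reqnum": 1234567890,
    "url": "http://worker-ip-address:port"
  },
  "payload": {
    "inputs": "What is the answer to the universe?",
    "parameters": {
      "max_new_tokens": 256
    }
  }
}
```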
Any values set in parameters are forwarded to the model, but none are required.
Outputs
generated_text
(string): The model’s response to the input prompt.
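A sketch of the corresponding response body (the generated text shown is illustrative):

```json
{
  "generated_text": "The answer to the universe is 42."
}
```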
/generate_stream/
Generates and streams the LLM’s response token by token.
Inputs
/generate_stream/ takes the same inputs as /generate/:
auth_data:
signature
(string): A cryptographic string that authenticates the url, cost, and reqnum fields in the response, proving they originated from the server. Clients can use this signature, along with the server’s public key, to verify that these specific details have not been tampered with.
cost
(float): The estimated compute resources for the request. The units of this cost are defined by the PyWorker.
endpoint
(string): Name of the endpoint.
reqnum
(int): The request number corresponding to this worker instance. Workers expect to receive requests in approximately the same order as these reqnums, but some flexibility is allowed, since concurrency or small delays on the proxy server can cause requests to arrive out of order.
url
(string): The address of the worker instance to send the request to.
inputs
(string): The prompt message to be used as the input for the LLM.
parameters:
max_new_tokens
(int): The maximum number of tokens the model will generate for the response to the input.
The max_new_tokens parameter, rather than the prompt size, impacts performance. For example, if an instance is benchmarked to process 100 tokens per second, a request with max_new_tokens = 200 will take approximately 2 seconds to complete.
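A sketch of a streaming request body; it mirrors the /generate/ example above with the endpoint name changed (all values are placeholders, and the payload wrapper is an assumption):

```json
{
  "auth_data": {
    "signature": "a1b2c3d4...",
    "cost": 256.0,
    "endpoint": "/generate_stream/",
    "reqnum": 1234567891,
    "url": "http://worker-ip-address:port"
  },
  "payload": {
    "inputs": "Tell me a short story about a robot.",
    "parameters": {
      "max_new_tokens": 200
    }
  }
}
```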
Outputs
/generate_stream/ outputs a stream of Server-Sent Events (SSE), with each event looking like:
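A sketch of a single event, assuming the text-generation-inference streaming format (field names and values are illustrative; intermediate events carry generated_text: null, while the final event carries the full generated text):

```json
data: {"token": {"id": 264, "text": " the", "logprob": -0.05, "special": false}, "generated_text": null, "details": null}
```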