Pre-built Templates
Text Generation Inference (TGI)
The Text Generation Inference (TGI) serverless template can be used to run inference for LLMs on Vast GPU instances. This page documents the required environment variables and endpoints to get started. A full {{pyworker}} and client implementation can be found here.

Environment Variables

- HF_TOKEN (string): Hugging Face API token with read permissions, used to download gated models. Read more about Hugging Face tokens here.
- MODEL_ID (string): ID of the model to be used for inference. Supported Hugging Face models are listed at https://huggingface.co/docs/text-generation-inference/en/supported_models. Some models on Hugging Face require the user to accept their terms and conditions before use; for such models, this must be done on your Hugging Face account before using the model with a Vast template.

Endpoints

/generate/

Generates the LLM's response to a given prompt in a single request.

Inputs

- auth_data
  - signature (string): A cryptographic string that authenticates the url, cost, and reqnum fields in the response, proving they originated from the server. Clients can use this signature, along with the server's public key, to verify that these specific details have not been tampered with.
  - cost (float): The estimated compute resources for the request. The units of this cost are defined by the {{pyworker}}.
  - endpoint (string): Name of the endpoint.
  - reqnum (int): The request number corresponding to this worker instance. Workers expect to receive requests in approximately the same order as these reqnums, but some flexibility is allowed to account for out-of-order requests caused by concurrency or small delays on the proxy server.
  - url (string): The address of the worker instance to send the request to.
- payload
  - inputs (string): The prompt message to be used as the input for the LLM.
  - parameters
    - max_new_tokens (int): The maximum number of tokens the model will generate in response to the input.

```json
{
  "auth_data": {
    "signature": "a_base64_encoded_signature_string_from_route_endpoint",
    "cost": 256,
    "endpoint": "your-tgi-endpoint-name",
    "reqnum": 1234567890,
    "url": "http://worker-ip-address:port"
  },
  "payload": {
    "inputs": "What is the answer to the universe?",
    "parameters": {
      "max_new_tokens": 256
    }
  }
}
```

Depending on the model being used, other parameters such as 'temperature' or 'top_p' may be supported. Passing these values in parameters forwards them to the model, but they are not required.

Outputs

- generated_text (string): The model's response to the input prompt.

```json
[
  {
    "generated_text": "The model's response."
  }
]
```
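For illustration, the sketch below shows one way a client might call /generate/ from Python with the requests library. This is a minimal sketch, not the reference client: the auth_data values and worker URL are placeholders, and in a real deployment auth_data is obtained from the serverless route endpoint and forwarded to the worker unchanged.

```python
import requests

# Placeholder values: in practice, auth_data comes from the serverless
# route endpoint and must be forwarded to the worker without modification.
auth_data = {
    "signature": "signature-from-route-endpoint",
    "cost": 256,
    "endpoint": "your-tgi-endpoint-name",
    "reqnum": 1234567890,
    "url": "http://worker-ip-address:port",
}

body = {
    "auth_data": auth_data,
    "payload": {
        "inputs": "What is the answer to the universe?",
        "parameters": {"max_new_tokens": 256},
    },
}

# POST the request to the worker address returned by the route endpoint.
response = requests.post(f"{auth_data['url']}/generate/", json=body, timeout=120)
response.raise_for_status()

# The worker returns a list containing a single generated_text entry.
print(response.json()[0]["generated_text"])
```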
/generate_stream/

Generates and streams the LLM's response token by token.

Inputs

/generate_stream/ takes the same inputs as /generate/.

- auth_data
  - signature (string): A cryptographic string that authenticates the url, cost, and reqnum fields in the response, proving they originated from the server. Clients can use this signature, along with the server's public key, to verify that these specific details have not been tampered with.
  - cost (float): The estimated compute resources for the request. The units of this cost are defined by the {{pyworker}}.
  - endpoint (string): Name of the endpoint.
  - reqnum (int): The request number corresponding to this worker instance. Workers expect to receive requests in approximately the same order as these reqnums, but some flexibility is allowed to account for out-of-order requests caused by concurrency or small delays on the proxy server.
  - url (string): The address of the worker instance to send the request to.
- payload
  - inputs (string): The prompt message to be used as the input for the LLM.
  - parameters
    - max_new_tokens (int): The maximum number of tokens the model will generate in response to the input. The max_new_tokens parameter, rather than the prompt size, drives request latency: if an instance is benchmarked to process 100 tokens per second, a request with max_new_tokens = 200 will take approximately 2 seconds to complete.

```json
{
  "auth_data": {
    "signature": "a_base64_encoded_signature_string_from_route_endpoint",
    "cost": 256,
    "endpoint": "your-tgi-endpoint-name",
    "reqnum": 1234567890,
    "url": "http://worker-ip-address:port"
  },
  "payload": {
    "inputs": "What is the answer to the universe?",
    "parameters": {
      "max_new_tokens": 256
    }
  }
}
```

Outputs

/generate_stream/ outputs are a stream of server-sent events (SSE), with each event looking like:

```json
{
  "token": {
    "id": 123,           // token ID
    "text": "hello",     // the actual text of the token
    "logprob": -0.12345, // log probability of the token
    "special": false     // whether it is a special token (e.g., EOS)
  }
}
```
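As a rough sketch (not the reference client implementation), the events can be consumed from Python by streaming the response and parsing each "data:" line. The worker URL and auth_data values below are placeholders; in a real deployment they come from the route endpoint as described above.

```python
import json
import requests

# Placeholder request body; auth_data must come from the route endpoint in practice.
body = {
    "auth_data": {
        "signature": "signature-from-route-endpoint",
        "cost": 256,
        "endpoint": "your-tgi-endpoint-name",
        "reqnum": 1234567890,
        "url": "http://worker-ip-address:port",
    },
    "payload": {
        "inputs": "What is the answer to the universe?",
        "parameters": {"max_new_tokens": 256},
    },
}

with requests.post(f"{body['auth_data']['url']}/generate_stream/",
                   json=body, stream=True, timeout=120) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        # SSE events arrive as lines of the form "data: {...}"; skip keep-alive lines.
        if not line or not line.startswith(b"data:"):
            continue
        event = json.loads(line[len(b"data:"):])
        # Print each generated token as it arrives.
        print(event["token"]["text"], end="", flush=True)
```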