At its core, a PyWorker is a `worker.py` file that constructs a `Worker` from a `WorkerConfig`.
By the end of this document you will understand:
- What a PyWorker does at a high level
- How `worker.py` is launched in the serverless environment
- How to configure `WorkerConfig`, `HandlerConfig`, `BenchmarkConfig`, and `LogActionConfig`
- How request parsing, response generation, workload calculation, and queueing work
- How to adapt existing “legacy” PyWorkers if you have them
This page assumes you already know how to create a Serverless Endpoint and Worker Group. It focuses only on defining `worker.py`. See the Serverless Endpoint documentation for how to create endpoints and worker groups.

Vast publishes pre-made templates with PyWorkers already wired up. Before writing your own `worker.py`, check the templates in the documentation and control panel; they may already cover your use case.

How PyWorkers and worker.py fit into Serverless
On each worker instance:

- The start-server script (provided by the template) runs. It is responsible for:
  - Cloning your repository from `PYWORKER_REPO`
  - Installing Python dependencies from `requirements.txt`
  - Starting your model server (e.g. vLLM)
  - Running `python worker.py`
- `worker.py`:
  - Builds a `WorkerConfig` describing:
    - How to reach your model server (`model_server_url`, `model_server_port`, `model_log_file`)
    - Which HTTP routes the worker should handle (`handlers`)
    - How to detect model readiness and errors (`log_action_config`)
  - Constructs `Worker(worker_config)`
  - Calls `Worker.run()`, which:
    - Creates a backend object
    - Attaches handlers for each configured route
    - Starts an HTTP server using `aiohttp`
- The serverless engine:
  - Watches:
    - Logs from your model (via `model_log_file` + `LogActionConfig`)
    - Benchmarks (via `BenchmarkConfig`)
    - Request workloads and success/error metrics
  - Uses this information to right-size your hot (running) and cold (stopped) capacity based on current and predicted workload.
What a PyWorker actually does
Conceptually, a PyWorker's responsibilities are:

- Ingress proxy
  - Receive HTTP requests from the Vast serverless router on routes you define (e.g. `/v1/completions`, `/generate`).
  - Optionally transform and validate request bodies.
- Workload tracking
  - For each request, compute a workload.
  - Workload is a floating-point number chosen by you:
    - For LLMs, this is typically "number of tokens" (prompt + max output).
    - For other workloads, it can be a constant 1 per request or any cost metric that correlates with compute usage.
- Forwarding to the model server
  - Forward the transformed payload to your model server at `model_server_url:model_server_port`.
  - Handle FIFO queueing if your backend cannot process multiple requests in parallel.
- Returning responses
  - Optionally transform or wrap model responses.
  - Support both standard JSON responses and streaming (SSE, NDJSON, chunked) responses.
- Readiness, failure, and benchmarking
  - Watch your model's log file:
    - Detect "model loaded" lines (`on_load`)
    - Detect "model error" lines (`on_error`)
  - After a load signal, run benchmarks on one of your routes.
  - Report effective throughput so the serverless engine can size capacity.
The worker.py structure
A PyWorker is usually a single file, `worker.py`, that:
- Imports the public configuration types.
- Defines any helper functions (benchmark payload generators, request parsers, response generators, workload calculators).
- Constructs a `WorkerConfig` and passes it to `Worker`.
- Runs the worker.
WorkerConfig: configuring the model backend
`WorkerConfig` tells the PyWorker how to talk to your model server and which routes to expose.
Typical usage:
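The sketch below shows typical usage. The field names follow this page, but the import paths and the specific values are assumptions; adjust them to your SDK and template:

```python
# Import paths are an assumption; use the ones provided by your PyWorker SDK.
from lib.data_types import WorkerConfig, HandlerConfig, LogActionConfig
from lib.worker import Worker

worker_config = WorkerConfig(
    model_server_url="http://127.0.0.1",      # where the model server listens
    model_server_port=18000,
    model_log_file="/var/log/model.log",      # log file the PyWorker tails
    log_action_config=LogActionConfig(
        on_load=["Application startup complete"],           # illustrative prefixes; see LogActionConfig below
        on_error=["Traceback (most recent call last):"],
    ),
    handlers=[
        # Exactly one handler must also carry a BenchmarkConfig (see below).
        HandlerConfig(route="/v1/completions"),
    ],
)

if __name__ == "__main__":
    Worker(worker_config).run()
```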
Required fields
- `model_server_url: str`: Base URL where your model server is listening (e.g. `"http://127.0.0.1"`).
- `model_server_port: int`: Port of the model server (e.g. `18000`).
- `model_log_file: str`: Path to the model's log file on disk. The PyWorker tails this file to:
  - Detect when the model has loaded (`on_load`)
  - Detect unrecoverable errors (`on_error`)
  - Report informative events (`on_info`)
- `handlers: list[HandlerConfig]`: One `HandlerConfig` per HTTP route your PyWorker should expose.
LogActionConfig: mapping log lines to state changes
`LogActionConfig` is where you teach PyWorker how to interpret log lines from your model server:
- Matching is prefix-based and case-sensitive:
  - A log line is considered a match if it starts with one of your strings exactly.
- `on_load`:
  - On the first match of any `on_load` prefix, the worker knows the model is "loaded" and can begin benchmarking.
- `on_error`:
  - On the first match, the worker goes into an errored state.
  - The serverless engine will treat this as a failed worker and trigger a restart.
- `on_info`:
  - Used for metrics and observability only; matches do not change worker state.
- The file at `model_log_file` should contain logs for the current run of the worker, not the entire machine lifetime.
- The template should rotate logs per worker start so the PyWorker is not tailing stale history.
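For example, a configuration like the following marks the model loaded on a startup message and errored on a Python traceback (the prefixes are illustrative; copy them verbatim from the log lines your model server actually prints):

```python
# Matching is an exact, case-sensitive prefix match, so these strings must
# appear at the very start of the corresponding log lines.
log_action_config = LogActionConfig(
    on_load=["INFO:     Application startup complete."],   # model ready; benchmarking can begin
    on_error=[
        "Traceback (most recent call last):",               # unrecoverable failure; worker restarts
        "torch.cuda.OutOfMemoryError",
    ],
    on_info=["INFO:     Avg prompt throughput"],            # observability only; no state change
)
```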
HandlerConfig: configuring routes and per-endpoint behavior
Each `HandlerConfig` describes how a single HTTP route behaves:
- Which path it handles (e.g. `/v1/completions`)
- Whether requests are processed in parallel or serialized
- How to compute workload from a request
- How to generate benchmark payloads for this route
- Optional hooks for parsing requests and generating responses
- Optional legacy integration with existing `EndpointHandler` / `ApiPayload` classes
Route and basic queueing
- `route: str`: Path to expose on the PyWorker HTTP server. For example:
  - `/v1/completions`
  - `/v1/chat/completions`
  - `/generate`
- `allow_parallel_requests: bool`: Controls whether the PyWorker performs internal queueing:
  - `False` (default):
    - PyWorker enforces strict FIFO queueing to the model server.
    - At most one in-flight request is sent to the model backend at a time for this handler.
    - This is appropriate when the model server itself is single-threaded or cannot handle parallel requests.
  - `True`:
    - PyWorker forwards requests directly and lets the model backend or serverless engine handle parallelism.
    - Use this for backends that support parallel processing (e.g. vLLM).
- `max_queue_time: float | None`: Maximum time (in seconds) a request is allowed to remain queued inside the PyWorker before being processed.
  - If a queued request waits longer than `max_queue_time`:
    - PyWorker responds to the client with HTTP 429 (Too Many Requests).
    - The error is recorded in metrics and logs.
    - The client SDK will automatically retry your request later.
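For example, two handlers with different queueing behavior (values are illustrative):

```python
# A backend that cannot process requests in parallel: PyWorker queues requests
# FIFO and rejects anything that waits longer than 10 seconds with HTTP 429.
single_threaded_handler = HandlerConfig(
    route="/generate",
    allow_parallel_requests=False,
    max_queue_time=10.0,
)

# A backend such as vLLM that batches requests itself: forward immediately.
parallel_handler = HandlerConfig(
    route="/v1/completions",
    allow_parallel_requests=True,
)
```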
Workload calculation
- `workload_calculator: Callable[[dict], float] | None`: Defines how much workload (a float) this request represents. This is the key input to autoscaling.
  - Input: a dict representing the model payload (the same dict forwarded to your model server).
  - Output: a `float` representing workload; larger means "more expensive."
  - Behavior on errors: if `workload_calculator` raises an exception, the request fails; PyWorker logs the error and returns HTTP 500 to the client.
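A minimal calculator for an LLM route might estimate workload from the prompt length plus `max_tokens` (a sketch; tune the heuristic to your traffic):

```python
def completion_workload(payload: dict) -> float:
    # Rough token estimate: about 4 characters per prompt token, plus the
    # maximum number of output tokens the request is allowed to generate.
    prompt_tokens = len(payload.get("prompt", "")) / 4
    max_output_tokens = payload.get("max_tokens", 16)
    return float(prompt_tokens + max_output_tokens)

# Attached via: HandlerConfig(route="/v1/completions", workload_calculator=completion_workload, ...)
```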
Request parsing: request_parser
- `request_parser: Callable[[dict], dict] | None`: Optional hook to transform the incoming JSON request into the payload that will be forwarded to the model backend. Key points:
  - Input: the raw JSON body received by PyWorker (already parsed into a dict).
  - Output: a dict representing the model payload. PyWorker then uses this dict as the internal payload and forwards it to your model server as JSON.
- Simple pass-through (no parser):
  - If you do not provide `request_parser`, PyWorker forwards the incoming JSON as-is to the model backend.
  - The same dict is used for workload calculations.
- Shape transformation:
  - Translate the "public API" shape into the "backend API" shape (see the first sketch after this list).
- Validation and light on-request hooks:
  - Validate fields and, if needed, mutate the dict in place (see the second sketch after this list).
- Any exception raised in `request_parser`:
  - Is logged for the instance.
  - Marks the request as errored.
  - The client receives HTTP 500.
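Two sketches of a `request_parser`: one reshaping a public payload into the backend's shape, and one validating and clamping fields in place (all field names are illustrative):

```python
def reshape_request(body: dict) -> dict:
    # Translate a hypothetical public shape {"text": ..., "length": ...}
    # into a backend shape {"prompt": ..., "max_tokens": ...}.
    return {
        "prompt": body["text"],
        "max_tokens": body.get("length", 256),
        "temperature": body.get("temperature", 0.7),
    }

def validate_request(body: dict) -> dict:
    # Validate and clamp fields; raising here is logged and the client gets HTTP 500.
    if "prompt" not in body:
        raise ValueError("missing required field: prompt")
    body["max_tokens"] = min(int(body.get("max_tokens", 256)), 4096)
    return body
```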
Response handling: response_generator
- `response_generator: Callable[[web.Request, ClientResponse], Awaitable[web.StreamResponse | web.Response]] | None`: Optional hook to transform the model server response into the final client response.
  - Input:
    - `client_request`: the original `aiohttp.web.Request` from the client.
    - `model_response`: the `aiohttp.ClientResponse` from the model server.
  - Output: an `aiohttp.web.Response` or `aiohttp.web.StreamResponse`.
  - Behavior:
    - If you define `response_generator`, PyWorker calls it and uses the result directly.
    - If your `response_generator` raises an exception, PyWorker logs the error and the client receives HTTP 500.
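A sketch of a non-streaming `response_generator` that unwraps the backend's JSON and returns a smaller body to the client (the backend response shape is an assumption):

```python
from aiohttp import web, ClientResponse

async def text_only_response(client_request: web.Request,
                             model_response: ClientResponse) -> web.Response:
    # Read the backend's JSON body and return only the generated text.
    data = await model_response.json()
    text = data["choices"][0]["text"]   # assumes an OpenAI-style completions body
    return web.json_response({"text": text}, status=model_response.status)
```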
Default response behavior (no response_generator)
If you do not specify a `response_generator`, PyWorker provides a reasonable default:
- It detects streaming responses based on:
  - `Content-Type` starting with `text/event-stream`
  - `Content-Type` equal to `application/x-ndjson` or `application/jsonl`
  - `Content-Type` containing `"stream"` (case-insensitive)
  - `Transfer-Encoding: chunked`
- If the response is streaming:
  - PyWorker creates a `web.StreamResponse`.
  - Copies the appropriate `content_type`.
  - Streams chunks from the model server to the client as they arrive.
- If the response is not streaming:
  - PyWorker reads the full body from `model_response`.
  - Returns a `web.Response` with:
    - The same status code.
    - The same `Content-Type`.
    - All headers except `Content-Type` (which is set directly).
BenchmarkConfig: measuring performance
Benchmarks run once the worker detects a model load signal via `on_load`. They are central to how the serverless engine learns the capacity of each worker.

A `BenchmarkConfig` is attached to exactly one handler:
- You must configure exactly one `HandlerConfig` with a `BenchmarkConfig`.
  - PyWorker enforces that only one handler can be the benchmark handler.
- Benchmarks start when PyWorker sees an `on_load` log line from your model; it then runs the benchmark on the handler with the `BenchmarkConfig`.
- The worker becomes ready only after the benchmark finishes successfully.
- If benchmark runs fail (e.g. errors, timeouts), the worker is treated as errored and will be restarted by the serverless engine.
Benchmark payloads
You can provide benchmark payloads via:
- `dataset: list[dict]`: A literal list of payloads. PyWorker selects entries (e.g. at random) to send to the model server.
- `generator: Callable[[], dict]`: A function that returns one payload dict each time it is called.
- Pick one of `dataset` or `generator` (do not rely on precedence between them).
- Make benchmark payloads representative of your "typical" requests:
  - If most traffic is small, do not benchmark only with huge prompts.
  - If traffic is mixed, choose a representative distribution.
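For example, a generator-based configuration that builds prompts resembling typical traffic (values are illustrative; `runs` and `concurrency` are covered in the next subsection):

```python
import random

def benchmark_payload() -> dict:
    # Build a payload that looks like typical production traffic.
    words = ["serverless", "gpu", "inference", "latency", "throughput"]
    prompt = " ".join(random.choices(words, k=random.randint(20, 200)))
    return {"prompt": prompt, "max_tokens": 256}

benchmark_config = BenchmarkConfig(
    generator=benchmark_payload,   # alternatively pass dataset=[...] (pick one, not both)
    runs=4,
    concurrency=8,
)
```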
Runs and concurrency
- `runs: int`: Number of benchmark rounds.
- `concurrency: int`: Number of concurrent requests per run if `allow_parallel_requests=True`.
  - If `allow_parallel_requests=False`, effective concurrency is clamped; your backend will process benchmark requests serially despite a larger `concurrency` value.
Autoscaling and workload (conceptual overview)
PyWorker does not expose the full autoscaling algorithm, but conceptually:
- Each request is assigned a workload (a float) by your `workload_calculator`.
- Benchmarks estimate how many units of workload per second a worker can handle on a given handler.
- At runtime, the serverless engine:
  - Tracks workload being requested by clients.
  - Tracks workload being processed by each worker.
  - Adjusts:
    - Hot capacity (running workers ready to serve)
    - Cold capacity (stopped workers that can be started quickly)
  - The goal is to right-size capacity to match current and predicted workload.
Typical workload choices:
- For LLMs: workload ≈ prompt tokens + expected output tokens (or just `max_tokens` as a simpler proxy).
- For other workloads: a constant workload per request (e.g. `100.0`), so effective capacity is "requests per second".
Example: vLLM-style worker.py
Below is a complete `worker.py` for a vLLM-style model server that exposes:
- `/v1/completions`
- `/v1/chat/completions`

It is configured to:
- Treat `max_tokens` as the workload metric.
- Allow parallel requests.
- Use a benchmark generator that builds random prompts.
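The sketch below reconstructs what such a file could look like. The import paths, the exact `on_load`/`on_error` prefixes, and the `benchmark_config` argument name are assumptions; the other fields follow the sections above.

```python
"""worker.py for a vLLM-style model server (sketch; adjust paths and log prefixes)."""
import random

# Import paths are an assumption; use the ones provided by your PyWorker SDK.
from lib.data_types import WorkerConfig, HandlerConfig, BenchmarkConfig, LogActionConfig
from lib.worker import Worker


def max_tokens_workload(payload: dict) -> float:
    """Use max_tokens as the workload metric for both routes."""
    return float(payload.get("max_tokens", 16))


def benchmark_payload() -> dict:
    """Generate a random prompt of realistic length for benchmarking."""
    words = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
    prompt = " ".join(random.choices(words, k=random.randint(50, 500)))
    return {"prompt": prompt, "max_tokens": 256}


worker_config = WorkerConfig(
    model_server_url="http://127.0.0.1",
    model_server_port=18000,
    model_log_file="/var/log/vllm.log",
    log_action_config=LogActionConfig(
        # Prefixes must match your vLLM log lines exactly (prefix match, case-sensitive).
        on_load=["INFO:     Application startup complete."],
        on_error=["Traceback (most recent call last):"],
    ),
    handlers=[
        HandlerConfig(
            route="/v1/completions",
            allow_parallel_requests=True,            # vLLM batches requests itself
            workload_calculator=max_tokens_workload,
            benchmark_config=BenchmarkConfig(        # exactly one handler carries the benchmark
                generator=benchmark_payload,
                runs=4,
                concurrency=16,
            ),
        ),
        HandlerConfig(
            route="/v1/chat/completions",
            allow_parallel_requests=True,
            workload_calculator=max_tokens_workload,
        ),
    ],
)

if __name__ == "__main__":
    Worker(worker_config).run()
```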
How requests and responses behave end-to-end
Putting the pieces together, a typical request/response flow looks like this:

1. The client calls your Serverless Endpoint on one of your routes, e.g. `POST /v1/completions` with a JSON body (e.g. a prompt plus `max_tokens`).
2. The Serverless router forwards this to the appropriate PyWorker instance's `/v1/completions` route.
3. The `HandlerConfig` for `/v1/completions`:
   - Optionally runs `request_parser` (if configured) to transform the request.
   - Runs `workload_calculator` to compute workload.
   - Either:
     - Queues the request (FIFO) if `allow_parallel_requests=False`, or
     - Forwards it immediately to the model backend if `True`.
4. PyWorker sends the request payload (as JSON) to your model server at `model_server_url:model_server_port`.
5. When the model responds:
   - If you defined `response_generator`, PyWorker calls it and returns its result.
   - Otherwise, PyWorker:
     - Detects whether the response is streaming or not.
     - Either pipes the stream to the client or returns a standard JSON response.
6. Any exceptions in parsing, forwarding, or response handling:
   - Are logged in the worker's logs.
   - Produce an HTTP 500 response to the client.
Legacy support: existing EndpointHandler / ApiPayload implementations
If you have existing PyWorkers implemented using the older pattern (`server.py`, `data_types.py`, `EndpointHandler`, `ApiPayload`), you can still run them under the new `Worker` abstraction by using two escape hatches in `HandlerConfig`:
- `handler_class: Type[EndpointHandler]`
- `payload_class: Type[ApiPayload]`

When `handler_class` is provided:
- PyWorker instantiates your `EndpointHandler` directly.
- The factory does not apply other `HandlerConfig` fields to it.
- Queueing, workload calculation, and payload handling are all controlled by your legacy class.

This mechanism exists primarily for backward compatibility:
- It lets you keep old workers running while Vast evolves the SDK.
- For new projects, we strongly recommend using the modern `WorkerConfig` + `HandlerConfig` + `BenchmarkConfig` + `LogActionConfig` approach rather than implementing `EndpointHandler` and `ApiPayload` directly.
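For completeness, wiring a legacy handler in looks like this (the class names are placeholders for your existing implementations):

```python
# Reuse an existing legacy handler unchanged; other HandlerConfig fields are ignored for it.
legacy_handler = HandlerConfig(
    route="/generate",
    handler_class=MyLegacyEndpointHandler,   # your existing EndpointHandler subclass
    payload_class=MyLegacyPayload,           # your existing ApiPayload subclass
)
```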
Linking worker.py to your Serverless Endpoint
Finally, to make Vast actually use your `worker.py`:
- Put `worker.py` and `requirements.txt` at the root of a public Git repository.
- In your Serverless template configuration:
  - Set the environment variable `PYWORKER_REPO` to that Git repo URL.
- The start-server script on each worker will:
  - Clone `PYWORKER_REPO`.
  - Install `requirements.txt`.
  - Start your model server.
  - Run `python worker.py`.
- Your worker instances will:
  - Tail the model log file.
  - Wait for `on_load` logs.
  - Run benchmarks on the configured benchmark handler.
  - Join the ready pool once benchmarking completes successfully.
From there, the serverless engine routes traffic and scales capacity based on the behavior defined in your `worker.py` implementation.