Vast’s PyWorker is a Python HTTP proxy that sits between the Vast serverless routing layer and your model server (e.g. vLLM, TGI, ComfyUI). The modern implementation is centered around a single worker.py file that constructs a Worker from a WorkerConfig. By the end of this document you will understand:
  • What a PyWorker does at a high level
  • How worker.py is launched in the serverless environment
  • How to configure WorkerConfig, HandlerConfig, BenchmarkConfig, and LogActionConfig
  • How request parsing, response generation, workload calculation, and queueing work
  • How to adapt existing “legacy” PyWorkers if you have them
This page assumes you already know how to create a Serverless Endpoint and Worker Group. It focuses only on defining worker.py. See the Serverless Endpoint documentation for how to create endpoints and worker groups.
Vast publishes pre-made templates with PyWorkers already wired up. Before writing your own worker.py, check the templates in the documentation and control panel; they may already cover your use case.

How PyWorkers and worker.py fit into Serverless

On each worker instance:
  1. The start-server script (provided by the template) runs. It is responsible for:
    • Cloning your repository from PYWORKER_REPO
    • Installing Python dependencies from requirements.txt
    • Starting your model server (e.g. vLLM)
    • Running python worker.py
  2. worker.py:
    • Builds a WorkerConfig describing:
      • How to reach your model server (model_server_url, model_server_port, model_log_file)
      • Which HTTP routes the worker should handle (handlers)
      • How to detect model readiness and errors (log_action_config)
    • Constructs Worker(worker_config)
    • Calls Worker.run(), which:
      • Creates a backend object
      • Attaches handlers for each configured route
      • Starts an HTTP server using aiohttp
  3. The serverless engine:
    • Watches:
      • Logs from your model (via model_log_file + LogActionConfig)
      • Benchmarks (via BenchmarkConfig)
      • Request workloads and success/error metrics
    • Uses this information to right-size your hot (running) and cold (stopped) capacity based on current and predicted workload.

What a PyWorker actually does

Conceptually, PyWorker’s responsibilities are:
  1. Ingress proxy
    • Receive HTTP requests from the Vast serverless router on routes you define (e.g. /v1/completions, /generate).
    • Optionally transform and validate request bodies.
  2. Workload tracking
    • For each request, compute a workload.
    • The workload is a floating-point number chosen by you:
      • For LLMs, this is typically “number of tokens” (prompt + max output).
      • For other workloads, it can be “constant 1 per request” or any cost metric that correlates with compute usage.
  3. Forwarding to model server
    • Forward the transformed payload to your model server at model_server_url:model_server_port.
    • Handle FIFO queueing if your backend cannot process multiple requests in parallel.
  4. Returning responses
    • Optionally transform or wrap model responses.
    • Support both standard JSON responses and streaming (SSE, NDJSON, chunked) responses.
  5. Readiness, failure, and benchmarking
    • Watch your model’s log file:
      • Detect “model loaded” lines (on_load)
      • Detect “model error” lines (on_error)
    • After a load signal, run benchmarks on one of your routes.
    • Report effective throughput so the serverless engine can size capacity.

The worker.py structure

A PyWorker is usually a single file, worker.py, that:
  1. Imports the public configuration types:
from vastai import (
    Worker,
    WorkerConfig,
    HandlerConfig,
    BenchmarkConfig,
    LogActionConfig,
)
  2. Defines any helper functions (benchmark payload generators, request parsers, response generators, workload calculators).
  3. Constructs a WorkerConfig and passes it to Worker.
  4. Runs the worker:
Worker(worker_config).run()
That’s the entire required structure.

WorkerConfig: configuring the model backend

WorkerConfig tells the PyWorker how to talk to your model server and which routes to expose. Typical usage:
from vastai import Worker, WorkerConfig, HandlerConfig, BenchmarkConfig, LogActionConfig

MODEL_SERVER_URL  = "http://127.0.0.1"
MODEL_SERVER_PORT = 18000
MODEL_LOG_FILE    = "/var/log/model/server.log"

worker_config = WorkerConfig(
    # --- Model config ---
    model_server_url=MODEL_SERVER_URL,
    model_server_port=MODEL_SERVER_PORT,
    model_log_file=MODEL_LOG_FILE,

    # --- Route handlers ---
    handlers=[
        # HandlerConfig(...) entries – see next section
    ],

    # --- Log actions ---
    log_action_config=LogActionConfig(
        on_load=[
            "Application startup complete.",
        ],
        on_error=[
            "RuntimeError: Engine",
            "Traceback (most recent call last):",
        ],
        on_info=[
            '"message":"Download',
        ],
    ),
)

Worker(worker_config).run()

Required fields

  • model_server_url: str Base URL where your model server is listening (e.g. "http://127.0.0.1").
  • model_server_port: int Port of the model server (e.g. 18000).
  • model_log_file: str Path to the model’s log file on disk. The PyWorker tails this file to:
    • Detect when the model has loaded (on_load)
    • Detect unrecoverable errors (on_error)
    • Report informative events (on_info)
  • handlers: list[HandlerConfig] One HandlerConfig per HTTP route your PyWorker should expose.

LogActionConfig: mapping log lines to state changes

LogActionConfig is where you teach PyWorker how to interpret log lines from your model server:
from vastai import LogActionConfig

log_action_config = LogActionConfig(
    on_load=[
        # Prefixes that indicate the model is fully loaded and ready
        "Application startup complete.",
    ],
    on_error=[
        # Prefixes that indicate irrecoverable failures
        "INFO exited: vllm",
        "RuntimeError: Engine",
        "Traceback (most recent call last):",
    ],
    on_info=[
        # Prefixes for useful “informational only” logs
        '"message":"Download',
    ],
)
Key semantics:
  • Matching is prefix-based and case-sensitive:
    • A log line is considered a match if it starts with one of your strings exactly.
  • on_load:
    • On the first match of any on_load prefix, the worker knows the model is “loaded” and can begin benchmarking.
  • on_error:
    • On the first match, the worker goes into an errored state.
    • The serverless engine will treat this as a failed worker and trigger a restart.
  • on_info:
    • Used for metrics and observability only; they do not change worker state.
Log file expectations:
  • The file at model_log_file should contain logs for the current run of the worker, not the entire machine lifetime.
  • The template should rotate logs per worker start so the PyWorker is not tailing stale history.

HandlerConfig: configuring routes and per-endpoint behavior

Each HandlerConfig describes how a single HTTP route behaves:
  • Which path it handles (e.g. /v1/completions)
  • Whether requests are processed in parallel or serialized
  • How to compute workload from a request
  • How to generate benchmark payloads for this route
  • Optional hooks for parsing requests and generating responses
  • Optional legacy integration with existing EndpointHandler/ApiPayload classes
A minimal handler:
from vastai import BenchmarkConfig, HandlerConfig

# completions_benchmark_generator is a benchmark payload generator defined elsewhere
# in worker.py (see the full vLLM example later on this page)

completions_handler = HandlerConfig(
    route="/v1/completions",
    allow_parallel_requests=True,
    max_queue_time=60.0,
    workload_calculator=lambda payload: float(payload.get("max_tokens", 0)),
    benchmark_config=BenchmarkConfig(
        generator=completions_benchmark_generator,
        runs=16,
        concurrency=100,
    ),
)

Route and basic queueing

  • route: str Path to expose on the PyWorker HTTP server. For example:
    • /v1/completions
    • /v1/chat/completions
    • /generate
  • allow_parallel_requests: bool Controls whether the PyWorker performs internal queueing:
    • False (default):
      • PyWorker enforces strict FIFO queueing to the model server.
      • At most one in-flight request is sent to the model backend at a time for this handler.
      • This is appropriate when the model server itself is single-threaded or cannot handle parallel requests.
    • True:
      • PyWorker forwards requests directly and lets the model backend or serverless engine handle parallelism.
      • Use this for backends that support parallel processing (e.g. vLLM).
  • max_queue_time: float | None Maximum time (in seconds) a request is allowed to remain queued inside the PyWorker before being processed.
    • If a queued request waits longer than max_queue_time:
      • PyWorker responds to the client with HTTP 429 (Too Many Requests).
      • The error is recorded in metrics and logs.
      • The client SDK will automatically retry your request later.
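The Vast client SDK handles this retry for you. If you call a route directly over plain HTTP (for example during testing), you may want similar behavior. A minimal sketch, assuming the requests library and a placeholder endpoint URL:
import time

import requests

ENDPOINT_URL = "https://your-endpoint.example/v1/completions"  # placeholder

def post_with_retry(payload: dict, max_attempts: int = 5) -> requests.Response:
    """POST a payload, retrying with exponential backoff on HTTP 429."""
    response = None
    for attempt in range(max_attempts):
        response = requests.post(ENDPOINT_URL, json=payload, timeout=120)
        if response.status_code != 429:
            return response
        # The request exceeded max_queue_time on the worker; back off and retry
        time.sleep(2 ** attempt)
    return response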

Workload calculation

  • workload_calculator: Callable[[dict], float] | None Defines how much workload (a float) this request represents. This is the key input to autoscaling.
    • Input:
      • A dict representing the model payload (the same dict forwarded to your model server).
    • Output:
      • A float representing workload; larger means “more expensive.”
    Examples:
    # LLM: approximate cost as max_tokens only
    workload_calculator=lambda payload: float(payload.get("max_tokens", 0))
    
    # LLM: prompt tokens + expected output tokens
    def llm_workload(payload: dict) -> float:
        prompt = payload.get("prompt", "")
        max_tokens = payload.get("max_tokens", 0)
        # Very simple proxy: character-based length
        prompt_tokens = len(prompt) / 4.0
        return prompt_tokens + max_tokens
    
    # Constant cost per request
    workload_calculator=lambda payload: 100.0
    
    Behavior on errors:
    • If workload_calculator raises an exception:
      • The request fails.
      • PyWorker logs the error and returns HTTP 500 to the client.
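If you prefer not to fail requests over a missing or malformed field, one option is a defensive calculator that falls back to a constant cost. A minimal sketch (the 100.0 fallback is an arbitrary choice):
def safe_llm_workload(payload: dict) -> float:
    """Use max_tokens as the workload; fall back to a constant if it is missing or invalid."""
    try:
        return float(payload["max_tokens"])
    except (KeyError, TypeError, ValueError):
        # Avoid an HTTP 500 from the workload step; charge a flat cost instead
        return 100.0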

Request parsing: request_parser

  • request_parser: Callable[[dict], dict] | None Optional hook to transform the incoming JSON request into the payload that will be forwarded to the model backend. Key points:
    • Input:
      • The raw JSON body received by PyWorker (already parsed into a dict).
    • Output:
      • A dict representing the model payload.
      • PyWorker will then use this dict as the internal payload and forward it to your model server as JSON.
    Intended usage patterns:
    • Simple pass-through (no parser):
      • If you do not provide request_parser, PyWorker forwards the incoming JSON as-is to the model backend.
      • The same dict is used for workload calculations.
    • Shape transformation:
      • Translate “public API” shape into “backend API” shape:
        def my_request_parser(json_msg: dict) -> dict:
            # Client sends: {"prompt": "...", "max_tokens": 128}
            # Backend expects: {"input_text": "...", "limit": 128}
            return {
                "input_text": json_msg["prompt"],
                "limit": json_msg.get("max_tokens", 0),
            }
        
    • Validation and light on-request hooks:
      • Validate fields and, if needed, mutate the dict in place:
        def guarded_parser(json_msg: dict) -> dict:
            if "prompt" not in json_msg:
                raise ValueError("prompt is required")
            json_msg.setdefault("max_tokens", 256)
            return json_msg
        
    Behavior on errors:
    • Any exception raised in request_parser:
      • Is logged for the instance.
      • Marks the request as errored.
      • The client receives HTTP 500.

Response handling: response_generator

  • response_generator: Callable[[web.Request, ClientResponse], Awaitable[web.StreamResponse | web.Response]] | None Optional hook to transform the model server response into the final client response.
    • Input:
      • client_request: the original aiohttp.web.Request from the client.
      • model_response: the aiohttp.ClientResponse from the model server.
    • Output:
      • An aiohttp.web.Response or aiohttp.web.StreamResponse.
    Example: simple JSON pass-through with custom header:
    from aiohttp import web, ClientResponse
    from typing import Union
    
    async def custom_response_generator(
        client_request: web.Request,
        model_response: ClientResponse,
    ) -> Union[web.Response, web.StreamResponse]:
        data = await model_response.read()
        return web.Response(
            body=data,
            status=model_response.status,
            content_type=model_response.content_type,
            headers={"X-Worker": "my-custom-pyworker"},
        )
    
    Behavior:
    • If you define response_generator, PyWorker calls it and uses the result directly.
    • If your response_generator raises an exception:
      • PyWorker logs the error.
      • The client receives HTTP 500.
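If your backend streams (for example SSE from an OpenAI-compatible server) and you still want a custom hook, a response_generator can relay chunks through a web.StreamResponse. A sketch using the same aiohttp types as above; the 4096-byte chunk size is an arbitrary choice:
from aiohttp import ClientResponse, web

async def streaming_response_generator(
    client_request: web.Request,
    model_response: ClientResponse,
) -> web.StreamResponse:
    # Start a streamed response that mirrors the backend's status and content type
    stream = web.StreamResponse(
        status=model_response.status,
        headers={"Content-Type": model_response.content_type or "text/event-stream"},
    )
    await stream.prepare(client_request)

    # Relay chunks from the model server to the client as they arrive
    async for chunk in model_response.content.iter_chunked(4096):
        await stream.write(chunk)

    await stream.write_eof()
    return stream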

Default response behavior (no response_generator)

If you do not specify a response_generator, PyWorker provides a reasonable default:
  • It detects streaming responses based on:
    • Content-Type starting with text/event-stream
    • Content-Type equal to application/x-ndjson or application/jsonl
    • Content-Type containing "stream" (case-insensitive)
    • Transfer-Encoding: chunked
  • If the response is streaming:
    • PyWorker creates a web.StreamResponse.
    • Copies the appropriate content_type.
    • Streams chunks from the model server to the client as they arrive.
  • If the response is not streaming:
    • PyWorker reads the full body from model_response.
    • Returns a web.Response with:
      • The same status code.
      • The same Content-Type.
      • All headers except Content-Type (which is set directly).
In both paths, PyWorker logs successes and errors and updates internal metrics.
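For reference, those detection rules roughly correspond to a predicate like the following (an illustrative sketch, not the SDK's actual implementation):
def looks_like_streaming(content_type: str | None, transfer_encoding: str | None) -> bool:
    """Approximate the streaming-detection rules described above."""
    ct = (content_type or "").lower()
    te = (transfer_encoding or "").lower()
    return (
        ct.startswith("text/event-stream")
        or ct in ("application/x-ndjson", "application/jsonl")
        or "stream" in ct
        or te == "chunked"
    )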

BenchmarkConfig: measuring performance

Benchmarks run once the worker detects a model load signal via on_load. They are central to how the serverless engine learns the capacity of each worker. A BenchmarkConfig is attached to exactly one handler:
from vastai import BenchmarkConfig

benchmark_config = BenchmarkConfig(
    # Choose exactly one of dataset OR generator
    dataset=[
        {"model": "my-llm", "prompt": "hello world", "max_tokens": 128},
        {"model": "my-llm", "prompt": "another prompt", "max_tokens": 256},
    ],
    # OR
    # generator=completions_benchmark_generator,

    runs=16,
    concurrency=100,
)
Attach it to a handler:
HandlerConfig(
    route="/v1/completions",
    allow_parallel_requests=True,
    workload_calculator=lambda payload: float(payload.get("max_tokens", 0)),
    benchmark_config=benchmark_config,
)
Key semantics:
  • You must configure exactly one HandlerConfig with a BenchmarkConfig.
    • PyWorker enforces that only one handler can be the benchmark handler.
  • Benchmarks start:
    1. PyWorker sees an on_load log line from your model.
    2. It then runs the benchmark on the handler with BenchmarkConfig.
  • The worker becomes ready only after the benchmark finishes successfully.
    • If benchmark runs fail (e.g. errors, timeouts), the worker is treated as errored and will be restarted by the serverless engine.

Benchmark payloads

You can provide benchmark payloads via:
  • dataset: list[dict]
    • A literal list of payloads. PyWorker selects entries (e.g. at random) to send to the model server.
  • generator: Callable[[], dict]
    • A function that returns one payload dict each time it is called.
For clarity and maintainability:
  • Pick one of dataset or generator (do not rely on precedence between them).
  • Make benchmark payloads representative of your “typical” requests:
    • If most traffic is small, do not benchmark only with huge prompts.
    • If traffic is mixed, choose a representative distribution.
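One way to keep benchmark payloads representative of mixed traffic is a generator that samples request sizes from a weighted distribution. A sketch; the word pool, bucket sizes, and weights are placeholders you would tune to your own traffic:
import os
import random

FILLER_WORDS = ["lorem", "ipsum", "dolor", "sit", "amet"]  # stand-in word pool

# Hypothetical (prompt_words, max_tokens) buckets, weighted to roughly match real traffic
TRAFFIC_MIX = [
    ((50, 128), 0.6),     # short prompts, short outputs
    ((250, 500), 0.3),    # medium
    ((1000, 1000), 0.1),  # long
]

def mixed_benchmark_generator() -> dict:
    """Return one benchmark payload sampled from the traffic mix above."""
    buckets = [bucket for bucket, _ in TRAFFIC_MIX]
    weights = [weight for _, weight in TRAFFIC_MIX]
    prompt_words, max_tokens = random.choices(buckets, weights=weights, k=1)[0]
    return {
        "model": os.environ.get("MODEL_NAME", "my-llm"),
        "prompt": " ".join(random.choices(FILLER_WORDS, k=prompt_words)),
        "max_tokens": max_tokens,
    }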

Runs and concurrency

  • runs: int Number of benchmark rounds.
  • concurrency: int Number of concurrent requests per run if allow_parallel_requests=True.
    • If allow_parallel_requests=False:
      • Effective concurrency is clamped; your backend will process benchmark requests serially despite a larger concurrency value.
The serverless engine uses the observed throughput (workload completed per unit time) to estimate capacity. Your chosen workload function and these benchmark settings directly influence how it sizes hot and cold capacity.
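As a purely illustrative calculation of how these settings combine (the SDK measures this for you): if 8 runs of 10 concurrent requests, each worth 500 workload units, finish in 40 seconds, the observed throughput is 1000 workload units per second.
runs = 8
concurrency = 10
workload_per_request = 500.0   # e.g. max_tokens in each benchmark payload
elapsed_seconds = 40.0         # hypothetical total benchmark duration

throughput = runs * concurrency * workload_per_request / elapsed_seconds
print(throughput)  # 1000.0 workload units per second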

Autoscaling and workload (conceptual overview)

PyWorker does not expose the full autoscaling algorithm, but conceptually:
  • Each request is assigned a workload (a float) by your workload_calculator.
  • Benchmarks estimate how many units of workload per second a worker can handle on a given handler.
  • At runtime, the serverless engine:
    • Tracks workload being requested by clients.
    • Tracks workload being processed by each worker.
    • Adjusts, to right-size capacity against current and predicted workload:
      • Hot capacity (running workers ready to serve)
      • Cold capacity (stopped workers that can be started quickly)
For LLMs, we recommend:
  • Workload ≈ prompt tokens + expected output tokens (or just max_tokens as a simpler proxy).
For other workloads, a common approach is:
  • Set a constant workload per request (e.g. 100.0) so effective capacity is “requests per second”.

Example: vLLM-style worker.py

Below is a complete worker.py for a vLLM-style model server that exposes:
  • /v1/completions
  • /v1/chat/completions
Both endpoints:
  • Treat max_tokens as the workload metric.
  • Allow parallel requests.
  • Use a benchmark generator that builds random prompts.
import os
import random

import nltk

from vastai import (
    Worker,
    WorkerConfig,
    HandlerConfig,
    LogActionConfig,
    BenchmarkConfig,
)

# --- Model configuration ------------------------------------------------------

MODEL_SERVER_URL  = "http://127.0.0.1"
MODEL_SERVER_PORT = 18000
MODEL_LOG_FILE    = "/var/log/portal/vllm.log"

# vLLM-specific log messages
MODEL_LOAD_LOG_MSG = [
    "Application startup complete.",
]

MODEL_ERROR_LOG_MSGS = [
    "INFO exited: vllm",
    "RuntimeError: Engine",
    "Traceback (most recent call last):",
]

MODEL_INFO_LOG_MSGS = [
    '"message":"Download',
]

# --- Benchmark data generation -----------------------------------------------

# For this example we use NLTK's word list to create random prompts
nltk.download("words")
WORD_LIST = nltk.corpus.words.words()

def completions_benchmark_generator() -> dict:
    """Generate one benchmark payload for the /v1/completions endpoint.
    This shape should match what your vLLM server expects.
    """
    prompt = " ".join(random.choices(WORD_LIST, k=250))

    model = os.environ.get("MODEL_NAME")
    if not model:
        raise ValueError("MODEL_NAME environment variable not set")

    return {
        "model": model,
        "prompt": prompt,
        "temperature": 0.7,
        "max_tokens": 500,
    }

# --- Worker configuration -----------------------------------------------------

worker_config = WorkerConfig(
    model_server_url=MODEL_SERVER_URL,
    model_server_port=MODEL_SERVER_PORT,
    model_log_file=MODEL_LOG_FILE,

    handlers=[
        # /v1/completions: also used as the benchmark handler
        HandlerConfig(
            route="/v1/completions",

            # Allow vLLM to schedule parallel requests internally
            allow_parallel_requests=True,

            # Maximum time a request may sit in any internal queue before being rejected
            max_queue_time=60.0,

            # Workload: use max_tokens as a simple cost proxy
            workload_calculator=lambda payload: float(payload.get("max_tokens", 0)),

            benchmark_config=BenchmarkConfig(
                # Use our generator to produce payloads
                generator=completions_benchmark_generator,
                runs=8,
                concurrency=10,
            ),
        ),

        # /v1/chat/completions: similar behavior but no benchmark_config
        HandlerConfig(
            route="/v1/chat/completions",
            allow_parallel_requests=True,
            max_queue_time=60.0,
            workload_calculator=lambda payload: float(payload.get("max_tokens", 0)),
        ),
    ],

    log_action_config=LogActionConfig(
        on_load=MODEL_LOAD_LOG_MSG,
        on_error=MODEL_ERROR_LOG_MSGS,
        on_info=MODEL_INFO_LOG_MSGS,
    ),
)

# Run the worker synchronously
Worker(worker_config).run()

# Or run asynchronously if you need to do other Python work:
# import asyncio
# asyncio.run(Worker(worker_config).run_async())

How requests and responses behave end-to-end

Putting the pieces together, a typical request/response flow looks like this:
  1. Client calls your Serverless Endpoint on one of your routes, e.g. POST /v1/completions with JSON body:
     {
         "model": "Qwen/Qwen3-8B",
         "prompt": "What is 2 + 2?",
         "max_tokens": 128,
         "temperature": 0.7
     }
    
  2. The Serverless router forwards this to the appropriate PyWorker instance’s /v1/completions route.
  3. The HandlerConfig for /v1/completions:
    • Optionally runs request_parser (if configured) to transform the request.
    • Runs workload_calculator to compute workload.
    • Either:
      • Queues the request (FIFO) if allow_parallel_requests=False, or
      • Forwards it immediately to the model backend if True.
  4. PyWorker sends the request payload (as JSON) to your model server at model_server_url:model_server_port.
  5. When the model responds:
    • If you defined response_generator, PyWorker calls it and returns its result.
    • Otherwise, PyWorker:
      • Detects whether the response is streaming or not.
      • Either pipes the stream to the client or returns a standard JSON response.
  6. Any exceptions in parsing, forwarding, or response handling:
    • Are logged in the worker’s logs.
    • Produce an HTTP 500 response to the client.
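For example, the client call in step 1 might look like the following in Python (the endpoint URL and any authentication headers are placeholders; use the values from your own Serverless Endpoint):
import requests

ENDPOINT_URL = "https://your-endpoint.example/v1/completions"  # placeholder

response = requests.post(
    ENDPOINT_URL,
    json={
        "model": "Qwen/Qwen3-8B",
        "prompt": "What is 2 + 2?",
        "max_tokens": 128,
        "temperature": 0.7,
    },
    timeout=120,
)
response.raise_for_status()
print(response.json())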

Legacy support: existing EndpointHandler / ApiPayload implementations

If you have existing PyWorkers implemented using the older pattern (server.py, data_types.py, EndpointHandler, ApiPayload), you can still run them under the new Worker abstraction by using two escape hatches in HandlerConfig:
  • handler_class: Type[EndpointHandler]
  • payload_class: Type[ApiPayload]
Example:
from vastai import Worker, WorkerConfig, HandlerConfig, LogActionConfig
from my_legacy_worker.server import GenerateHandler  # Your existing EndpointHandler

worker_config = WorkerConfig(
    model_server_url="http://127.0.0.1",
    model_server_port=5001,
    model_log_file="/var/log/legacy_model.log",
    handlers=[
        HandlerConfig(
            route="/generate",
            handler_class=GenerateHandler,  # Use your existing handler directly
        ),
    ],
    log_action_config=LogActionConfig(
        on_load=["infer server has started"],
        on_error=["Exception: corrupted model file"],
        on_info=['"message":"Download'],
    ),
)

Worker(worker_config).run()
Important notes:
  • When handler_class is provided:
    • PyWorker instantiates your EndpointHandler directly.
    • Other HandlerConfig fields are not applied to it.
    • Queueing, workload calculation, and payload handling are all controlled by your legacy class.
  • This mechanism exists primarily for backward compatibility:
    • It lets you keep old workers running while Vast evolves the SDK.
    • For new projects, we strongly recommend using the modern WorkerConfig + HandlerConfig + BenchmarkConfig + LogActionConfig approach rather than implementing EndpointHandler and ApiPayload directly.
This keeps the maintenance burden on the Vast SDK rather than on your own internal abstraction layer.

Linking worker.py to your Serverless Endpoint

Finally, to make Vast actually use your worker.py:
  1. Put worker.py and requirements.txt at the root of a public Git repository.
  2. In your Serverless template configuration:
    • Set the environment variable PYWORKER_REPO to that Git repo URL.
  3. The start-server script on each worker will:
    • Clone PYWORKER_REPO.
    • Install requirements.txt.
    • Start your model server.
    • Run python worker.py.
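For step 1, a minimal repository layout might look like this (anything beyond worker.py and requirements.txt is up to you):
my-pyworker-repo/
├── worker.py          # builds WorkerConfig and calls Worker(worker_config).run()
└── requirements.txt   # Python dependencies installed by the start-server script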
Once deployed:
  • Your worker instances will:
    • Tail the model log file.
    • Wait for on_load logs.
    • Run benchmarks on the configured benchmark handler.
    • Join the ready pool once benchmarking completes successfully.
At that point, your Serverless Endpoint is fully backed by your custom worker.py implementation.