
Primary Components
Endpoints
An Endpoint is the highest-level construct in Vast Serverless. Endpoints are configured with endpoint-level parameters that control scaling behavior, capacity limits, and utilization targets. An endpoint consists of:- A named endpoint identifier
- Typically one workergroup
- Endpoint parameters such as
max_workers,min_load,min_workers,cold_mult,min_cold_load, andtarget_util
Workergroups
A Workergroup defines what code runs on the endpoint (via the template), as well as how workers are recruited and created. Workergroups are configured with workergroup-level parameters and are responsible for selecting which GPU offers are eligible for worker creation. Each Workergroup includes:- A serverless-compatible template (referenced by
template_idortemplate_hash) - Hardware and marketplace filters defined via
search_params - Optional instance configuration overrides via
launch_args - Hardware requirements such as
gpu_ram - A set of GPU instances (workers) created from the template
Workers
Workers are individual GPU instances created and managed by the Serverless engine. Each worker runs a PyWorker, a Python web server that monitors the inference server’s readiness, proxies incoming requests, and coordinates with the autoscaler. Workers can exist in active or inactive states and are responsible for:- Receiving and processing inference requests
- Reporting performance metrics (load, utilization, benchmark results)
- Informing automated scaling and routing decisions
Serverless Engine
The Serverless Engine is the decision-making service that routes incoming requests and manages workers across all endpoints and workergroups. Using configuration parameters and real-time metrics, it determines when to:- Recruit new workers
- Activate inactive workers
- Release or destroy workers
SDK
The Serverless SDK is the primary interface for interacting with Vast Serverless. It is a Pythonpip package that abstracts low-level details and manages:
- Authentication
- Request queuing, retries, and error handling
- Asynchronous request management
- Worker status and lifecycle information
Example Workflow
- The client application sends a request using the Serverless SDK.
- The Serverless system routes the request and returns a suitable worker address based on current load and capacity.
- The client sends the request directly to the selected worker’s API endpoint, including the required authentication data.
- The PyWorker running on the GPU instance forwards the request to the machine learning model and performs inference.
- The inference result is returned to the client application.
- Independently and continuously, each PyWorker reports operational and performance metrics back to the Serverless Engine, which uses this data to make ongoing scaling decisions.