`AsyncLLMEngine` and serves text generation requests.
Deployment
Client
What This Demonstrates
- Using `@context` with vLLM's `AsyncLLMEngine` for high-performance LLM serving
- Warmup generation in `__aenter__` to pre-allocate the KV cache before benchmarking
- Proper cleanup with `shutdown_background_loop()` in `__aexit__`
- Using `image.use_system_python()` to use the image's built-in Python environment
- Specifying exact package versions for reproducibility
- A simple benchmark dataset with a short prompt
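The warmup-in-`__aenter__` / shutdown-in-`__aexit__` lifecycle above can be sketched as an async context manager. This is a minimal sketch, not the example's actual code: a `StubEngine` stands in for vLLM's `AsyncLLMEngine` (the real engine needs a GPU and model weights), and the `@context` decorator from the harness is omitted. `shutdown_background_loop()` is a real `AsyncLLMEngine` method; everything else here is illustrative.

```python
import asyncio


class StubEngine:
    """Stand-in for vLLM's AsyncLLMEngine (the real one needs a GPU)."""

    async def generate(self, prompt: str) -> str:
        await asyncio.sleep(0)  # pretend to do inference work
        return f"completion for: {prompt}"

    def shutdown_background_loop(self) -> None:
        # The real engine stops its background event-loop task here.
        pass


class EngineContext:
    """Async context manager mirroring the warmup/cleanup pattern."""

    async def __aenter__(self) -> StubEngine:
        self.engine = StubEngine()
        # Warmup request: with the real engine this forces KV-cache
        # allocation, so the first benchmarked request is not an outlier.
        await self.engine.generate("warmup")
        return self.engine

    async def __aexit__(self, exc_type, exc, tb) -> None:
        # Cleanup runs even if the body raised.
        self.engine.shutdown_background_loop()


async def main() -> str:
    async with EngineContext() as engine:
        return await engine.generate("Hello")


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Keeping the warmup inside `__aenter__` means every benchmark run measures steady-state latency, and `__aexit__` guarantees the engine's background loop is torn down even when a request fails.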