NVIDIA Nemotron 3 Super is a 120B-parameter model that activates only 12B parameters per token, delivering the quality of a much larger model at a fraction of the compute. It uses a novel hybrid architecture — Mamba-2 for fast sequence processing, Transformer attention where precision matters, and a Latent Mixture-of-Experts layer for efficient routing — and supports context windows up to 1M tokens.

The model is particularly interesting because it ships with a built-in reasoning toggle. You can turn reasoning on for complex tasks like math and coding, switch to a low-effort mode for lighter thinking, or turn it off entirely for fast direct answers — all from the same deployment, controlled per request. It also supports Multi-Token Prediction for faster inference through speculative decoding, and performs well on agentic benchmarks involving tool use and multi-step task execution.

This guide deploys the FP8 variant on Vast.ai using SGLang and queries it via the OpenAI-compatible API.
Prerequisites
Before getting started, you’ll need:
- A Vast.ai account with credits (Sign up here)
- Vast.ai CLI installed (pip install vastai)
- Your Vast.ai API key configured
- Python 3.8+ (for the OpenAI SDK examples)
Get your API key from the Vast.ai account page and set it with vastai set api-key YOUR_API_KEY.

Understanding Nemotron 3 Super
Key capabilities:
- Efficient MoE Architecture: 120B total parameters, only 12B active per token
- Hybrid Layers: Mamba-2 (linear-time) + Transformer attention + Latent MoE
- Reasoning Toggle: On, off, or low-effort modes via chat_template_kwargs
- Long Context: Up to 1M tokens (256K default)
- Commercial License: NVIDIA Nemotron Open Model License
Hardware Requirements
The FP8 variant requires:
- GPUs: 2× H100-80GB. NVIDIA’s model card lists H100, H200, and GB200 as supported.
- Disk Space: 200GB minimum (model is ~120GB)
- CUDA Version: 12.4 or higher
- Docker Image: lmsysorg/sglang:v0.5.11 (Nemotron-3-Super support landed in v0.5.10)
Prefer H100 SXM variants when available — NVLink improves multi-GPU throughput over PCIe — but the FP8 model works on any Hopper-class or newer GPU pair with ≥80 GB VRAM each.
Instance Configuration
Step 1: Search for Suitable Instances
Search for an offer that matches the model’s requirements:
- 2× H100 SXM GPUs with at least 80GB VRAM each
- CUDA 12.4 or higher
- At least 200GB disk space
- Direct port access for the API endpoint
- High download speed for faster model loading
- Sorted by price (lowest first)
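One way to express these criteria with the CLI is sketched below; the filter values are assumptions to tune for your needs, and field names can be confirmed with vastai search offers --help:

```shell
# Offers matching the requirements above, cheapest first ("dph" = $/hour).
# Filter values are starting points -- adjust to your needs.
vastai search offers 'num_gpus=2 gpu_name=H100_SXM cuda_vers>=12.4 disk_space>=200 direct_port_count>=1 inet_down>=500' -o 'dph'
```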
Step 2: Create the Instance
Select an instance ID from the search results and deploy:
- --image lmsysorg/sglang:v0.5.11 — first stable SGLang line that ships Nemotron-3-Super support (added in v0.5.10)
- --env '-p 5000:5000' — Expose port 5000 for the API endpoint
- --disk 200 — 200GB for the ~120GB model weights plus overhead
- --tp 2 --ep 2 — Tensor and expert parallelism across both GPUs (NVIDIA’s reference command uses --tp 4 --ep 4 on 4 GPUs; scale these together with the GPU count)
- --kv-cache-dtype fp8_e4m3 — FP8 KV cache for efficient memory usage
- --tool-call-parser qwen3_coder — Parses tool-call output (Nemotron 3 Super uses the Qwen3-Coder tool-call format)
- --reasoning-parser nemotron_3 — Parses the model’s thinking-vs-answer split when reasoning is enabled
- --trust-remote-code — Required for the custom Nemotron architecture
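Put together, the create command might look like the sketch below. The instance ID comes from the search step, and the model repo id is an assumed placeholder: take the exact id from the HuggingFace model card before deploying.

```shell
# Sketch only -- INSTANCE_ID comes from the search step; the model path
# below is an assumed placeholder (verify it on the HuggingFace model card).
vastai create instance INSTANCE_ID \
  --image lmsysorg/sglang:v0.5.11 \
  --env '-p 5000:5000' \
  --disk 200 \
  --onstart-cmd "python3 -m sglang.launch_server \
    --model-path nvidia/Nemotron-3-Super-FP8 \
    --host 0.0.0.0 --port 5000 \
    --tp 2 --ep 2 \
    --kv-cache-dtype fp8_e4m3 \
    --tool-call-parser qwen3_coder \
    --reasoning-parser nemotron_3 \
    --trust-remote-code"
```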
Monitoring Deployment
Check Deployment Status
Watch the instance status until it reports running; the model then needs several more minutes to load before the API responds.
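A hedged sketch of the checks (INSTANCE_ID is a placeholder; confirm subcommand names with vastai --help):

```shell
# Wait for the status column to read "running".
vastai show instances

# Follow the container logs: SGLang still needs several minutes to download
# and load the ~120GB of weights before the API starts answering.
vastai logs INSTANCE_ID
```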
Get Your Endpoint
Once deployment completes, get your instance details:
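A sketch of the lookup, assuming the CLI’s --raw flag for machine-readable output:

```shell
# JSON output includes each instance's public IP and port mappings.
vastai show instances --raw
```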
Check the ports field — it maps internal port 5000 to an external port. Your API endpoint will be:

http://INSTANCE_IP:EXTERNAL_PORT
Using the Nemotron 3 Super API
Quick Test with cURL
Verify the server is responding:
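A minimal request sketch; the host, port, and model id are placeholders (a GET to /v1/models lists the exact served model name):

```shell
curl http://INSTANCE_IP:EXTERNAL_PORT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Nemotron-3-Super-FP8",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "temperature": 1.0,
    "top_p": 0.95
  }'
```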
NVIDIA requires temperature=1.0 and top_p=0.95 for all inference with this model.

Python Integration
Using the OpenAI Python SDK:
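A sketch with the OpenAI SDK; the base URL and model id are placeholders to replace with your instance’s values:

```python
from openai import OpenAI

# Placeholders: substitute your instance's public IP and mapped external port,
# and take the exact model id from GET /v1/models or the model card.
client = OpenAI(
    base_url="http://INSTANCE_IP:EXTERNAL_PORT/v1",
    api_key="not-needed",  # SGLang does not require a key by default
)

response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Super-FP8",  # assumed repo id
    messages=[
        {"role": "user", "content": "Summarize FP8 inference in two sentences."}
    ],
    temperature=1.0,  # NVIDIA-required sampling settings
    top_p=0.95,
)
print(response.choices[0].message.content)
```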
Reasoning Modes
Nemotron 3 Super supports three reasoning modes, controlled via chat_template_kwargs. By default, reasoning is enabled.
Reasoning ON (Default)
The model shows its thinking in reasoning_content before giving the final answer in content:
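A sketch using the SDK’s extra_body passthrough. The inner key/value inside chat_template_kwargs is a hypothetical placeholder; the model card documents the exact toggle name.

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://INSTANCE_IP:EXTERNAL_PORT/v1",  # placeholder endpoint
    api_key="not-needed",  # SGLang does not require a key by default
)

resp = client.chat.completions.create(
    model="nvidia/Nemotron-3-Super-FP8",  # assumed repo id
    messages=[{"role": "user", "content": "If x + 3 = 11, what is x?"}],
    temperature=1.0,  # NVIDIA-required sampling settings
    top_p=0.95,
    extra_body={"chat_template_kwargs": {"reasoning": "on"}},  # hypothetical key/value
)

# The reasoning parser splits the output into two fields:
print(resp.choices[0].message.reasoning_content)  # the model's thinking
print(resp.choices[0].message.content)            # the final answer
```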
Reasoning OFF
Disable reasoning for faster, direct responses:
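A sketch of the request body; only the chat_template_kwargs entry (inner key/value hypothetical, see the model card) selects the mode:

```python
# Request body for reasoning-off mode. The inner "reasoning": "off" pair is
# a hypothetical placeholder -- check the model card for the exact toggle.
request_body = {
    "model": "nvidia/Nemotron-3-Super-FP8",  # assumed repo id
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 1.0,  # NVIDIA-required sampling settings
    "top_p": 0.95,
    "chat_template_kwargs": {"reasoning": "off"},  # hypothetical key/value
}

# With the OpenAI SDK, pass the toggle via
# extra_body={"chat_template_kwargs": request_body["chat_template_kwargs"]};
# with curl, send the body as-is.
print(request_body["chat_template_kwargs"])
```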
Low-Effort Reasoning
A middle ground — brief reasoning with fast responses:
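The request shape stays the same; only the (hypothetical) toggle value changes:

```python
# Low-effort mode: same request shape, different toggle value. The inner
# key/value is hypothetical -- check the model card for the exact name.
request_body = {
    "model": "nvidia/Nemotron-3-Super-FP8",  # assumed repo id
    "messages": [{"role": "user", "content": "Is 2**10 equal to 1024?"}],
    "temperature": 1.0,  # NVIDIA-required sampling settings
    "top_p": 0.95,
    "chat_template_kwargs": {"reasoning": "low"},  # hypothetical key/value
}
print(request_body["chat_template_kwargs"])
```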
Reasoning with cURL
Pass chat_template_kwargs at the top level of the JSON body:
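A sketch with placeholders for host, port, and model id; the inner key/value inside chat_template_kwargs is hypothetical (see the model card):

```shell
curl http://INSTANCE_IP:EXTERNAL_PORT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Nemotron-3-Super-FP8",
    "messages": [{"role": "user", "content": "How many primes are below 20?"}],
    "temperature": 1.0,
    "top_p": 0.95,
    "chat_template_kwargs": {"reasoning": "on"}
  }'
```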
Cleanup
When you’re done, destroy the instance to stop billing:
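For example (INSTANCE_ID is a placeholder for your instance’s id):

```shell
# Destroying stops billing immediately; a merely stopped instance still
# accrues storage charges.
vastai destroy instance INSTANCE_ID
```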
Always destroy your instance when you’re finished to avoid unnecessary charges.
Additional Resources
- NVIDIA Nemotron 3 Super Blog Post — Architecture details and benchmarks
- HuggingFace Model Card (FP8) — Model card and usage instructions
- SGLang Documentation — SGLang configuration and usage
- Vast.ai CLI Guide — Learn more about the Vast.ai CLI
- GPU Instance Guide — Understanding Vast.ai instances