Introduction
Autoresearch is Andrej Karpathy’s framework for autonomous AI-driven ML research. The idea is simple: point an AI agent (Claude Code) at a small but real LLM training setup and let it experiment autonomously overnight. The agent modifies the model code, trains for 5 minutes, checks if the result improved, keeps or discards, and repeats — running ~12 experiments per hour, ~100 overnight.
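The keep-or-discard loop at the heart of this can be sketched in a few lines. This is an illustration only — `run_experiment` stands in for a real 5-minute training run and here just returns a deterministic fake score; in the real system, Claude edits the code and the score is the validation metric:

```python
def run_experiment(code):
    # Stand-in for a real 5-minute training run; returns a score
    # where lower is better. Faked deterministically for illustration.
    return (sum(ord(c) for c in code) % 100) / 100

def research_loop(baseline_code, candidates):
    """Try each candidate change; keep it only if it beats the best so far."""
    best_code = baseline_code
    best_score = run_experiment(baseline_code)
    for candidate in candidates:
        score = run_experiment(candidate)
        if score < best_score:
            # Improvement: keep the change as the new baseline.
            best_code, best_score = candidate, score
        # Otherwise: discard and move on to the next idea.
    return best_code, best_score
```

Because each experiment is short and the accept/reject rule is mechanical, the loop compounds small wins over hundreds of iterations.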
This guide walks you through setting up autoresearch on a Vast.ai GPU instance with Claude Code as the autonomous research agent.
Prerequisites
Install the Vast CLI if you haven’t already:
pip install vastai
vastai set api-key YOUR_API_KEY
Rent a GPU Instance
Autoresearch requires a single NVIDIA GPU with 80GB VRAM (H100 or A100 80GB). It needs CUDA 12.8+ and about 50GB of disk for the repo, data, and dependencies.
Search for available instances:
vastai search offers 'gpu_ram>=70 num_gpus=1 cuda_vers>=12.8 disk_space>=50 reliability>0.95' -o 'dph+'
Pick an instance ID from the results and rent it:
vastai create instance INSTANCE_ID \
--image vastai/pytorch \
--disk 50 \
--ssh \
--direct
Wait for the instance to be ready:
vastai show instances
Once the status shows running, get your SSH connection details:
vastai ssh-url INSTANCE_ID
Set Up the Environment
SSH into your instance using the host and port from the previous step:
ssh -p PORT root@HOST_IP
Vast instances start in a tmux session by default. This keeps your processes running if your SSH connection drops — essential for overnight research runs.
Install uv
uv is the package manager used by autoresearch:
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
Install Claude Code
Claude Code requires Node.js:
curl -fsSL https://deb.nodesource.com/setup_22.x | bash -
apt-get install -y nodejs
npm install -g @anthropic-ai/claude-code
Prepare Data and Run Baseline
Clone and install
cd /workspace
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
uv sync
Prepare the data
This downloads training data from HuggingFace and trains a BPE tokenizer. It takes about 2 minutes:
uv run python prepare.py
Data is cached in ~/.cache/autoresearch/ — you only need to run this once.
Run a baseline experiment
Verify everything works by running a single 5-minute training experiment:
uv run python train.py
After ~5 minutes you’ll see output like:
---
val_bpb: 0.995583
training_seconds: 300.3
total_seconds: 349.8
peak_vram_mb: 45060.2
mfu_percent: 39.57
total_tokens_M: 497.0
num_steps: 948
num_params_M: 50.3
depth: 8
The key metric is val_bpb (validation bits per byte) — lower is better. Note this baseline number; Claude will try to beat it.
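For intuition, bits per byte is just the cross-entropy loss converted from nats per token to bits, normalized by how many bytes each token covers. A small sketch (the 4.2 bytes-per-token figure below is an assumed illustration, not the repo's actual tokenizer statistic):

```python
import math

def bits_per_byte(loss_nats_per_token, bytes_per_token):
    """Convert cross-entropy loss (nats/token) into bits per byte."""
    bits_per_token = loss_nats_per_token / math.log(2)
    return bits_per_token / bytes_per_token

# e.g. a loss of 2.9 nats/token at an assumed ~4.2 bytes/token
# gives roughly 1 bit per byte, in the same ballpark as the
# baseline val_bpb above.
```

Normalizing by bytes rather than tokens makes runs comparable even if an experiment changes the tokenizer's compression rate.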
Launch Autonomous Research
Claude Code normally asks for permission before running commands or editing files. For autonomous overnight research, you need to pre-approve the tools Claude will use. Create a settings file in the autoresearch directory:
mkdir -p /workspace/autoresearch/.claude
cat > /workspace/autoresearch/.claude/settings.json << 'EOF'
{
"permissions": {
"allow": [
"Read",
"Edit",
"Write",
"Bash"
]
}
}
EOF
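Before leaving the agent unattended overnight, it is worth sanity-checking that the file parses and lists every tool you expect. A small verification sketch (the required tool names are the ones from the file above):

```python
import json

def check_settings(path):
    """Load a Claude Code settings file and verify the pre-approved tools."""
    with open(path) as f:
        settings = json.load(f)
    allowed = settings.get("permissions", {}).get("allow", [])
    required = {"Read", "Edit", "Write", "Bash"}
    missing = required - set(allowed)
    if missing:
        raise SystemExit(f"settings.json is missing tool permissions: {missing}")
    return allowed
```

A stray comma or typo in a tool name would otherwise only surface hours later, when Claude stalls waiting for a permission prompt nobody is around to answer.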
This tells Claude Code to run these commands without asking — essential for unattended operation.
Start Claude Code
cd /workspace/autoresearch
claude
When Claude Code starts for the first time, it prompts you to log in to your Anthropic account (you can also trigger this with the /login command). It gives you a URL to open in your browser; follow the prompts to authenticate, then you're ready to go.
Kick off the research loop:
Hi have a look at program.md and let's kick off a new experiment! let's do the setup first.
Claude will:
- Read program.md for the research guidelines
- Create a fresh git branch (e.g. autoresearch/mar10)
- Run the baseline experiment
- Begin the autonomous loop — modifying train.py, training for 5 minutes, evaluating, keeping improvements, discarding regressions
- Log all results to results.tsv
Claude runs indefinitely until manually stopped. Each experiment takes ~5 minutes, so you can expect ~12 experiments/hour and ~100 experiments overnight. Each iteration also uses Claude API tokens.
What Claude can modify
Claude has full freedom to edit train.py — the model architecture, optimizer, hyperparameters, batch size, model size, training loop. The only constraints are:
- prepare.py is read-only — the evaluation harness and data loading are fixed
- No new packages — only dependencies in pyproject.toml
- 5-minute time budget — every experiment runs for exactly 5 minutes
Monitoring progress
In another tmux pane (Ctrl+b then %), you can watch the experiment log:
watch -n 30 cat /workspace/autoresearch/results.tsv
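To find the best run so far without eyeballing the raw log, you can pick the row with the lowest val_bpb. This sketch assumes results.tsv has a header row with a val_bpb column; adjust the column name to the actual schema:

```python
import csv

def best_run(path):
    """Return the row with the lowest val_bpb from a tab-separated log."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    return min(rows, key=lambda r: float(r["val_bpb"]))
```

Running this periodically gives a one-line summary of whether the overnight run is actually making progress.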
Or check the git log to see what Claude has tried:
cd /workspace/autoresearch
git log --oneline -20
Cleanup
When you’re done, download your results and destroy the instance:
# From your local machine — copy results
scp -P PORT root@HOST_IP:/workspace/autoresearch/results.tsv ./results.tsv
# Destroy the instance
vastai destroy instance INSTANCE_ID
Destroying an instance permanently deletes all data on it. Make sure to copy any results you want to keep before destroying.
Additional Resources