Serving Rerankers on Vast.ai with vLLM
Rerankers score the relevance between pairs of texts: matching search queries to documents, evaluating LLM outputs, or finding similar content. They compare each pair in detail and capture nuanced relationships that simpler methods miss. This guide covers deploying the BAAI/bge-reranker-base model on Vast.ai using vLLM, with both OpenAI-compatible and Cohere-compatible APIs.
When to Use Rerankers
Embedding models with cosine similarity are fast and cheap: they encode each text once and compare vectors. But they compress meaning into fixed-size vectors, which loses nuance. Rerankers process each query-document pair together through a cross-encoder, capturing subtle relationships that embeddings miss.
| Approach | Speed | Accuracy | Best For |
|---|---|---|---|
| Embeddings + cosine | Fast | Good | Initial retrieval, large candidate sets |
| Reranker | Slower | Better | Final ranking, top-k refinement |
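The two approaches are usually combined: embeddings shortlist candidates, then the reranker orders the shortlist. The following is a minimal sketch of that pattern, not code from this guide. The function names, the `candidates` and `top_k` parameters, and the `rerank_fn` callback (standing in for a call to the deployed reranker) are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query: str,
    query_vec: Sequence[float],
    docs: Sequence[str],
    doc_vecs: Sequence[Sequence[float]],
    rerank_fn: Callable[[str, List[str]], List[float]],
    candidates: int = 3,
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap vector comparison over the whole corpus.
    order = sorted(
        range(len(docs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    shortlist = [docs[i] for i in order[:candidates]]
    # Stage 2: the slower reranker scores only the shortlist.
    scores = rerank_fn(query, shortlist)
    ranked = sorted(zip(shortlist, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]
```

Keeping `candidates` small bounds the number of cross-encoder calls, which is where most of the latency goes.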
Prerequisites
- Vast.ai account with credits
- Vast.ai CLI installed (pip install vastai)
Hardware Requirements
The BAAI/bge-reranker-base model (~278M parameters) has modest requirements:
- GPU RAM: 16GB (8GB may work for lower throughput)
- GPU: Single GPU, Turing architecture or newer
- Network: Static IP and at least one direct port
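These requirements can be expressed as a filter for the Vast.ai CLI's offer search. The sketch below only builds the command; it assumes the CLI is installed and configured, and that the filter fields (`compute_cap`, `gpu_ram`, `num_gpus`, `static_ip`, `direct_port_count`) match your CLI version, so check them against `vastai search offers --help`. The helper names are hypothetical.

```python
import shlex
from typing import List


def build_offer_query(min_gpu_ram_gb: int = 16, min_compute_cap: int = 750) -> str:
    """Build a filter string for `vastai search offers`.

    compute_cap is the CUDA compute capability times 100, so 750 means
    7.5, which corresponds to Turing.
    """
    filters = [
        f"compute_cap >= {min_compute_cap}",
        f"gpu_ram >= {min_gpu_ram_gb}",
        "num_gpus = 1",            # single GPU
        "static_ip = true",        # static IP
        "direct_port_count >= 1",  # at least one direct port
    ]
    return " ".join(filters)


def build_search_command(query: str) -> List[str]:
    """Argument list suitable for subprocess.run (not executed here)."""
    return ["vastai", "search", "offers", query]


if __name__ == "__main__":
    # Print a copy-pasteable shell command instead of running it.
    print(shlex.join(build_search_command(build_offer_query())))
```

Passing `min_gpu_ram_gb=8` widens the search to the smaller cards mentioned above, at the cost of throughput.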
Setting Up the CLI
Install and configure the Vast.ai CLI:
Finding an Instance
Search for suitable instances:
Deploying the Server
First, generate a secure API key to protect your endpoint:
Verifying the Deployment
- Go to Instances in the Vast.ai console
- Wait for the image and model to download
- Find your instance’s IP and external port from “Open Ports” (format: XX.XX.XXX.XX:YYYY -> 8000/tcp)
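Once the port mapping is visible, a quick way to confirm the server is up is to list its models. The sketch below assumes the endpoint is served over plain HTTP, that vLLM's `/v1/models` route is reachable on the mapped port, and that the API key is sent as a bearer token. The helper names and the example address are illustrative only.

```python
import json
import re
import urllib.request
from typing import List


def base_url_from_port_mapping(mapping: str) -> str:
    """Turn an "Open Ports" entry like "203.0.113.7:41234 -> 8000/tcp" into a base URL."""
    match = re.match(r"\s*([\d.]+):(\d+)\s*->\s*8000/tcp\s*$", mapping)
    if match is None:
        raise ValueError(f"unrecognized port mapping: {mapping!r}")
    host, port = match.groups()
    return f"http://{host}:{port}"


def models_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build an authenticated GET request for the model listing route."""
    return urllib.request.Request(
        f"{base_url}/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def list_models(base_url: str, api_key: str, timeout: float = 10.0) -> List[str]:
    """Return the model IDs the server reports; raises if it is not reachable yet."""
    request = models_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    return [model["id"] for model in payload["data"]]
```

If `list_models` raises a connection error, the image or model is probably still downloading; retry after a minute or two.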
Using the Reranker
vLLM provides two API endpoints:
| Endpoint | API Style | Use Case |
|---|---|---|
| /score | OpenAI | Raw scores for custom ranking logic |
| /rerank | Cohere | Pre-sorted results for quick integration |
OpenAI-Compatible Endpoint (/score)
The /score endpoint returns a raw relevance score for each query-document pair, which gives you full control over the ranking logic.
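A minimal sketch of calling this endpoint with only the standard library follows. It assumes the request body uses vLLM's `text_1` / `text_2` fields and that the response carries a `data` list of `{index, score}` items; field names can differ between vLLM versions, so treat them as assumptions. The function names are hypothetical.

```python
import json
import urllib.request
from typing import List, Sequence, Tuple

MODEL = "BAAI/bge-reranker-base"


def score_request(
    base_url: str, api_key: str, query: str, documents: Sequence[str], model: str = MODEL
) -> urllib.request.Request:
    """Build a POST to /score comparing one query against several documents."""
    body = {"model": model, "text_1": query, "text_2": list(documents)}
    return urllib.request.Request(
        f"{base_url}/score",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def rank_by_score(documents: Sequence[str], response: dict) -> List[Tuple[str, float]]:
    """Pair each document with its score and sort best-first."""
    scored = [(documents[item["index"]], item["score"]) for item in response["data"]]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def score(base_url: str, api_key: str, query: str, documents: Sequence[str]):
    """Send the request and return (document, score) pairs, best first."""
    request = score_request(base_url, api_key, query, documents)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return rank_by_score(documents, json.load(resp))
```

Because the scores come back unsorted, this is the place to apply custom logic such as thresholds or blending with other signals before ordering.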
Cohere-Compatible Endpoint (/rerank)
The /rerank endpoint is Cohere-compatible and returns pre-sorted results. This is useful if you’re migrating from Cohere or want ranked results without sorting them yourself.
Install the Cohere client (pip install cohere).
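The endpoint can also be exercised without the Cohere client, which is handy for a quick check. The sketch below uses plain HTTP from the standard library and assumes the Cohere-style body (`query`, `documents`, optional `top_n`) and a response with a `results` list of `{index, relevance_score}` items. The helper names are hypothetical.

```python
import json
import urllib.request
from typing import List, Optional, Sequence, Tuple

MODEL = "BAAI/bge-reranker-base"


def rerank_request(
    base_url: str,
    api_key: str,
    query: str,
    documents: Sequence[str],
    top_n: Optional[int] = None,
    model: str = MODEL,
) -> urllib.request.Request:
    """Build a POST to /rerank; top_n limits how many results come back."""
    body = {"model": model, "query": query, "documents": list(documents)}
    if top_n is not None:
        body["top_n"] = top_n
    return urllib.request.Request(
        f"{base_url}/rerank",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def parse_rerank(documents: Sequence[str], response: dict) -> List[Tuple[str, float]]:
    """Map result indices back to documents, keeping the server's order."""
    return [(documents[r["index"]], r["relevance_score"]) for r in response["results"]]


def rerank(base_url: str, api_key: str, query: str, documents: Sequence[str], top_n=None):
    """Send the request and return (document, relevance score) pairs."""
    request = rerank_request(base_url, api_key, query, documents, top_n)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return parse_rerank(documents, json.load(resp))
```

Unlike the /score sketch, no sorting is done client-side: the order of `results` is taken as the ranking.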
Score Interpretation
| Score Range | Meaning |
|---|---|
| ~1.0 | Highly relevant, direct match |
| 0.1 - 0.5 | Moderately relevant |
| 0.01 - 0.1 | Tangentially related |
| < 0.001 | Irrelevant |
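The table can be turned into a small labeling helper. The ranges above leave gaps (between 0.5 and ~1.0, and between 0.001 and 0.01), so this sketch closes them with cutoffs of 0.5 and 0.01. Those cutoffs are assumptions, not values from the model's documentation, and should be tuned on your own data.

```python
def interpret_score(score: float) -> str:
    """Map a reranker score to a coarse relevance label.

    Cutoffs are illustrative: 0.5 and 0.01 are assumed boundaries that
    close the gaps left by the score interpretation table.
    """
    if score >= 0.5:
        return "highly relevant"
    if score >= 0.1:
        return "moderately relevant"
    if score >= 0.01:
        return "tangentially related"
    return "irrelevant"
```

For example, `interpret_score(0.3)` falls in the 0.1 to 0.5 band and is labeled moderately relevant.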
Use Cases
- RAG Systems: Filter retrieved context before sending it to the LLM
- Semantic Search: Rerank initial retrieval results
- Duplicate Detection: Identify semantically similar content
- Content Recommendation: Match user queries to content
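As an illustration of the RAG use case, the sketch below filters scored passages before they are placed in a prompt. The `select_context` name, the 0.1 threshold, and the three-passage limit are assumptions chosen for the example; the input is a list of (document, score) pairs such as a reranker call might produce.

```python
from typing import Iterable, List, Tuple


def select_context(
    scored_docs: Iterable[Tuple[str, float]],
    min_score: float = 0.1,
    max_docs: int = 3,
) -> List[str]:
    """Keep the best-scoring passages that clear a relevance threshold."""
    kept = [(doc, score) for doc, score in scored_docs if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in kept[:max_docs]]
```

The threshold drops weak passages even when fewer than `max_docs` remain, which keeps loosely related text out of the prompt.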