Serving Rerankers on Vast.ai with vLLM
Rerankers determine relevance between text pairs: matching search queries to documents, evaluating LLM outputs, or finding similar content. They perform detailed comparisons that capture nuanced relationships simple methods miss. This guide covers deploying the BAAI/bge-reranker-base model on Vast.ai using vLLM, with both OpenAI-compatible and Cohere-compatible APIs.
When to Use Rerankers
Embedding models with cosine similarity are fast and cheap: they encode text once and compare vectors. But they compress meaning into fixed-size vectors, losing nuance. Rerankers process each query-document pair together through a cross-encoder, capturing subtle relationships that embeddings miss.

| Approach | Speed | Accuracy | Best For |
|---|---|---|---|
| Embeddings + cosine | Fast | Good | Initial retrieval, large candidate sets |
| Reranker | Slower | Better | Final ranking, top-k refinement |
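The two rows of the table are usually combined: embeddings narrow a large collection down to a handful of candidates, and the reranker orders only those. The sketch below illustrates that flow in plain Python. The function names, the `k` and `top_n` parameters, and the `rerank_score` callable are hypothetical; in practice `rerank_score` would call the deployed reranker rather than a local function.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query: str,
    query_vec: Sequence[float],
    docs: List[str],
    doc_vecs: List[Sequence[float]],
    rerank_score: Callable[[str, str], float],
    k: int = 3,
    top_n: int = 2,
) -> List[str]:
    """Two-stage ranking: cheap vector search first, reranker second."""
    # Stage 1: keep the k documents whose embeddings are closest to the query.
    by_similarity = sorted(
        range(len(docs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_similarity[:k]

    # Stage 2: score only the surviving candidates with the (slower) reranker.
    reranked = sorted(
        candidates,
        key=lambda i: rerank_score(query, docs[i]),
        reverse=True,
    )
    return [docs[i] for i in reranked[:top_n]]
```

Note that a document dropped in stage 1 never reaches the reranker, so `k` should be large enough that relevant documents are unlikely to be cut early.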
Prerequisites
- Vast.ai account with credits
- Vast.ai CLI installed (pip install vastai)
Hardware Requirements
The BAAI/bge-reranker-base model (~278M parameters) has modest requirements:
- GPU RAM: 16GB (8GB may work for lower throughput)
- GPU: Single GPU, Turing architecture or newer
- Network: Static IP and at least one direct port
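These requirements can be written as a filter string for the Vast.ai offer search. The sketch below only builds the command text and does not run it. The field names (`compute_cap`, `gpu_ram`, `num_gpus`, `static_ip`, `direct_port_count`) are assumptions based on the CLI's search help, where `compute_cap` is the CUDA compute capability times 100 (Turing is 7.5, hence 750); confirm them with `vastai search offers --help` before relying on them. The helper names are hypothetical.

```python
import shlex


def build_offer_query(
    min_gpu_ram_gb: int = 16,
    min_compute_cap: int = 750,  # assumed encoding: compute capability 7.5 * 100
    num_gpus: int = 1,
) -> str:
    """Translate the hardware requirements into a search filter string."""
    filters = [
        f"compute_cap >= {min_compute_cap}",  # Turing or newer
        f"gpu_ram >= {min_gpu_ram_gb}",       # GPU RAM in GB
        f"num_gpus = {num_gpus}",             # single GPU
        "static_ip = true",                   # static IP
        "direct_port_count >= 1",             # at least one direct port
    ]
    return " ".join(filters)


def build_search_command(query: str) -> str:
    """Return a shell-safe command line for the Vast.ai CLI."""
    return shlex.join(["vastai", "search", "offers", query])
```

Calling `build_search_command(build_offer_query(min_gpu_ram_gb=8))` produces the variant for the lower-throughput 8GB case.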
Setting Up the CLI
Install the Vast.ai CLI with pip install vastai, then configure it with your account API key using vastai set api-key.
Finding an Instance
Search for instances that meet the hardware requirements with vastai search offers.
Deploying the Server
First, generate a secure API key to protect your endpoint.
Verifying the Deployment
- Go to Instances in the Vast.ai console
- Wait for the image and model to download
- Find your instance’s IP and external port from “Open Ports” (format: XX.XX.XXX.XX:YYYY -> 8000/tcp)
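Once the port mapping is visible, a quick check confirms that the server is answering. The sketch below turns an "Open Ports" entry into a base URL and lists the models the server reports through vLLM's `/v1/models` route. The helper names are hypothetical, and it assumes the server is reachable over plain HTTP and was started with the API key you generated.

```python
import json
import re
import urllib.request
from typing import List


def base_url_from_open_port(mapping: str, container_port: int = 8000) -> str:
    """Convert 'IP:EXTERNAL_PORT -> 8000/tcp' into 'http://IP:EXTERNAL_PORT'."""
    match = re.fullmatch(r"\s*([\d.]+):(\d+)\s*->\s*(\d+)/tcp\s*", mapping)
    if match is None or int(match.group(3)) != container_port:
        raise ValueError(f"not a mapping for container port {container_port}: {mapping!r}")
    host, external_port = match.group(1), match.group(2)
    return f"http://{host}:{external_port}"


def list_models(base_url: str, api_key: str, timeout: float = 10.0) -> List[str]:
    """Return the model ids the server reports; raises if it is unreachable."""
    request = urllib.request.Request(
        f"{base_url}/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return [model["id"] for model in payload["data"]]
```

If the deployment succeeded, `list_models(...)` should include BAAI/bge-reranker-base. A connection error usually means the image or model is still downloading.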
Using the Reranker
vLLM provides two API endpoints:

| Endpoint | API Style | Use Case |
|---|---|---|
| /score | OpenAI | Raw scores for custom ranking logic |
| /rerank | Cohere | Pre-sorted results for quick integration |
OpenAI-Compatible Endpoint (/score)
The /score endpoint returns raw relevance scores for each query-document pair. This gives you full control over ranking logic.
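A minimal client for this endpoint might look like the sketch below, which uses only the standard library. It assumes the request body takes `text_1` (the query) and `text_2` (the list of documents) and that the response carries a `data` list of objects with `index` and `score` fields, as in the vLLM versions I am aware of; check the API reference for your version. The function names are hypothetical.

```python
import json
import urllib.request
from typing import Dict, List, Sequence, Tuple

MODEL = "BAAI/bge-reranker-base"


def build_score_payload(query: str, documents: Sequence[str], model: str = MODEL) -> Dict:
    """Request body for /score: one query scored against many documents."""
    return {"model": model, "text_1": query, "text_2": list(documents)}


def post_json(url: str, payload: Dict, api_key: str, timeout: float = 30.0) -> Dict:
    """POST a JSON body with a bearer token and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def rank_documents(documents: Sequence[str], score_response: Dict) -> List[Tuple[str, float]]:
    """Pair each document with its score and sort from most to least relevant."""
    scored = [(documents[item["index"]], item["score"]) for item in score_response["data"]]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With a base URL of the form http://IP:PORT, the call would be `rank_documents(docs, post_json(base_url + "/score", build_score_payload(query, docs), api_key))`. Because the scores come back unsorted, the sorting step is where custom ranking logic such as thresholds or tie-breaking can be applied.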
Cohere-Compatible Endpoint (/rerank)
The /rerank endpoint is Cohere-compatible and returns pre-sorted results. This is useful if you’re migrating from Cohere or want ranked results without sorting them yourself.
Install the Cohere client (pip install cohere).
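With the client installed, a call through it might look like the following sketch. It assumes the `cohere.ClientV2` constructor accepts a `base_url` pointing at the vLLM server and that each result exposes `index` and `relevance_score`; the wrapper names `make_client` and `rerank` are hypothetical.

```python
from typing import Any, List, Optional, Sequence, Tuple

MODEL = "BAAI/bge-reranker-base"


def make_client(base_url: str, api_key: str) -> Any:
    """Create a Cohere client that talks to the vLLM server instead of Cohere."""
    import cohere  # imported lazily; requires `pip install cohere`

    return cohere.ClientV2(api_key, base_url=base_url)


def rerank(
    client: Any,
    query: str,
    documents: Sequence[str],
    top_n: Optional[int] = None,
    model: str = MODEL,
) -> List[Tuple[str, float]]:
    """Return (document, relevance_score) pairs in the order the server sent them."""
    kwargs = {"model": model, "query": query, "documents": list(documents)}
    if top_n is not None:
        # Only send top_n when the caller asked for a cut-off.
        kwargs["top_n"] = top_n
    response = client.rerank(**kwargs)
    return [(documents[result.index], result.relevance_score) for result in response.results]
```

Usage would be `rerank(make_client(base_url, api_key), query, docs, top_n=3)`. The wrapper does not sort, since the endpoint already returns results ordered by relevance.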
Score Interpretation
| Score Range | Meaning |
|---|---|
| ~1.0 | Highly relevant, direct match |
| 0.1 - 0.5 | Moderately relevant |
| 0.01 - 0.1 | Tangentially related |
| < 0.001 | Irrelevant |
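These bands can be turned into a simple cut-off when deciding which documents to keep. The sketch below drops everything under a threshold and orders the rest; the default of 0.1 is taken from the lower edge of the "moderately relevant" band and is an assumption to tune on your own data, and the function name is hypothetical.

```python
from typing import Iterable, List, Tuple


def filter_by_score(
    scored_docs: Iterable[Tuple[str, float]],
    min_score: float = 0.1,  # assumed cut-off: lower edge of "moderately relevant"
) -> List[Tuple[str, float]]:
    """Keep documents scoring at least min_score, most relevant first."""
    kept = [(doc, score) for doc, score in scored_docs if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

For example, with scores of 0.97, 0.3, 0.05, and 0.0004, the default threshold keeps only the first two documents.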
Use Cases
- RAG Systems: Filter retrieved context before sending it to the LLM
- Semantic Search: Rerank initial retrieval results
- Duplicate Detection: Identify semantically similar content
- Content Recommendation: Match user queries to content