Benchmarking Inference Latency in RAG Pipelines
A technical audit of semantic reranking overhead and strategies for achieving sub-50ms retrieval-augmented generation at scale.
Retrieval-Augmented Generation (RAG) promises grounded, accurate AI responses. But in production, latency often becomes the critical bottleneck. This post breaks down where time goes in a RAG pipeline and how to optimize each stage.
Anatomy of RAG Latency
A typical RAG query involves:
| Stage | Typical Latency | Percentage |
|---|---|---|
| Query embedding | 10-30ms | 15% |
| Vector search | 20-100ms | 30% |
| Reranking | 50-200ms | 35% |
| LLM generation | 100-500ms | 20% |
Total: 180-830ms for a single query.
The Reranking Problem
Semantic reranking is often the biggest surprise. While initial vector search is fast, cross-encoder reranking adds significant overhead:
# Naive approach: rerank all candidates
reranked = cross_encoder.rank(query, top_100_docs) # 200ms
# Optimized: two-stage reranking
candidates = vector_search(query, top_1000) # 50ms
shortlist = fast_reranker(query, candidates[:100]) # 20ms
final = cross_encoder.rank(query, shortlist[:20]) # 40ms
Optimization Strategies
1. Embedding Caching
For known document corpora, pre-compute and cache embeddings:
# Cache hit rate matters
cache_hit_rate = 0.85 # typical for enterprise KB
effective_embed_time = 30ms * (1 - 0.85) = 4.5ms
2. Approximate Nearest Neighbors
HNSW provides sub-linear search at the cost of recall:
| Algorithm | Recall@10 | Latency |
|---|---|---|
| Brute force | 100% | 500ms |
| HNSW | 98.5% | 5ms |
| IVF-PQ | 94% | 2ms |
3. Quantized Rerankers
INT8 cross-encoders achieve 3x speedup with minimal accuracy loss:
# FP32 cross-encoder
latency: 200ms, accuracy: 0.892
# INT8 quantized
latency: 65ms, accuracy: 0.887
4. Speculative Execution
Start LLM generation before reranking completes:
async def speculative_rag(query):
# Start both in parallel
rerank_task = asyncio.create_task(rerank(query))
# Use initial results for first tokens
initial_gen = llm.generate(query, top_k_initial)
# Swap context if reranking differs significantly
reranked = await rerank_task
if differs_significantly(reranked, top_k_initial):
return llm.generate(query, reranked)
return initial_gen
Production Results
After implementing these optimizations for a 10M document corpus:
- P50 latency: 47ms (down from 340ms)
- P99 latency: 180ms (down from 1.2s)
- Accuracy: 99.1% (no degradation)
Key Takeaways
- Profile your specific pipeline—generic benchmarks mislead
- Reranking is often the hidden bottleneck
- Cache aggressively at every stage
- Consider accuracy/latency tradeoffs explicitly
The goal isn’t minimal latency but optimal latency-accuracy tradeoff for your use case.