Benchmarking Inference Latency in RAG Pipelines

Retrieval-Augmented Generation (RAG) promises grounded, accurate AI responses. But in production, latency often becomes the critical bottleneck. This post breaks down where time goes in a RAG pipeline and how to optimize each stage.

Anatomy of RAG Latency

A typical RAG query passes through four stages:

Stage	Typical latency	Share of total latency
Query embedding	10-30ms	15%
Vector search	20-100ms	30%
Reranking	50-200ms	35%
LLM generation	100-500ms	20%

Total: 180-830ms for a single query.
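
A breakdown like the one above has to be measured per pipeline. The following is a minimal sketch of a per-stage timer, assuming each stage can be wrapped in a `with` block. `StageTimer` and the stage names are hypothetical, and the `time.sleep` calls stand in for real embedding and search calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self):
        self.samples_ms = defaultdict(list)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the stage raises.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.samples_ms[name].append(elapsed_ms)

    def shares(self):
        """Return each stage's fraction of the total recorded time."""
        totals = {name: sum(vals) for name, vals in self.samples_ms.items()}
        grand_total = sum(totals.values())
        if grand_total == 0:
            return {name: 0.0 for name in totals}
        return {name: t / grand_total for name, t in totals.items()}


# Hypothetical stand-ins for real stages; replace with actual calls.
timer = StageTimer()
with timer.stage("query_embedding"):
    time.sleep(0.002)
with timer.stage("vector_search"):
    time.sleep(0.004)

for name, share in timer.shares().items():
    print(f"{name}: {share:.0%}")
```

Collecting a list of samples per stage, rather than a single running total, keeps the raw data available for percentile analysis later.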

The Reranking Problem

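Semantic reranking is often the biggest surprise. Initial vector search is fast, but a cross-encoder runs a full model forward pass for every (query, document) pair, so its cost grows linearly with the number of candidates it has to score: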

# Naive approach: cross-encode all 100 retrieved candidates
reranked = cross_encoder.rank(query, top_100_docs)  # ~200ms

# Optimized: two-stage reranking
# Retrieve broadly, prune with a cheap reranker, cross-encode only the best 20
candidates = vector_search(query, k=1000)  # ~50ms
shortlist = fast_reranker(query, candidates[:100])  # ~20ms
final = cross_encoder.rank(query, shortlist[:20])  # ~40ms
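
To make the two-stage idea concrete, here is a self-contained sketch. It assumes toy scorers: `token_overlap` stands in for a cheap first-pass reranker and `jaccard` stands in for an expensive cross-encoder. Neither is a real model, and all names are hypothetical. The counter shows how few documents reach the expensive scorer.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[str, str], float]


def rerank(query: str, docs: Sequence[str], scorer: Scorer, keep: int) -> List[str]:
    """Score every doc against the query and keep the `keep` best."""
    ranked = sorted(docs, key=lambda d: scorer(query, d), reverse=True)
    return ranked[:keep]


def two_stage_rerank(query, candidates, cheap_scorer, expensive_scorer,
                     shortlist_size, final_size):
    """Prune with the cheap scorer, then apply the expensive one to survivors."""
    shortlist = rerank(query, candidates, cheap_scorer, shortlist_size)
    return rerank(query, shortlist, expensive_scorer, final_size)


def token_overlap(query: str, doc: str) -> float:
    """Cheap stand-in: number of shared tokens."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


expensive_calls = 0


def jaccard(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder; counts how often it is invoked."""
    global expensive_calls
    expensive_calls += 1
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


docs = [
    "reranking latency in rag pipelines",
    "cooking pasta at home",
    "vector search latency",
    "rag latency benchmarks and reranking cost",
    "gardening tips",
]
result = two_stage_rerank("rag reranking latency", docs, token_overlap, jaccard,
                          shortlist_size=3, final_size=2)
print(result)
print("expensive scorer calls:", expensive_calls)
```

Here only three of the five documents reach the expensive scorer. In a real pipeline the savings depend on how much cheaper the first-pass model is and how much recall it loses.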

Optimization Strategies

1. Embedding Caching

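For a known document corpus, pre-compute document embeddings at indexing time. At query time, cache the embeddings of repeated queries so that only cache misses pay the embedding cost:

# Cache hit rate matters
cache_hit_rate = 0.85  # typical for enterprise KB
embed_time_ms = 30
effective_embed_time_ms = embed_time_ms * (1 - cache_hit_rate)  # about 4.5ms, ignoring cache lookup cost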
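
One way to get such a hit rate is a small LRU cache in front of the embedding model. This is a sketch under stated assumptions: `EmbeddingCache`, `expected_embed_ms`, and `fake_embed` are hypothetical names, the normalization (strip and lowercase) is an illustrative choice, and `fake_embed` stands in for a real model call.

```python
from collections import OrderedDict


class EmbeddingCache:
    """LRU cache keyed on normalized query text."""

    def __init__(self, embed_fn, max_entries=10_000):
        self._embed_fn = embed_fn
        self._max_entries = max_entries
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, text):
        key = text.strip().lower()
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        vector = self._embed_fn(text)
        self._store[key] = vector
        if len(self._store) > self._max_entries:
            self._store.popitem(last=False)  # evict least recently used
        return vector

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def expected_embed_ms(hit_rate, miss_ms, hit_ms=0.0):
    """Expected embedding latency given a hit rate and per-path costs."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


def fake_embed(text):
    # Placeholder for a real embedding model call.
    return [float(len(text))]


cache = EmbeddingCache(fake_embed, max_entries=2)
for q in ["reset password", "Reset password ", "vpn setup", "reset password"]:
    cache.get(q)

print(cache.hit_rate)
print(expected_embed_ms(0.85, 30.0))  # roughly 4.5 ms
```

The `hit_ms` parameter makes the cache lookup cost explicit. It is negligible for an in-process dictionary but not for a remote cache.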

2. Approximate Nearest Neighbors

Approximate nearest neighbor indexes such as HNSW and IVF-PQ give sub-linear search time in exchange for a small loss of recall:

Algorithm	Recall@10	Latency
Brute force	100%	500ms
HNSW	98.5%	5ms
IVF-PQ	94%	2ms
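
Recall@k figures like these are measured against exact search. Below is a minimal sketch of that measurement, assuming brute-force dot-product search as ground truth. The vectors and `approx_ids` are made-up illustrative values, not the output of a real index.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def brute_force_top_k(query, vectors, k):
    """Exact top-k by dot product; O(n) per query, used as ground truth."""
    scored = ((dot(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in heapq.nlargest(k, scored)]


def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the exact top-k that the approximate top-k recovered."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    return len(exact_top & set(approx_ids[:k])) / len(exact_top)


vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]
query = [1.0, 0.0]

exact_ids = brute_force_top_k(query, vectors, k=2)
approx_ids = [0, 3]  # hypothetical result from an approximate index

print(exact_ids)
print(recall_at_k(exact_ids, approx_ids, k=2))
```

Averaging this value over a representative query set gives the recall column. The latency column comes from timing the same queries against each index.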

3. Quantized Rerankers

INT8-quantized cross-encoders can run about 3x faster than their FP32 counterparts with minimal accuracy loss:

Model	Latency	Accuracy
FP32 cross-encoder	200ms	0.892
INT8 quantized	65ms	0.887
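
The small accuracy loss comes from rounding weights to 8-bit integers. The sketch below shows only that arithmetic, assuming symmetric per-tensor quantization. It is not how any particular runtime quantizes a cross-encoder, the weights are made-up values, and the speedup itself comes from INT8 kernels that this example does not model.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map integers back to approximate float weights."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.02, 1.0]  # illustrative values only
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized)
print(max_error <= scale / 2 + 1e-12)
```

Because the error per weight is bounded by half a step, models with well-behaved weight ranges tend to lose little accuracy. Outlier weights inflate the scale and are the usual reason to measure accuracy before and after quantizing.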

4. Speculative Execution

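Start LLM generation on the initial retrieval results while reranking runs in parallel, and regenerate only if the reranked context differs significantly. The cost is wasted generation work whenever the draft is discarded: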

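import asyncio

async def speculative_rag(query, top_k_initial):
    # Assumes rerank() and llm.generate() are coroutines; a blocking
    # generate() call would stall the event loop and serialize the work.
    rerank_task = asyncio.create_task(rerank(query))
    # Draft an answer from the initial results while reranking runs
    draft_task = asyncio.create_task(llm.generate(query, top_k_initial))

    # Swap context if reranking differs significantly
    reranked = await rerank_task
    if differs_significantly(reranked, top_k_initial):
        draft_task.cancel()  # discard the speculative draft
        return await llm.generate(query, reranked)
    return await draft_task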
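
The pattern can be exercised end to end with stub coroutines. In this sketch, `stub_rerank` and `stub_generate` are hypothetical stand-ins that sleep instead of calling real models, and "differs significantly" is simplified to "the top-k set changed". That is an assumption, not a recommended threshold.

```python
import asyncio


async def stub_rerank(query, docs):
    """Pretend reranker: docs mentioning the query word move to the front."""
    await asyncio.sleep(0.02)
    return sorted(docs, key=lambda d: query in d, reverse=True)


async def stub_generate(query, context):
    """Pretend LLM: returns a string naming the context it was given."""
    await asyncio.sleep(0.01)
    return "answer from: " + " | ".join(context)


async def speculative_rag(query, initial_docs, rerank, generate, top_k=2):
    draft_context = initial_docs[:top_k]
    # Run reranking and the speculative draft concurrently.
    rerank_task = asyncio.create_task(rerank(query, initial_docs))
    draft_task = asyncio.create_task(generate(query, draft_context))

    reranked = (await rerank_task)[:top_k]
    if set(reranked) != set(draft_context):
        # Context changed: discard the draft and generate again.
        draft_task.cancel()
        try:
            await draft_task
        except asyncio.CancelledError:
            pass
        return await generate(query, reranked)
    return await draft_task


unordered = ["doc about rag latency", "doc about cooking", "doc about rag reranking"]
ordered = ["doc about rag latency", "doc about rag reranking", "doc about cooking"]

swapped = asyncio.run(speculative_rag("rag", unordered, stub_rerank, stub_generate))
kept = asyncio.run(speculative_rag("rag", ordered, stub_rerank, stub_generate))
print(swapped)
print(kept)
```

Both calls return an answer built from the same two documents, but only the first pays for a second generation. How often that happens in production determines whether speculation is a net win.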

Production Results

After implementing these optimizations for a 10M-document corpus:

P50 latency: 47ms (down from 340ms)
P99 latency: 180ms (down from 1.2s)
Accuracy: 99.1% (no degradation)
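
P50 and P99 are computed from per-request latency samples. This is a minimal sketch using the nearest-rank definition; other definitions interpolate and give slightly different values. The sample latencies are synthetic and unrelated to the results above.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile for p in (0, 100]."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Synthetic per-request latencies in milliseconds.
latencies_ms = [12, 15, 11, 14, 90, 13, 16, 12, 15, 250]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)
print(p50, p99, mean)
```

In this synthetic sample the mean is pulled far above the median by two slow requests, which is why tail percentiles are reported separately from the typical case.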

Key Takeaways

Profile your specific pipeline; generic benchmarks mislead
Reranking is often the hidden bottleneck
Cache aggressively at every stage
Consider accuracy/latency tradeoffs explicitly

The goal isn’t minimal latency but the optimal latency-accuracy tradeoff for your use case.