engineering [May 15, 2024]

Benchmarking Inference Latency in RAG Pipelines

A technical audit of semantic reranking overhead and strategies for achieving sub-50ms retrieval-augmented generation at scale.

By Infrastructure Lead / 2 min read

Retrieval-Augmented Generation (RAG) promises grounded, accurate AI responses. But in production, latency often becomes the critical bottleneck. This post breaks down where time goes in a RAG pipeline and how to optimize each stage.

Anatomy of RAG Latency

A typical RAG query involves:

Stage            Typical Latency  Share of Total
Query embedding  10-30ms          15%
Vector search    20-100ms         30%
Reranking        50-200ms         35%
LLM generation   100-500ms        20%

Total: 180-830ms for a single query.
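
To make the arithmetic behind that total explicit, here is a small sketch that sums the per-stage ranges. It assumes the four stages run strictly one after another, which is why the best and worst cases simply add up; the dictionary keys are illustrative names, not identifiers from any particular framework.

```python
# Per-stage latency ranges in milliseconds, copied from the table above.
STAGE_LATENCY_MS = {
    "query_embedding": (10, 30),
    "vector_search": (20, 100),
    "reranking": (50, 200),
    "llm_generation": (100, 500),
}


def total_latency_ms(stages):
    """Return (best_case, worst_case) latency for stages that run sequentially."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best, worst = total_latency_ms(STAGE_LATENCY_MS)
print(f"Total: {best}-{worst}ms")  # Total: 180-830ms
```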

The Reranking Problem

Semantic reranking is often the biggest surprise. While initial vector search is fast, cross-encoder reranking adds significant overhead:

# Naive approach: rerank all candidates
reranked = cross_encoder.rank(query, top_100_docs)  # 200ms

# Optimized: two-stage reranking
candidates = vector_search(query, top_k=1000)  # 50ms
shortlist = fast_reranker(query, candidates[:100])  # 20ms
final = cross_encoder.rank(query, shortlist[:20])  # 40ms
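
The cascade idea can be sketched in a runnable form with stand-in scorers. Everything here is an assumption for illustration: `token_overlap` and `overlap_ratio` are toy functions playing the roles of the fast reranker and the cross-encoder, and `shortlist_size` and `final_size` are hypothetical parameter names. The point is only the shape: the cheap scorer sees every candidate, the expensive scorer sees the shortlist.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def rerank(query: str, docs: List[str], score: Scorer, keep: int) -> List[str]:
    """Sort docs by descending score against the query and keep the top `keep`."""
    ranked = sorted(docs, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:keep]


def two_stage_rerank(query, candidates, cheap_score, expensive_score,
                     shortlist_size=20, final_size=5):
    """Run the cheap scorer over all candidates, the expensive one over the shortlist."""
    shortlist = rerank(query, candidates, cheap_score, shortlist_size)
    return rerank(query, shortlist, expensive_score, final_size)


# Hypothetical stand-in scorers; a real pipeline would call models here.
def token_overlap(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def overlap_ratio(query: str, doc: str) -> float:
    doc_tokens = doc.lower().split()
    return token_overlap(query, doc) / max(len(doc_tokens), 1)


docs = [
    "reranking latency in rag pipelines",
    "cooking pasta at home",
    "rag latency",
    "vector search basics",
]
top = two_stage_rerank("rag latency", docs, token_overlap, overlap_ratio,
                       shortlist_size=2, final_size=1)
print(top)  # ['rag latency']
```

The expensive scorer runs twice here instead of four times. With real models the saving grows with the ratio between candidate count and shortlist size.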

Optimization Strategies

1. Embedding Caching

For known document corpora, pre-compute document embeddings at index time, and cache query embeddings so that repeated queries skip the embedding model:

# Cache hit rate matters
cache_hit_rate = 0.85  # typical for enterprise KB
effective_embed_time_ms = 30 * (1 - cache_hit_rate)  # 4.5ms, assuming cache hits are effectively free
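
A minimal sketch of such a cache, using Python's standard `functools.lru_cache`. The `embed` function is a hypothetical stand-in for a real embedding model call, and `maxsize=10_000` is an arbitrary choice. The `hit_ms` parameter is an addition not in the formula above; it lets you account for a cache lookup that is not free (for example, a remote cache).

```python
from functools import lru_cache

embed_calls = 0  # counts how often the stand-in model is actually invoked


def embed(text: str) -> tuple:
    """Hypothetical stand-in for an embedding model call."""
    global embed_calls
    embed_calls += 1
    return tuple(float(ord(ch)) for ch in text[:4])


@lru_cache(maxsize=10_000)
def cached_embed(text: str) -> tuple:
    """Return a cached embedding, calling the model only on a miss."""
    return embed(text)


def effective_embed_ms(miss_ms: float, hit_rate: float, hit_ms: float = 0.0) -> float:
    """Expected embedding latency for a given cache hit rate."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


for q in ["reset password", "reset password", "vpn setup", "reset password"]:
    cached_embed(q)

info = cached_embed.cache_info()
print(info.hits, info.misses)  # 2 2
print(round(effective_embed_ms(30, 0.85), 2))  # 4.5
```

Embeddings are returned as tuples so that a cached value cannot be mutated by one caller and then served to another.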

2. Approximate Nearest Neighbors

Approximate indexes such as HNSW and IVF-PQ provide sub-linear search at the cost of some recall:

Algorithm    Recall@10  Latency
Brute force  100%       500ms
HNSW         98.5%      5ms
IVF-PQ       94%        2ms
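
For readers who want to reproduce a Recall@10 column on their own index, one common way to measure it is to treat brute-force results as ground truth and count how many of them the approximate search returns. The sketch below assumes results are lists of document IDs ordered by similarity; the function names are illustrative.

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the exact top-k neighbors that the approximate search also returned."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    found = exact_top & set(approx_ids[:k])
    return len(found) / len(exact_top)


def mean_recall_at_k(approx_results, exact_results, k=10):
    """Average recall@k over a batch of queries."""
    scores = [recall_at_k(a, e, k) for a, e in zip(approx_results, exact_results)]
    return sum(scores) / len(scores)


exact = list(range(1, 11))          # ground truth from brute-force search
approx = list(range(1, 10)) + [42]  # approximate search missed one neighbor
print(recall_at_k(approx, exact))   # 0.9
```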

3. Quantized Rerankers

INT8 cross-encoders achieve a roughly 3x speedup with minimal accuracy loss:

# FP32 cross-encoder
latency: 200ms, accuracy: 0.892

# INT8 quantized
latency: 65ms, accuracy: 0.887
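
To show where the small accuracy loss comes from, here is a toy sketch of symmetric per-tensor INT8 quantization on a plain list of weights. This is not how a production toolkit quantizes a cross-encoder (real tools also handle activations, calibration, and per-channel scales); it only illustrates the rounding step, and the function names are assumptions.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each weight is off by at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, max_error <= scale / 2 + 1e-12)
```

Small weights such as 0.003 collapse to zero, which is one source of the accuracy drop a quantized model can show.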

4. Speculative Execution

Start LLM generation before reranking completes:

import asyncio

async def speculative_rag(query, top_k_initial):
    # Start both in parallel (rerank and llm.generate are assumed to be coroutines)
    rerank_task = asyncio.create_task(rerank(query, top_k_initial))
    # Use initial results for first tokens
    gen_task = asyncio.create_task(llm.generate(query, top_k_initial))

    # Swap context if reranking differs significantly
    reranked = await rerank_task
    if differs_significantly(reranked, top_k_initial):
        gen_task.cancel()  # discard the speculative generation
        return await llm.generate(query, reranked)
    return await gen_task
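
The snippet above leaves `rerank`, `llm.generate`, and `differs_significantly` undefined. As a minimal runnable sketch, the version below fills them with stand-ins: `fake_rerank` and `fake_generate` simulate latency with `asyncio.sleep`, and `differs_significantly` is one possible definition based on the Jaccard overlap of the two document sets. The `min_overlap=0.6` threshold is an arbitrary assumption, not a recommended value.

```python
import asyncio


def differs_significantly(reranked, initial, min_overlap=0.6):
    """True when the two top-k lists share less than `min_overlap` of their documents."""
    a, b = set(reranked), set(initial)
    if not a and not b:
        return False
    return len(a & b) / len(a | b) < min_overlap


async def fake_rerank(query, docs):
    await asyncio.sleep(0.02)  # simulated reranker latency
    return sorted(docs)  # stand-in ordering that keeps the same documents


async def fake_generate(query, docs):
    await asyncio.sleep(0.03)  # simulated LLM latency
    return f"answer({query}) from {list(docs)}"


async def speculative_rag(query, initial_docs, rerank=fake_rerank, generate=fake_generate):
    # Both tasks start immediately and run concurrently on the event loop.
    rerank_task = asyncio.create_task(rerank(query, initial_docs))
    gen_task = asyncio.create_task(generate(query, initial_docs))
    reranked = await rerank_task
    if differs_significantly(reranked, initial_docs):
        gen_task.cancel()  # the speculative work is wasted in this branch
        return await generate(query, reranked)
    return await gen_task


result = asyncio.run(speculative_rag("rag latency", ["d3", "d1", "d2"]))
print(result)  # answer(rag latency) from ['d3', 'd1', 'd2']
```

A set-based overlap ignores pure reordering, so this check only triggers a regeneration when the reranker changes which documents are in the context. Whether order changes should also count depends on how sensitive your LLM is to context order.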

Production Results

After implementing these optimizations for a 10M document corpus:

  • P50 latency: 47ms (down from 340ms)
  • P99 latency: 180ms (down from 1.2s)
  • Accuracy: 99.1% (no degradation)
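
For reference, P50 and P99 figures like these are percentiles over per-request latency samples. The sketch below uses the nearest-rank definition, which is one of several conventions; monitoring tools may interpolate instead. The sample values are made up for illustration and are not the measurements reported above.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


latencies_ms = [42, 45, 47, 51, 39, 180, 44, 48, 46, 43]  # made-up samples
print(percentile(latencies_ms, 50))  # 45
print(percentile(latencies_ms, 99))  # 180
```

A single slow request dominates P99 in a small sample, which is why tail percentiles need far more samples than the median to be stable.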

Key Takeaways

  1. Profile your specific pipeline; generic benchmarks mislead
  2. Reranking is often the hidden bottleneck
  3. Cache aggressively at every stage
  4. Consider accuracy/latency tradeoffs explicitly
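
As a starting point for the first takeaway, per-stage profiling can be as simple as a timing context manager wrapped around each stage. This is a sketch using only the standard library; the stage names and the `time.sleep` calls are placeholders for your real pipeline calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_timings_ms = defaultdict(list)


@contextmanager
def timed(stage):
    """Record wall-clock time spent inside the block under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_timings_ms[stage].append((time.perf_counter() - start) * 1000)


with timed("vector_search"):
    time.sleep(0.01)  # stand-in for the real vector search call
with timed("reranking"):
    time.sleep(0.02)  # stand-in for the real reranker call

for stage, samples in stage_timings_ms.items():
    print(f"{stage}: {sum(samples) / len(samples):.1f}ms avg over {len(samples)} call(s)")
```

Because timings are appended in a `finally` block, a stage that raises an exception is still recorded, which helps when slow requests are also the ones that fail.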

The goal isn’t minimal latency but the optimal latency-accuracy tradeoff for your use case.