Deterministic AI: Building Predictable Systems
How to apply software engineering rigor (unit testing, versioning, and drift detection) to non-deterministic large language models.
Large language models are non-deterministic in practice: the same input can produce different outputs. This terrifies traditional software engineers, and for good reason. But we can apply proven engineering practices to make AI systems predictable enough for enterprise deployment.
The Determinism Spectrum
Not all non-determinism is equal:
| Source | Controllable? | Strategy |
|---|---|---|
| Temperature | Yes | Set to 0 |
| Sampling | Yes | Use greedy decoding |
| Context window | Partially | Standardize preprocessing |
| Model updates | No | Version lock + testing |
| Prompt interpretation | No | Evaluation suites |
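As a rough illustration of the first two rows, the sketch below shows why a temperature of 0 is usually treated as greedy decoding. `sample_token` is a hypothetical helper written for this example, not part of any model API, and the logits are made-up values. Real inference stacks can still show small run-to-run differences with these settings, so treat this as a picture of the idea rather than a guarantee.

```python
import math
import random

def sample_token(logits, temperature, rng=random):
    """Pick a token index from raw logits (hypothetical helper)."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Scale by temperature, then sample from the softmax distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.5, 0.3]  # made-up scores for three candidate tokens
rng = random.Random(0)
greedy = {sample_token(logits, 0) for _ in range(200)}
sampled = {sample_token(logits, 1.0, rng) for _ in range(200)}
print(greedy)        # {0}: the top-scoring index every time
print(len(sampled))  # more than one distinct index is expected
```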
Testing Strategies
Unit Tests for Prompts
def test_classification_prompt():
    prompt = ClassificationPrompt()

    # Test known categories
    assert prompt.classify("refund request") == "billing"
    assert prompt.classify("can't login") == "technical"

    # Test edge cases
    assert prompt.classify("") == "unknown"
    assert prompt.classify("asdfghjkl") == "unknown"
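A single passing run says little about a non-deterministic system, so one option is to repeat each call and assert on how often the runs agree. The sketch below is illustrative only: `agreement_rate` and `fake_classify` are hypothetical names, and the stub stands in for a real model-backed classifier.

```python
from collections import Counter

def agreement_rate(classify, text, runs=5):
    """Return the most common label and the fraction of runs that produced it."""
    labels = [classify(text) for _ in range(runs)]
    label, count = Counter(labels).most_common(1)[0]
    return label, count / runs

# Stub standing in for a real model call (assumption for this example).
def fake_classify(text):
    return "billing" if "refund" in text else "unknown"

label, rate = agreement_rate(fake_classify, "refund request")
print(label, rate)  # billing 1.0
```

With a real model, a test might require the expected label and a rate above some tolerance chosen for the use case, instead of exact equality on one run.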
Regression Tests
Track outputs across model versions:
import numpy as np

class RegressionSuite:
    def __init__(self, golden_set):
        self.golden = golden_set  # 1000 input-output pairs

    def evaluate(self, model):
        # similarity() is assumed to return a score in [0, 1]
        results = []
        for prompt, expected in self.golden:  # avoid shadowing the built-in input()
            actual = model(prompt)
            results.append(similarity(actual, expected))
        return {
            'mean_similarity': np.mean(results),
            # Pairs scoring below 0.8 count as regressions
            'regression_count': sum(r < 0.8 for r in results)
        }
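The suite calls `similarity`, which is not defined in this article. As a minimal stand-in, assuming plain string outputs, the standard library's `difflib.SequenceMatcher` gives a lexical score between 0 and 1. This is a swapped-in technique for illustration: it does not recognize paraphrases, so an embedding-based comparison may suit free-form text better, and the 0.8 cutoff would need recalibrating for whichever measure is used.

```python
from difflib import SequenceMatcher

def similarity(actual: str, expected: str) -> float:
    """Lexical similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, actual, expected).ratio()

same = similarity("Your refund was issued.", "Your refund was issued.")
close = similarity("Your refund was issued.", "Your refund has been issued.")
print(same)   # 1.0
print(close)  # below 1.0, because the wording differs
```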
Behavioral Tests
Test for behaviors, not exact outputs:
def test_refusal_behavior():
    """Model should refuse inappropriate requests"""
    harmful_prompts = load_test_set('harmful_prompts')
    for prompt in harmful_prompts:
        response = model(prompt)
        assert contains_refusal(response)
        assert not contains_harmful_content(response)
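The behavioral test relies on `contains_refusal`, which is left undefined here. One very simple way to sketch it is a phrase match, shown below. The marker list is an assumption made for this example and is far from exhaustive; a production check would more plausibly use a trained classifier or a second model as a judge.

```python
# Hypothetical marker phrases; extend or replace them for a real system.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm unable", "i am unable")

def contains_refusal(response: str) -> bool:
    """Naive phrase match for refusal language."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)

print(contains_refusal("I can\u2019t help with that request."))  # True
print(contains_refusal("Sure, here is the summary."))            # False
```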
Versioning LLM Systems
Configuration as Code
# prompt_config.v2.3.yaml
model: gpt-4-0125-preview
temperature: 0
max_tokens: 2048
system_prompt: |
  You are a customer service agent...
preprocessing:
  - normalize_whitespace
  - truncate_to: 4000
postprocessing:
  - extract_json
  - validate_schema: response_v2
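A version number in a filename is easy to forget to bump. One complementary option, sketched below under the assumption that the config has already been loaded into a plain dict, is to derive a fingerprint from the content and log it with every response. `config_fingerprint` is a hypothetical helper, and the dict mirrors only a few of the fields above.

```python
import hashlib
import json

def config_fingerprint(config: dict) -> str:
    """Hash a config dict in canonical form so any change yields a new ID."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

base = {"model": "gpt-4-0125-preview", "temperature": 0, "max_tokens": 2048}
reordered = {"max_tokens": 2048, "temperature": 0, "model": "gpt-4-0125-preview"}
changed = {**base, "temperature": 0.2}

print(config_fingerprint(base) == config_fingerprint(reordered))  # True
print(config_fingerprint(base) == config_fingerprint(changed))    # False
```

Sorting keys before hashing means that reordering fields does not create a spurious new version, while any change to a value does.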
A/B Testing Infrastructure
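import hashlib

class PromptExperiment:
    def __init__(self, control, treatment, traffic_split=0.1):
        self.control = control
        self.treatment = treatment
        self.split = traffic_split

    def route(self, request):
        # Built-in hash() is salted per process for strings, so use a stable
        # digest to keep each user in the same arm across restarts.
        digest = hashlib.sha256(str(request.user_id).encode()).hexdigest()
        if int(digest, 16) % 100 < self.split * 100:
            return self.treatment.process(request)
        return self.control.process(request)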
Drift Detection
Models and data drift over time. Detect it early:
Input Drift
import numpy as np
from scipy.spatial.distance import cosine as cosine_distance

# DRIFT_THRESHOLD and alert() are assumed to be defined elsewhere
def detect_input_drift(recent_embeddings, baseline_embeddings):
    recent_centroid = np.mean(recent_embeddings, axis=0)
    baseline_centroid = np.mean(baseline_embeddings, axis=0)
    drift_score = cosine_distance(recent_centroid, baseline_centroid)
    if drift_score > DRIFT_THRESHOLD:
        alert("Input distribution has shifted significantly")
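The value of `DRIFT_THRESHOLD` is not specified here. One possible heuristic, offered as an assumption and not as an established rule, is to measure how far windows of the baseline sit from the full baseline centroid and set the threshold a margin above the largest of those distances. The sketch uses pure-Python stand-ins for the centroid and cosine distance so that it runs without dependencies; `calibrate_threshold`, the `margin` default, and the toy vectors are all hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine_distance(a, b):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def calibrate_threshold(baseline, window, margin=2.0):
    """Largest window-to-baseline centroid distance, scaled by a safety margin."""
    full = centroid(baseline)
    distances = [
        cosine_distance(centroid(baseline[i:i + window]), full)
        for i in range(0, len(baseline) - window + 1, window)
    ]
    return margin * max(distances)

# Toy 2-D "embeddings" for illustration only.
baseline = [[1.0, 0.1], [1.0, 0.1], [1.0, 0.3], [1.0, 0.3]]
recent = [[0.1, 1.0], [0.2, 1.0]]

threshold = calibrate_threshold(baseline, window=2)
drift = cosine_distance(centroid(recent), centroid(baseline))
print(drift > threshold)  # True: the recent batch points in a different direction
```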
Output Quality Monitoring
class OutputMonitor:
    def __init__(self):
        self.metrics = ['coherence', 'relevance', 'safety']

    def evaluate(self, response):
        scores = {}
        for metric in self.metrics:
            scores[metric] = self.evaluate_metric(response, metric)

        # Log for dashboarding
        self.log_metrics(scores)

        # Alert on degradation
        if scores['coherence'] < 0.7:
            self.alert('coherence_degradation')
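Alerting on every individual response that scores below 0.7 can be noisy, since a single odd output does not necessarily mean quality has degraded. A possible refinement, sketched below, is to alert on a rolling mean instead. `RollingAlert` and its `window` parameter are assumptions for illustration; the 0.7 threshold is taken from the monitor above.

```python
from collections import deque

class RollingAlert:
    """Signal only when the rolling mean of a metric drops below a threshold."""

    def __init__(self, threshold=0.7, window=50):
        self.threshold = threshold
        self.scores = deque(maxlen=window)  # keeps only the latest `window` scores

    def update(self, score: float) -> bool:
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        mean = sum(self.scores) / len(self.scores)
        # Stay quiet until the window has filled, to avoid early false alarms.
        return window_full and mean < self.threshold

monitor = RollingAlert(threshold=0.7, window=3)
stream = [0.9, 0.9, 0.6, 0.9, 0.9, 0.5, 0.5, 0.5]
results = [monitor.update(score) for score in stream]
print(results)
# [False, False, False, False, False, False, True, True]
```

The single dip to 0.6 does not trigger an alert, while the sustained run of 0.5 scores does.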
Conclusion
You can’t make LLMs perfectly deterministic, but you can make them predictably non-deterministic. The key is comprehensive testing, careful versioning, and continuous monitoring.
Treat your AI system like a service that might misbehave—because it will. The question is whether you’ll know when it does.