Insights [May 2, 2024]

Deterministic AI: Building Predictable Systems

How to apply software engineering rigor (unit testing, versioning, and drift detection) to non-deterministic large language models.

By Tech Strategist / 3 min read

Large language models are non-deterministic by default: the same input can produce different outputs from one call to the next. This worries traditional software engineers, and for good reason. But proven engineering practices can make AI systems predictable enough for enterprise deployment.

The Determinism Spectrum

Not all sources of non-determinism are equally controllable:

Source                 Controllable?  Strategy
Temperature            Yes            Set to 0
Sampling               Yes            Use greedy decoding
Context window         Partially      Standardize preprocessing
Model updates          No             Version lock + testing
Prompt interpretation  No             Evaluation suites
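
The controllable rows can be pinned in code. The sketch below is a minimal illustration, not a real client API: `DecodingConfig`, its field names, and `standardize` are hypothetical, and the 4000-character limit is only an example value. It fixes the decoding settings in one immutable object and applies identical preprocessing to every input.

```python
import unicodedata
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DecodingConfig:
    """Hypothetical container for the controllable settings (not a real API)."""
    temperature: float = 0.0      # "Set to 0"
    top_k: int = 1                # greedy decoding: always take the top token
    max_input_chars: int = 4000   # illustrative input budget


def standardize(text: str, config: DecodingConfig) -> str:
    """Apply the same preprocessing to every input before it reaches the model."""
    text = unicodedata.normalize("NFC", text)  # one canonical Unicode form
    text = " ".join(text.split())              # collapse runs of whitespace
    return text[: config.max_input_chars]      # truncate to a fixed budget


config = DecodingConfig()
print(standardize("  refund \n\n request\t", config))  # refund request
print(asdict(config))
```

Even with these settings pinned, hosted models can still show small variations between calls, which is why the testing strategies that follow remain necessary.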

Testing Strategies

Unit Tests for Prompts

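def test_classification_prompt():
    # ClassificationPrompt is the application's own wrapper around the model
    # call; it is assumed to run with temperature 0 and to return a label.
    prompt = ClassificationPrompt()

    # Known categories should map to stable labels
    assert prompt.classify("refund request") == "billing"
    assert prompt.classify("can't login") == "technical"

    # Empty or nonsensical input should fall back to "unknown"
    assert prompt.classify("") == "unknown"
    assert prompt.classify("asdfghjkl") == "unknown"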

Regression Tests

Track outputs across model versions:

import numpy as np

class RegressionSuite:
    def __init__(self, golden_set, threshold=0.8):
        self.golden = golden_set  # e.g. 1000 (input, expected output) pairs
        self.threshold = threshold  # scores below this count as regressions

    def evaluate(self, model):
        # similarity() is assumed to return a score in [0, 1]
        scores = [similarity(model(prompt), expected)
                  for prompt, expected in self.golden]

        return {
            'mean_similarity': float(np.mean(scores)),
            'regression_count': sum(s < self.threshold for s in scores),
        }
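
The suite relies on a `similarity` function that is not defined here. One possible stand-in, assuming a purely lexical comparison is acceptable, uses `difflib.SequenceMatcher` from the standard library. A production suite would more likely compare embeddings or use a task-specific scorer, so treat this as a placeholder rather than the intended metric.

```python
from difflib import SequenceMatcher


def similarity(actual: str, expected: str) -> float:
    """Lexical similarity in [0, 1]; 1.0 means identical after normalization."""
    # Lowercase and collapse whitespace so formatting noise is not penalized
    a = " ".join(actual.lower().split())
    b = " ".join(expected.lower().split())
    return SequenceMatcher(None, a, b).ratio()


print(similarity("Your refund was issued.", "your refund   was issued."))  # 1.0
print(similarity("Your refund was issued.", "Please reset your password."))
```

A lexical score treats paraphrases as regressions, so the 0.8 cutoff would need tuning against the golden set before it means anything.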

Behavioral Tests

Test for behaviors, not exact outputs:

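def test_refusal_behavior(model):
    """Model should refuse inappropriate requests"""
    # model is passed in (for example as a pytest fixture) rather than read from a global
    harmful_prompts = load_test_set('harmful_prompts')

    for prompt in harmful_prompts:
        response = model(prompt)
        # Include the prompt in the failure message so a failing case is identifiable
        assert contains_refusal(response), prompt
        assert not contains_harmful_content(response), prompt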
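
A behavioral check such as `contains_refusal` has to tolerate many phrasings of the same behavior. The sketch below is a deliberately simple, hypothetical version built on a few regular expressions. The phrase list is illustrative only, and a real suite would probably use a tuned classifier instead.

```python
import re

# Illustrative refusal phrasings; an assumption, not an exhaustive list.
REFUSAL_PATTERNS = [
    r"\bI can'?t help with\b",
    r"\bI(?: am|'m) (?:not able|unable) to\b",
    r"\bI (?:cannot|won't) (?:help|assist|provide)\b",
]
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)


def contains_refusal(response: str) -> bool:
    """Return True if the response matches any known refusal phrasing."""
    text = response.replace("\u2019", "'")  # treat curly apostrophes as straight
    return _REFUSAL_RE.search(text) is not None


print(contains_refusal("Sorry, I can't help with that request."))  # True
print(contains_refusal("Sure, here is the refund policy."))        # False
```

Because the check looks for a behavior and not an exact string, it stays valid when the model rewords its refusal between runs.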

Versioning LLM Systems

Configuration as Code

# prompt_config.v2.3.yaml
model: gpt-4-0125-preview
temperature: 0
max_tokens: 2048
system_prompt: |
  You are a customer service agent...
preprocessing:
  - normalize_whitespace
  - truncate_to: 4000
postprocessing:
  - extract_json
  - validate_schema: response_v2
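
Versioned configuration is most useful when every logged response can be traced back to the exact config that produced it. One way to do that, sketched here with a hypothetical `config_fingerprint` helper, is to hash a canonical serialization of the parsed config and record the hash with each response. The dictionary below mirrors part of the YAML above and is assumed to be what a YAML parser would return.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Return a short, stable hash of a parsed config (hypothetical helper)."""
    # sort_keys makes the serialization independent of key order
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


config = {
    "model": "gpt-4-0125-preview",
    "temperature": 0,
    "max_tokens": 2048,
    "preprocessing": ["normalize_whitespace", {"truncate_to": 4000}],
}
reordered = dict(reversed(list(config.items())))
changed = {**config, "temperature": 0.2}

print(config_fingerprint(config))
print(config_fingerprint(config) == config_fingerprint(reordered))  # True
print(config_fingerprint(config) == config_fingerprint(changed))    # False
```

Any edit to the prompt, the model name, or a parameter yields a new fingerprint, so an unreviewed change shows up in the logs even if nobody bumped the version in the file name.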

A/B Testing Infrastructure

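import hashlib

class PromptExperiment:
    def __init__(self, control, treatment, traffic_split=0.1):
        self.control = control
        self.treatment = treatment
        self.split = traffic_split  # fraction of users sent to treatment

    def route(self, request):
        # Python's built-in hash() is salted per process for strings, so use a
        # stable digest to keep each user in the same arm across restarts.
        digest = hashlib.sha256(str(request.user_id).encode()).hexdigest()
        if int(digest, 16) % 100 < self.split * 100:
            return self.treatment.process(request)
        return self.control.process(request)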

Drift Detection

Models and data both drift over time. Detect the drift early:

Input Drift

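import numpy as np
from scipy.spatial.distance import cosine as cosine_distance

def detect_input_drift(recent_embeddings, baseline_embeddings):
    # Compare the mean embedding of recent traffic against the baseline
    recent_centroid = np.mean(recent_embeddings, axis=0)
    baseline_centroid = np.mean(baseline_embeddings, axis=0)

    drift_score = cosine_distance(recent_centroid, baseline_centroid)

    # DRIFT_THRESHOLD and alert() are application-defined
    if drift_score > DRIFT_THRESHOLD:
        alert("Input distribution has shifted significantly")
    return drift_score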
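
The check depends on a `DRIFT_THRESHOLD` whose value is left open. One way to choose it, offered as an assumption and not as the author's method, is to measure how much centroid distance appears between random halves of the baseline itself and alert only when recent traffic exceeds that noise level. The sketch uses plain Python lists and synthetic two-dimensional "embeddings" so that it runs without dependencies. `calibrate_threshold` and its parameters are hypothetical.

```python
import math
import random


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def calibrate_threshold(baseline, trials=200, quantile=0.99, seed=0):
    """Estimate the centroid distance that random halves of the baseline show."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        shuffled = baseline[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        scores.append(
            cosine_distance(centroid(shuffled[:half]), centroid(shuffled[half:]))
        )
    scores.sort()
    return scores[min(int(quantile * trials), trials - 1)]


# Synthetic data: the baseline clusters near (1, 0), shifted traffic near (0, 1)
rng = random.Random(1)
baseline = [[1 + rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(200)]
shifted = [[rng.gauss(0, 0.1), 1 + rng.gauss(0, 0.1)] for _ in range(200)]

threshold = calibrate_threshold(baseline)
drift = cosine_distance(centroid(baseline), centroid(shifted))
print(threshold, drift, drift > threshold)
```

A centroid comparison only catches shifts in the mean, so a change in the spread of inputs with the same average would pass unnoticed.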

Output Quality Monitoring

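class OutputMonitor:
    def __init__(self, thresholds=None):
        self.metrics = ['coherence', 'relevance', 'safety']
        # Minimum acceptable score per metric; only coherence is checked by default
        self.thresholds = thresholds or {'coherence': 0.7}

    def evaluate(self, response):
        # evaluate_metric(), log_metrics(), and alert() are assumed to be
        # implemented elsewhere (for example by a subclass)
        scores = {}
        for metric in self.metrics:
            scores[metric] = self.evaluate_metric(response, metric)

        # Log for dashboarding
        self.log_metrics(scores)

        # Alert on degradation
        for metric, minimum in self.thresholds.items():
            if scores[metric] < minimum:
                self.alert(f'{metric}_degradation')
        return scores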
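
Alerting on a single response can be noisy, since one low score may just reflect ordinary output variation. A possible refinement, sketched with a hypothetical `RollingAlert` helper and illustrative parameter values, is to alert only when the mean over a recent window falls below the floor.

```python
from collections import deque


class RollingAlert:
    """Hypothetical helper: flag degradation when a rolling mean drops below a floor."""

    def __init__(self, floor=0.7, window=50, min_samples=10):
        self.floor = floor
        self.min_samples = min_samples
        self.scores = deque(maxlen=window)  # old scores fall out automatically

    def update(self, score: float) -> bool:
        """Record a score and return True if the window mean is below the floor."""
        self.scores.append(score)
        if len(self.scores) < self.min_samples:
            return False  # too few samples to judge
        return sum(self.scores) / len(self.scores) < self.floor


alert = RollingAlert(floor=0.7, window=5, min_samples=3)
for score in [0.9, 0.4, 0.9, 0.5, 0.5, 0.5]:
    print(score, alert.update(score))
```

In this run the isolated 0.4 does not trigger an alert, but the sustained run of 0.5 scores does.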

Conclusion

You can’t make LLMs perfectly deterministic, but you can make them predictably non-deterministic. The key is comprehensive testing, careful versioning, and continuous monitoring.

Treat your AI system like a service that might misbehave—because it will. The question is whether you’ll know when it does.