An AI system without an evaluation harness ships changes the way 1990s software shipped changes — on vibes, with prayers. The teams that ship reliably treat eval as test infrastructure. The harness below is the minimum that survives contact with production.
TL;DR — The Stack
| Component | Purpose | Frequency |
|---|---|---|
| Golden set (100-200 scenarios) | Coverage of important behaviors | On every change |
| LLM-as-judge rubric scorer | Automated quality scoring | Every prompt change, every CI run |
| Production trace replay | Catches silent regressions | Daily |
| Human label calibration | Keeps the judge honest | Quarterly |
| Online quality sampling | Live production health | Continuous, 3-5% sample |
Component 1 — The Golden Set
A golden set is a versioned, hand-curated collection of scenarios with expected behavior.
The size that matters: 100-200 scenarios is the sweet spot for most production systems. Below 50, you miss too many failure modes. Above 500, each additional scenario adds less value than it costs to maintain.
A scenario is a structured record:
```yaml
- id: refund-policy-explanation
  category: billing
  tags: [policy, customer-support, factual]
  user_input: |
    "I want a refund for order #12345 - the product arrived broken."
  user_context:
    role: paying_customer
    tier: standard
    locale: en-US
  expected_behaviors:
    - "Acknowledges the issue with empathy"
    - "References the refund policy by name"
    - "Cites the specific clause about damaged items"
    - "Provides next steps without overpromising"
  forbidden_behaviors:
    - "Promises a refund without policy verification"
    - "Asks the user to email instead of handling it"
  required_citations:
    - "policies/refund-policy.md"
  rubric_weights:
    helpfulness: 0.4
    policy_compliance: 0.3
    tone: 0.2
    groundedness: 0.1
```
Three properties of useful scenarios:
- Behavior-anchored, not output-anchored. Specify what the response should do, not exact wording. Models vary in phrasing; rubrics handle that.
- Negative cases included. Half the value of the golden set is in scenarios where the model should refuse, ask for clarification, or escalate.
- Tagged for slicing. Tags let you ask "did the billing category regress?" without scanning the full set.
The golden set lives in git. Pull requests modify it like any other code. New failure modes become new scenarios; that is how the system learns.
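In practice the suite loads and validates that set on every run. A minimal sketch of what the loading step might look like, assuming the scenarios live as YAML files under a `scenarios/` directory and using pydantic for validation (the directory layout and field names are illustrative, not a prescribed format):

```python
# Sketch: load and validate the golden set in CI.
# Assumes scenarios are YAML files under scenarios/, each file containing
# a list of records like the example above; names here are illustrative.
from pathlib import Path

import yaml
from pydantic import BaseModel


class Scenario(BaseModel):
    id: str
    category: str
    tags: list[str]
    user_input: str
    expected_behaviors: list[str]
    rubric_weights: dict[str, float]
    user_context: dict[str, str] = {}
    forbidden_behaviors: list[str] = []
    required_citations: list[str] = []
    passing_threshold: float = 0.7  # assumed default; override per scenario


def load_golden_set(root: str = "scenarios") -> list[Scenario]:
    scenarios: list[Scenario] = []
    for path in sorted(Path(root).glob("*.yaml")):
        for record in yaml.safe_load(path.read_text()) or []:
            scenarios.append(Scenario(**record))  # raises on malformed records
    return scenarios
```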
Component 2 — LLM-as-Judge
For 100-200 scenarios, human grading is too slow. LLM-as-judge automates it.
```python
async def evaluate_scenario(scenario: Scenario, response: AgentResponse) -> ScenarioResult:
    rubric = build_rubric_prompt(scenario, response)
    judge_response = await judge_llm.complete(
        rubric,
        response_model=RubricScore,
    )
    weighted_score = weight(judge_response.scores, scenario.rubric_weights)
    return ScenarioResult(
        scenario_id=scenario.id,
        scores=judge_response.scores,
        weighted_score=weighted_score,
        explanations=judge_response.explanations,
        passed=weighted_score >= scenario.passing_threshold,
    )


def build_rubric_prompt(scenario: Scenario, response: AgentResponse) -> str:
    return f"""
You are evaluating an AI assistant response.

USER INPUT:
{scenario.user_input}

EXPECTED BEHAVIORS:
{format_list(scenario.expected_behaviors)}

FORBIDDEN BEHAVIORS:
{format_list(scenario.forbidden_behaviors)}

ASSISTANT RESPONSE:
{response.text}

CITATIONS PROVIDED:
{response.citations}

Score each dimension from 0 to 1. Dimensions:
- helpfulness: Did the response solve the user's actual problem?
- policy_compliance: Did it follow all stated policies in EXPECTED BEHAVIORS?
- tone: Was the tone appropriate for the context?
- groundedness: Are all factual claims supported by the citations?

For each dimension, provide a one-sentence explanation.
"""
```
What makes LLM-as-judge reliable:
- Use a judge at least as capable as the model under test. A Haiku judging Opus is unreliable. A Sonnet judging Haiku is fine.
- Be specific in the rubric. "Score helpfulness" is vague; "Score helpfulness — did the response provide actionable next steps the user can take immediately?" is graded consistently.
- Force structured output. The judge returns scores in a typed schema, not free-form prose (a sketch of the schema follows this list).
- Provide chain-of-thought explanations. Useful for debugging when scores look wrong.
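For reference, a sketch of what that typed schema and the `weight` helper used in `evaluate_scenario` above might look like; the pydantic choice and field names are assumptions, so adapt them to whatever structured-output mechanism your judge client supports:

```python
# Sketch of the judge's structured output plus the weighting helper
# referenced in evaluate_scenario above. Names are assumptions.
from pydantic import BaseModel, Field


class RubricScore(BaseModel):
    # One 0-1 score and a one-sentence explanation per rubric dimension.
    scores: dict[str, float] = Field(description="dimension -> score in [0, 1]")
    explanations: dict[str, str] = Field(description="dimension -> one-sentence rationale")


def weight(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Collapse per-dimension scores into a single weighted score."""
    total = sum(weights.values()) or 1.0
    return sum(scores.get(dim, 0.0) * w for dim, w in weights.items()) / total
```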
Component 3 — Production Trace Replay
The golden set covers your imagined scenarios. Production traffic covers your actual scenarios — including the ones you didn't anticipate.
The replay pattern:
- Sample 3-5% of production traces daily.
- Re-run each trace against the current prompt / model configuration.
- Score both the original response and the replayed response with the same rubric.
- Flag scenarios where the score moved.
```python
async def replay_traces_daily():
    yesterday_traces = await trace_store.sample(
        date=yesterday(),
        rate=0.04,
        limit=2000,
    )
    for trace in yesterday_traces:
        original_score = await evaluator.score(trace.original_response, trace.context)
        replay_response = await agent_current.run(trace.user_input, trace.context)
        replay_score = await evaluator.score(replay_response, trace.context)

        if replay_score < original_score - SIGNIFICANT_DELTA:
            await alert.send(
                "Regression detected",
                trace_id=trace.id,
                delta=original_score - replay_score,
            )
```
This catches the most insidious failure mode: prompt change A passes the golden set, ships to production, and quietly degrades 8% of real traffic in a way the golden set didn't cover.
Component 4 — Human Label Calibration
LLM judges drift. Their rubric interpretation shifts subtly across model versions, and they can develop systematic biases that look statistically clean but are actually wrong.
The calibration:
- Quarterly, sample 100-200 scenarios from production with diverse outcomes.
- Have a human (subject matter expert) label them.
- Compare LLM judge scores to human scores.
- If agreement falls below 80% on any dimension, refine the rubric.
This is not a heavy lift — a single afternoon per quarter — but it is what makes LLM-as-judge trustworthy over time.
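The comparison itself is small. A sketch of the agreement check, assuming both the judge and the human score the same sampled scenarios per dimension, and treating scores within 0.2 of each other as agreement (both that tolerance and the 80% bar are knobs to tune):

```python
# Sketch of the quarterly calibration check: per-dimension agreement
# between LLM-judge scores and human labels. Thresholds are assumptions.
from collections import defaultdict


def calibration_report(
    judge_scores: list[dict[str, float]],
    human_scores: list[dict[str, float]],
    tolerance: float = 0.2,
    min_agreement: float = 0.8,
) -> dict[str, dict]:
    agree: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for judge, human in zip(judge_scores, human_scores):
        for dim, human_score in human.items():
            total[dim] += 1
            if abs(judge.get(dim, 0.0) - human_score) <= tolerance:
                agree[dim] += 1
    return {
        dim: {
            "agreement": agree[dim] / total[dim],
            "refine_rubric": agree[dim] / total[dim] < min_agreement,
        }
        for dim in total
    }
```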
Component 5 — Online Quality Sampling
The four components above run offline, on curated scenarios and captured traces. Online sampling is the post-deployment half: continuous monitoring of live production quality.
The pattern:
- 3-5% of production responses scored by the LLM judge in near-real-time.
- Scores feed a rolling 24-hour quality dashboard.
- Alerts fire on >2 standard deviation moves on any dimension.
This is the canary that tells you a provider model update silently broke your tone, or that a new corpus addition is hurting groundedness on a specific category.
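The alert rule itself can stay simple. A sketch, assuming you retain the trailing 24 hours of sampled per-dimension scores as the baseline and compare the most recent window against it (the window sizes and the 2-sigma trigger are the assumptions here):

```python
# Sketch of the online alert check: flag a dimension whose recent mean
# has moved more than 2 standard deviations from the 24h baseline.
import statistics


def dimension_alert(baseline: list[float], recent: list[float]) -> bool:
    if len(baseline) < 30 or not recent:
        return False  # not enough data for a stable baseline
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    return stdev > 0 and abs(statistics.mean(recent) - mean) > 2 * stdev
```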
What to Score
The four metrics that consistently matter across production systems:
| Metric | What it measures | When it matters most |
|---|---|---|
| Groundedness | Claims supported by retrieved evidence | RAG, knowledge systems |
| Helpfulness | Did the response solve the user's actual problem? | Every system |
| Policy compliance | Did it follow the stated rules? | Regulated industries, enterprise |
| Format compliance | Correct structure (JSON, schema, etc.) | API-facing, tool-calling |
Domain-specific metrics layer on top: medical accuracy, code correctness, legal citation validity, etc.
Anti-Patterns
- Eval as an afterthought. Build the harness before the v1 product, not after. Adding it later requires retrofitting traces.
- Vibes-based regression detection. "It looks better" is not a metric. If you can't show a number that moved, you don't know.
- Score averages without slices. Aggregate scores hide regressions on specific categories. Always slice by tag (see the sketch after this list).
- Once-and-done golden sets. Treat them like test suites — they grow with the product. Every bug fix adds a scenario.
- Judge model = production model. Use a different judge so the judge's failure modes don't align with the production model's failure modes.
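Slicing by tag costs almost nothing once results carry scenario IDs. A sketch, reusing the `ScenarioResult` and `Scenario` shapes from the earlier sketches (field names are assumptions):

```python
# Sketch: per-tag mean scores, so a billing-only regression cannot hide
# inside a healthy overall average. Field names follow the earlier sketches.
from collections import defaultdict


def scores_by_tag(results, scenarios_by_id) -> dict[str, float]:
    sums: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for result in results:
        for tag in scenarios_by_id[result.scenario_id].tags:
            sums[tag] += result.weighted_score
            counts[tag] += 1
    return {tag: sums[tag] / counts[tag] for tag in sums}
```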
The 14-Day Build Plan
A focused team builds a production-grade eval harness in two weeks:
| Day | Output |
|---|---|
| 1-2 | Scenario format defined; first 20 scenarios written across diverse categories |
| 3-4 | LLM judge prompt and rubric scoring pipeline; CI integration |
| 5-6 | Golden set expanded to 100 scenarios with negative cases |
| 7-8 | Production trace capture; replay infrastructure |
| 9-10 | Online sampling deployed; quality dashboard live |
| 11-12 | First calibration pass with human labels; rubric refined |
| 13-14 | Documentation, runbooks for "the eval suite is failing" |
Two weeks of focused work, then continuous incremental improvement. The first prompt change that ships with the harness in place pays back the entire build cost in confidence alone.
Frequently Asked Questions
What's the minimum viable AI evaluation harness?
Three components: a 100-200 scenario golden set with expected behavior, an LLM-as-judge rubric scorer running in CI, and a regression replay that re-scores yesterday's production traces against today's prompts. Skip the first and you're shipping blind; skip the second and you can't scale; skip the third and you miss silent regressions.
Can I use the same LLM to evaluate itself?
Yes, with caveats. LLM-as-judge works well when the judge is at least as capable as the model under test, the rubric is specific, and you periodically calibrate against human-labeled samples. A weaker judge model gives unreliable signals.
How often should the eval suite run?
On every prompt change, every model version update, and at least daily on a fresh sample of production traces. Production drift is real — yesterday's pass doesn't guarantee today's pass.
What metrics actually matter for AI quality?
Four that matter most: groundedness (claims supported by evidence), helpfulness (did it solve the user's actual problem), policy compliance (followed the stated rules), and format compliance (correct structure). Track each on a 0-1 scale per scenario.
Are tools like LangSmith / Langfuse worth using?
Yes. Both implement the trace-capture and replay parts cleanly and save weeks of build time. The golden-set and rubric definitions still belong in your repository regardless of which tool stores them.
Key Takeaways
- An eval harness is non-negotiable for production AI — without it, you cannot ship prompt changes with confidence.
- 100-200 golden scenarios cover 80% of the value; the marginal scenario after that adds less.
- LLM-as-judge with a specific rubric is the workhorse — calibrate against human labels quarterly.
- Production trace replay catches silent regressions that golden sets miss.
- Two weeks of focused work builds the harness; every shipping prompt change after that proves its worth.
