An AI system without an evaluation harness ships changes the way 1990s software shipped changes — on vibes, with prayers. The teams that ship reliably treat eval as test infrastructure. The harness below is the minimum that survives contact with production.
TL;DR — The Stack
| Component | Purpose | Frequency |
|---|---|---|
| Golden set (100-200 scenarios) | Coverage of important behaviors | On every change |
| LLM-as-judge rubric scorer | Automated quality scoring | Every prompt change, every CI run |
| Production trace replay | Catches silent regressions | Daily |
| Human label calibration | Keeps the judge honest | Quarterly |
| Online quality sampling | Live production health | Continuous, 3-5% sample |
Component 1 — The Golden Set
A golden set is a versioned, hand-curated collection of scenarios with expected behavior.
The size that matters: 100-200 scenarios is the sweet spot for most production systems. Below 50, you miss too many failure modes. Above 500, each additional scenario adds less value than it costs to maintain.
A scenario is a structured record:
```yaml
- id: refund-policy-explanation
  category: billing
  tags: [policy, customer-support, factual]
  user_input: |
    "I want a refund for order #12345 - the product arrived broken."
  user_context:
    role: paying_customer
    tier: standard
    locale: en-US
  expected_behaviors:
    - "Acknowledges the issue with empathy"
    - "References the refund policy by name"
    - "Cites the specific clause about damaged items"
    - "Provides next steps without overpromising"
  forbidden_behaviors:
    - "Promises a refund without policy verification"
    - "Asks the user to email instead of handling it"
  required_citations:
    - "policies/refund-policy.md"
  rubric_weights:
    helpfulness: 0.4
    policy_compliance: 0.3
    tone: 0.2
    groundedness: 0.1
```
Three properties of useful scenarios:
- Behavior-anchored, not output-anchored. Specify what the response should do, not exact wording. Models vary in phrasing; rubrics handle that.
- Negative cases included. Half the value of the golden set is in scenarios where the model should refuse, ask for clarification, or escalate.
- Tagged for slicing. Tags let you ask "did the billing category regress?" without scanning the full set.
The golden set lives in git. Pull requests modify it like any other code. New failure modes become new scenarios; that is how the system learns.
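In practice the suite loads and validates that set on every run. A minimal sketch of what the loading step might look like, assuming the scenarios live as YAML files under a `scenarios/` directory and using pydantic for validation (the directory layout and field names are illustrative, not a prescribed format):

```python
# Sketch: load and validate the golden set in CI.
# Assumes scenarios are YAML files under scenarios/, each file containing
# a list of records like the example above; names here are illustrative.
from pathlib import Path

import yaml
from pydantic import BaseModel


class Scenario(BaseModel):
    id: str
    category: str
    tags: list[str]
    user_input: str
    expected_behaviors: list[str]
    rubric_weights: dict[str, float]
    user_context: dict[str, str] = {}
    forbidden_behaviors: list[str] = []
    required_citations: list[str] = []
    passing_threshold: float = 0.7  # assumed default; override per scenario


def load_golden_set(root: str = "scenarios") -> list[Scenario]:
    scenarios: list[Scenario] = []
    for path in sorted(Path(root).glob("*.yaml")):
        for record in yaml.safe_load(path.read_text()) or []:
            scenarios.append(Scenario(**record))  # raises on malformed records
    return scenarios
```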
Component 2 — LLM-as-Judge
For 100-200 scenarios, human grading is too slow. LLM-as-judge automates it.
```python
async def evaluate_scenario(scenario: Scenario, response: AgentResponse) -> ScenarioResult:
    rubric = build_rubric_prompt(scenario, response)
    judge_response = await judge_llm.complete(
        rubric,
        response_model=RubricScore,
    )
    weighted_score = weight(judge_response.scores, scenario.rubric_weights)
    return ScenarioResult(
        scenario_id=scenario.id,
        scores=judge_response.scores,
        weighted_score=weighted_score,
        explanations=judge_response.explanations,
        passed=weighted_score >= scenario.passing_threshold,
    )


def build_rubric_prompt(scenario: Scenario, response: AgentResponse) -> str:
    return f"""
You are evaluating an AI assistant response.

USER INPUT:
{scenario.user_input}

EXPECTED BEHAVIORS:
{format_list(scenario.expected_behaviors)}

FORBIDDEN BEHAVIORS:
{format_list(scenario.forbidden_behaviors)}

ASSISTANT RESPONSE:
{response.text}

CITATIONS PROVIDED:
{response.citations}

Score each dimension from 0 to 1. Dimensions:
- helpfulness: Did the response solve the user's actual problem?
- policy_compliance: Did it follow all stated policies in EXPECTED BEHAVIORS?
- tone: Was the tone appropriate for the context?
- groundedness: Are all factual claims supported by the citations?

For each dimension, provide a one-sentence explanation.
"""
```
What makes LLM-as-judge reliable:
- Use a judge at least as capable as the model under test. A Haiku judging Opus is unreliable. A Sonnet judging Haiku is fine.
- Be specific in the rubric. "Score helpfulness" is vague; "Score helpfulness — did the response provide actionable next steps the user can take immediately?" is graded consistently.
- Force structured output. The judge returns scores in a typed schema, not free-form prose (a sketch of the schema follows this list).
- Provide chain-of-thought explanations. Useful for debugging when scores look wrong.
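For reference, a sketch of what that typed schema and the `weight` helper used in `evaluate_scenario` above might look like; the pydantic choice and field names are assumptions, so adapt them to whatever structured-output mechanism your judge client supports:

```python
# Sketch of the judge's structured output plus the weighting helper
# referenced in evaluate_scenario above. Names are assumptions.
from pydantic import BaseModel, Field


class RubricScore(BaseModel):
    # One 0-1 score and a one-sentence explanation per rubric dimension.
    scores: dict[str, float] = Field(description="dimension -> score in [0, 1]")
    explanations: dict[str, str] = Field(description="dimension -> one-sentence rationale")


def weight(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Collapse per-dimension scores into a single weighted score."""
    total = sum(weights.values()) or 1.0
    return sum(scores.get(dim, 0.0) * w for dim, w in weights.items()) / total
```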
Component 3 — Production Trace Replay
The golden set covers your imagined scenarios. Production traffic covers your actual scenarios — including the ones you didn't anticipate.
The replay pattern:
- Sample 3-5% of production traces daily.
- Re-run each trace against the current prompt / model configuration.
- Score both the original response and the replayed response with the same rubric.
- Flag scenarios where the score moved.
```python
async def replay_traces_daily():
    yesterday_traces = await trace_store.sample(
        date=yesterday(),
        rate=0.04,
        limit=2000,
    )
    for trace in yesterday_traces:
        original_score = await evaluator.score(trace.original_response, trace.context)
        replay_response = await agent_current.run(trace.user_input, trace.context)
        replay_score = await evaluator.score(replay_response, trace.context)

        if replay_score < original_score - SIGNIFICANT_DELTA:
            await alert.send(
                "Regression detected",
                trace_id=trace.id,
                delta=original_score - replay_score,
            )
```
This catches the most insidious failure mode: prompt change A passes the golden set, ships to production, and quietly degrades 8% of real traffic in a way the golden set didn't cover.
Component 4 — Human Label Calibration
LLM judges drift. Their rubric interpretation shifts subtly across model versions, and they can develop systematic biases that look statistically clean but are actually wrong.
The calibration:
- Quarterly, sample 100-200 scenarios from production with diverse outcomes.
- Have a human (subject matter expert) label them.
- Compare LLM judge scores to human scores.
- If agreement falls below 80% on any dimension, refine the rubric.
This is not a heavy lift — a single afternoon per quarter — but it is what makes LLM-as-judge trustworthy over time.
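The comparison itself is small. A sketch of the agreement check, assuming both the judge and the human score the same sampled scenarios per dimension, and treating scores within 0.2 of each other as agreement (both that tolerance and the 80% bar are knobs to tune):

```python
# Sketch of the quarterly calibration check: per-dimension agreement
# between LLM-judge scores and human labels. Thresholds are assumptions.
from collections import defaultdict


def calibration_report(
    judge_scores: list[dict[str, float]],
    human_scores: list[dict[str, float]],
    tolerance: float = 0.2,
    min_agreement: float = 0.8,
) -> dict[str, dict]:
    agree: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for judge, human in zip(judge_scores, human_scores):
        for dim, human_score in human.items():
            total[dim] += 1
            if abs(judge.get(dim, 0.0) - human_score) <= tolerance:
                agree[dim] += 1
    return {
        dim: {
            "agreement": agree[dim] / total[dim],
            "refine_rubric": agree[dim] / total[dim] < min_agreement,
        }
        for dim in total
    }
```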
Component 5 — Online Quality Sampling
The four components above run offline, on curated scenarios and captured traces. Online sampling is the post-deployment half: continuous monitoring of live production quality.
The pattern:
- 3-5% of production responses scored by the LLM judge in near-real-time.
- Scores feed a rolling 24-hour quality dashboard.
- Alerts fire on >2 standard deviation moves on any dimension.
This is the canary that tells you a provider model update silently broke your tone, or that a new corpus addition is hurting groundedness on a specific category.
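The alert rule itself can stay simple. A sketch, assuming you retain the trailing 24 hours of sampled per-dimension scores as the baseline and compare the most recent window against it (the window sizes and the 2-sigma trigger are the assumptions here):

```python
# Sketch of the online alert check: flag a dimension whose recent mean
# has moved more than 2 standard deviations from the 24h baseline.
import statistics


def dimension_alert(baseline: list[float], recent: list[float]) -> bool:
    if len(baseline) < 30 or not recent:
        return False  # not enough data for a stable baseline
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    return stdev > 0 and abs(statistics.mean(recent) - mean) > 2 * stdev
```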
What to Score
The four metrics that consistently matter across production systems:
| Metric | What it measures | When it matters most |
|---|---|---|
| Groundedness | Claims supported by retrieved evidence | RAG, knowledge systems |
| Helpfulness | Did the response solve the user's actual problem? | Every system |
| Policy compliance | Did it follow the stated rules? | Regulated industries, enterprise |
| Format compliance | Correct structure (JSON, schema, etc.) | API-facing, tool-calling |
Domain-specific metrics layer on top: medical accuracy, code correctness, legal citation validity, etc.
Anti-Patterns
- Eval as an afterthought. Build the harness before the v1 product, not after. Adding it later requires retrofitting traces.
- Vibes-based regression detection. "It looks better" is not a metric. If you can't show a number that moved, you don't know.
- Score averages without slices. Aggregate scores hide regressions on specific categories. Always slice by tag (see the sketch after this list).
- Once-and-done golden sets. Treat them like test suites — they grow with the product. Every bug fix adds a scenario.
- Judge model = production model. Use a different judge so the judge's failure modes don't align with the production model's failure modes.
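Slicing by tag costs almost nothing once results carry scenario IDs. A sketch, reusing the `ScenarioResult` and `Scenario` shapes from the earlier sketches (field names are assumptions):

```python
# Sketch: per-tag mean scores, so a billing-only regression cannot hide
# inside a healthy overall average. Field names follow the earlier sketches.
from collections import defaultdict


def scores_by_tag(results, scenarios_by_id) -> dict[str, float]:
    sums: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for result in results:
        for tag in scenarios_by_id[result.scenario_id].tags:
            sums[tag] += result.weighted_score
            counts[tag] += 1
    return {tag: sums[tag] / counts[tag] for tag in sums}
```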
The 14-Day Build Plan
A focused team builds a production-grade eval harness in two weeks:
| Day | Output |
|---|---|
| 1-2 | Scenario format defined; first 20 scenarios written across diverse categories |
| 3-4 | LLM judge prompt and rubric scoring pipeline; CI integration |
| 5-6 | Golden set expanded to 100 scenarios with negative cases |
| 7-8 | Production trace capture; replay infrastructure |
| 9-10 | Online sampling deployed; quality dashboard live |
| 11-12 | First calibration pass with human labels; rubric refined |
| 13-14 | Documentation, runbooks for "the eval suite is failing" |
Two weeks of focused work, then continuous incremental improvement. The first prompt change that ships with the harness in place pays back the entire build cost in confidence alone.
Frequently Asked Questions
What's the minimum viable AI evaluation harness?
Three components: a 100-200 scenario golden set with expected behavior, an LLM-as-judge rubric scorer running in CI, and a regression replay that re-scores yesterday's production traces against today's prompts. Skip the first and you're shipping blind; skip the second and you can't scale; skip the third and you miss silent regressions.
Can I use the same LLM to evaluate itself?
Yes, with caveats. LLM-as-judge works well when the judge is at least as capable as the model under test, the rubric is specific, and you periodically calibrate against human-labeled samples. A weaker judge model gives unreliable signals.
How often should the eval suite run?
On every prompt change, every model version update, and at least daily on a fresh sample of production traces. Production drift is real — yesterday's pass doesn't guarantee today's pass.
What metrics actually matter for AI quality?
Four that matter most: groundedness (claims supported by evidence), helpfulness (did it solve the user's actual problem), policy compliance (followed the stated rules), and format compliance (correct structure). Track each on a 0-1 scale per scenario.
Are tools like LangSmith / Langfuse worth using?
Yes. Both implement the trace-capture and replay parts cleanly and save weeks of build time. The golden-set and rubric definitions still belong in your repository regardless of which tool stores them.
Key Takeaways
- An eval harness is non-negotiable for production AI — without it, you cannot ship prompt changes with confidence.
- 100-200 golden scenarios cover 80% of the value; the marginal scenario after that adds less.
- LLM-as-judge with a specific rubric is the workhorse — calibrate against human labels quarterly.
- Production trace replay catches silent regressions that golden sets miss.
- Two weeks of focused work builds the harness; every shipping prompt change after that proves its worth.
