Validate in practice — Internal docs Q&A (DevSquad)

Part 4 · VALIDATE · Scenario 3 of 3


"For a Synthesizer, refusal precision matters as much as answer accuracy. The Synthesizer that fabricates answers under pressure is the worst version of itself."


Setting

Monday morning, week 3. The agent shipped to a 5% canary on Friday. Today the team runs the pre-launch eval suite, the citation-grounding stress tests, and the launch gate decision. If the gates pass, the canary expands to 25% mid-week and to 100% the following week. The Validate phase has two halves: the pre-launch gates (eval suite + grounding tests + DevSquad's review phase) and the first-month operational validation (the four signal metrics + the docs-gap-finding rate + the categorization of the first failures).

This is the most DevSquad-flavored Validate phase of the three scenarios. DevSquad's Phase 7 (Review in an independent context) maps directly to the framework's Validate activity: the team's review agent, running in a fresh sub-agent context, judges agent outputs against the spec acceptance suite. The framework's eval discipline and DevSquad's review-in-independent-context discipline compose because both projects independently arrived at the same principle: the team that builds should not be the only team that judges.


DevSquad mapping at this phase

| AoI Activity | DevSquad Phase |
| --- | --- |
| Validate (this chapter) | DevSquad Phase 6: Learn in the open; DevSquad Phase 7: Review in an independent context |

DevSquad Phase 6 (Learn in the open) is where the team categorizes failures — Cat 1 through Cat 7 — with the trace and the failure-locus analysis. Phase 7 (Review in an independent context) is where the review agent runs the eval suite and the spec-conformance check in a fresh sub-agent context. The framework's pre-launch validation is the start of this two-phase cycle; the first-month operational validation is what the cycle looks like in steady state.


Pre-launch eval suite

Devon assembled the suite over the prior sprint: 200 Q-A pairs curated by the docs team. Each pair is a (factual question, expected canonical answer with the authoritative-doc URL) tuple, structured so the eval can grade both answer accuracy (does the agent's answer match the canonical one?) and citation accuracy (does the agent cite the authoritative URL or one of its cross-linked equivalents?).

The suite is structured per DevSquad's Spec Conformance Testing discipline — each test names the spec section it validates.
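A minimal sketch of how one such test case and its two-part grade might be represented. The names `EvalCase`, `Grade`, and `grade`, the example URL, and the exact-match-after-normalization answer check are all illustrative assumptions, not the team's actual harness; a real suite would likely use a more tolerant answer comparison.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EvalCase:
    """One curated Q-A pair (hypothetical schema)."""
    spec_section: str        # spec section this test validates, e.g. "§11"
    question: str
    canonical_answer: str
    authoritative_url: str
    equivalent_urls: frozenset = field(default_factory=frozenset)  # cross-linked equivalents


@dataclass(frozen=True)
class Grade:
    answer_ok: bool
    citation_ok: bool

    @property
    def passed(self) -> bool:
        return self.answer_ok and self.citation_ok


def _normalize(text: str) -> str:
    # Assumption: answers match if equal after case and whitespace normalization.
    return " ".join(text.lower().split())


def grade(case: EvalCase, answer: str, cited_urls: list[str]) -> Grade:
    """Grade answer accuracy and citation accuracy independently."""
    accepted = {case.authoritative_url} | set(case.equivalent_urls)
    return Grade(
        answer_ok=_normalize(answer) == _normalize(case.canonical_answer),
        citation_ok=any(url in accepted for url in cited_urls),
    )


# Hypothetical case; the question, answer, and URL are made up for illustration.
case = EvalCase(
    spec_section="§11",
    question="What is the default retry limit?",
    canonical_answer="Three retries.",
    authoritative_url="https://docs.example.internal/retries",
)
result = grade(case, "three  retries.", ["https://docs.example.internal/retries"])
```

Keeping the two checks as separate fields lets the roll-up distinguish a correct answer with a weak citation from a wrong answer with the right citation.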

Pass threshold per §9: ≥ 85% on the known-good set and ≥ 90% refusal precision on the held-out 50-question out-of-scope set.

The first run lands at 78% on the known-good set and 84% on the out-of-scope set. Both below threshold. The team categorizes the failures:

| Failure cluster | Count | Cat | Fix locus |
| --- | --- | --- | --- |
| Citation grounding check passed but the cited URL was tangentially related, not authoritative | 12 | Cat 1 | §11: composition rule prefers authoritative source over related |
| Multi-doc questions retrieved partial information across docs but composition didn't merge them well | 8 | Cat 1 | §11: multi-doc composition discipline |
| Refusals on questions whose answers were in the docs but in a non-obvious format (e.g., embedded in a code comment) | 6 | Cat 2 | Retrieval: extend indexer to cover code comments and docstrings |
| The agent occasionally answered HR-adjacent questions before the routing filter fired | 4 | Cat 1 | §3/§4: tighten HR-domain triage |
| Refusal precision failures: the agent refused questions that were answerable from the docs but with low retrieval confidence | 3 | Cat 1 | §11: confidence threshold tuning |
| Mode-marker missing on a small set of refusal responses | 2 | Cat 4 | OVG: tighten mode-marker check |

Twenty-seven Cat 1 amendments, six Cat 2 fixes, two Cat 4 amendments. After the amendments land, the suite re-runs at 88% on the known-good set and 92% on the out-of-scope set, above both thresholds. Each amendment is recorded in the spec evolution log.
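The §9 pass rule applied to both runs can be sketched as a single predicate. The function and constant names are hypothetical; the thresholds and the run results are the ones quoted above.

```python
KNOWN_GOOD_MIN = 0.85         # §9: pass rate on the known-good set
REFUSAL_PRECISION_MIN = 0.90  # §9: refusal precision on the out-of-scope set


def suite_passes(known_good_rate: float, refusal_precision: float) -> bool:
    """Both thresholds must hold; missing either one fails the suite."""
    return (known_good_rate >= KNOWN_GOOD_MIN
            and refusal_precision >= REFUSAL_PRECISION_MIN)


first_run = suite_passes(0.78, 0.84)  # False: both rates are below threshold
rerun = suite_passes(0.88, 0.92)      # True: both rates clear their thresholds
```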

The team observes a pattern early: most of the failures concentrate in §11 — the composition rules. The original §11 was operationally focused on the citation-discipline language, but the multi-doc composition and authoritative-source preference disciplines were under-specified. The team schedules a §11 structural rewrite after the first 30 days of operation, similar to the rewrites in Scenarios 1 and 2.
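The concentration in §11 can be checked with a small tally over the failure-cluster table. This is a sketch only: the locus labels are shortened from the table's fix-locus column, and the list layout is an assumption about how the team might record the clusters.

```python
from collections import Counter

# (failure count, fix locus) for each cluster in the pre-launch table
clusters = [
    (12, "§11"),        # tangential rather than authoritative citation
    (8, "§11"),         # multi-doc composition
    (6, "Retrieval"),   # answers in code comments and docstrings
    (4, "§3/§4"),       # HR-domain triage
    (3, "§11"),         # confidence threshold tuning
    (2, "OVG"),         # mode-marker check
]

by_locus = Counter()
for count, locus in clusters:
    by_locus[locus] += count

total = sum(by_locus.values())                       # 35
top_locus, top_count = by_locus.most_common(1)[0]    # ("§11", 23)
share = top_count / total                            # 23 of 35, roughly two thirds
```

A tally by locus rather than by category is what surfaces the pattern: the Cat 1 label alone does not show that one spec section carries most of the load.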


DevSquad's review agent in independent context

The review agent runs the eval suite and the spec-conformance check in a fresh sub-agent context — meaning it does not have access to the implement-phase context that produced the changes. The judging criteria are driven entirely by the spec acceptance suite (§9) and the spec invariants (§6).

The review agent's output is a structured judgment per task: pass, pass with notes, or fail with reason. The team observes a useful side effect: the review agent occasionally surfaces failure patterns the team's manual review missed because the team had context the review agent didn't. Two examples from the pre-launch run:

  • The review agent flagged a citation-grounding score of 0.74 (just below the 0.75 threshold) and asked whether the threshold itself was too strict for the corpus. The team had been treating the threshold as fixed; the review agent's question prompted a re-tune. The team retrained the grounding classifier on a wider claim-doc set and raised the threshold to 0.78, which improved both the score distribution and the false-rejection rate.
  • The review agent flagged a slice of out-of-scope refusals where the routing pointer named a team that had been renamed three months ago. The team's manual review hadn't caught it because two of the team members had been on the old team and used the old name reflexively. The review agent had no such context and noticed the inconsistency.

Both are Cat 4 amendments (oversight-layer fixes); both make the spec-evolution log.
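The two mechanisms described in this section, context isolation and the three-way structured judgment, might look like the sketch below. Every name here (`Verdict`, `Judgment`, `build_review_context`, and the context keys) is a hypothetical illustration; DevSquad's actual review-agent interface is not shown in this chapter.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    PASS_WITH_NOTES = "pass with notes"
    FAIL = "fail"


@dataclass(frozen=True)
class Judgment:
    task_id: str
    verdict: Verdict
    detail: str = ""  # notes for PASS_WITH_NOTES, reason for FAIL

    def __post_init__(self):
        # Anything other than a clean pass must explain itself.
        if self.verdict is not Verdict.PASS and not self.detail:
            raise ValueError(f"'{self.verdict.value}' requires a note or reason")


# Assumed key names: only the output under review, the acceptance suite (§9),
# and the invariants (§6) are allowed into the review agent's context.
REVIEW_CONTEXT_KEYS = frozenset(
    {"task_output", "spec_acceptance_suite", "spec_invariants"}
)


def build_review_context(available: dict) -> dict:
    """Keep allowlisted inputs only, dropping implement-phase context."""
    return {k: v for k, v in available.items() if k in REVIEW_CONTEXT_KEYS}


context = build_review_context({
    "task_output": "answer text with citations",
    "spec_acceptance_suite": ["§9 cases"],
    "spec_invariants": ["§6 invariants"],
    "implement_phase_notes": "why the change was made",  # filtered out
})
```

An allowlist rather than a blocklist is the safer default here: a new kind of implement-phase artifact stays out of the review context unless someone deliberately admits it.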


The launch gate decision

Wednesday of week 4. The team meets to decide: expand from 5% canary to 25%, or hold?

| Criterion | Target | Actual | Pass? |
| --- | --- | --- | --- |
| Eval suite known-good first-pass | ≥ 85% | 88% | ✅ |
| Out-of-scope refusal precision | ≥ 90% | 92% | ✅ |
| Invariant violations | 0 | 0 | ✅ |
| p95 latency | ≤ 4.0s | 3.2s | ✅ |
| Signal metrics emitting | yes | yes | ✅ |
| Docs-gap-finding feed integrated | yes | yes | ✅ |
| DevSquad review agent acceptance | pass | pass-with-notes | ⚠️ |

All hard gates pass. The review agent's pass-with-notes is on the threshold-tuning point above, which the team has already actioned. The team expands to 25% canary Wednesday afternoon.
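The gate table can be read as a list of checks, each hard or soft. The sketch below assumes that the review agent's acceptance is a soft gate (it warns without blocking), which is an inference from the team proceeding on a pass-with-notes; the `Gate` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Gate:
    name: str
    actual: object
    check: Callable[[object], bool]
    hard: bool = True  # assumption: a failed soft gate warns but does not block


gates = [
    Gate("Eval suite known-good first-pass", 0.88, lambda v: v >= 0.85),
    Gate("Out-of-scope refusal precision", 0.92, lambda v: v >= 0.90),
    Gate("Invariant violations", 0, lambda v: v == 0),
    Gate("p95 latency (s)", 3.2, lambda v: v <= 4.0),
    Gate("Signal metrics emitting", True, lambda v: v is True),
    Gate("Docs-gap-finding feed integrated", True, lambda v: v is True),
    Gate("DevSquad review agent acceptance", "pass-with-notes",
         lambda v: v == "pass", hard=False),
]

blocking = [g.name for g in gates if g.hard and not g.check(g.actual)]
warnings = [g.name for g in gates if not g.hard and not g.check(g.actual)]
decision = "hold" if blocking else "expand"
```

With the values from the table, no hard gate blocks, the review-agent gate lands in `warnings`, and the decision is to expand.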

The 25% canary holds for 5 days with metrics nominal. Promote to 75%. Hold for 5 days with metrics nominal. Promote to 100%. The agent is in full production by end of week 6, available to all ~200 internal engineers.


The first 30 days: signal metrics in operation

Day-30 readings:

| Metric | Day 1 | Day 30 | Target | Trajectory |
| --- | --- | --- | --- | --- |
| First-answer-satisfaction | 73% | 81% | ≥ 80% in 30-day rolling | ✅ at target |
| Refusal precision | 88% | 93% | ≥ 92% | ✅ at target |
| Cost per accepted answer | $0.014 | $0.011 | ≤ $0.012 | ✅ at target |
| Oversight load (reviewer-min/1000 questions) | 4 | 2 | small (no explicit target) | |
| Docs-gap-finding rate (positive signal) | 22% | 18% | rising over time | ⚠️ see below |

Four metrics on track. The docs-gap-finding rate trajectory is the team's most-watched signal, and it requires interpretation. Rising would mean the agent is surfacing new gaps as the asker base widens; flat would mean the docs team is keeping up; falling (which is what happened in the first 30 days) could mean either the docs team is keeping up faster than new gaps surface (good) or the agent is becoming more confident on weak retrieval and therefore refusing less when it should refuse (bad — refusal precision would suffer if so).

The team checks: refusal precision rose from 88% to 93%. The docs-gap-finding rate fell from 22% to 18%. Both signals are positive — the agent is refusing more accurately and finding fewer net-new gaps. The interpretation: the docs team is keeping up. The team confirms this with the docs team — the docs-gap-candidate feed has been actioned 38 times in 30 days, with new docs authored or amended in response. The trajectory will be monitored at day 90 to see if the rate stabilizes or continues falling.
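The reasoning above, where a falling docs-gap-finding rate is ambiguous until it is cross-checked against refusal precision, can be written as a small decision function. This is a sketch under stated assumptions: the function name and the returned labels are hypothetical, and a real implementation would compare rates with a tolerance band rather than exact equality.

```python
def interpret_gap_trend(gap_start: float, gap_end: float,
                        refusal_start: float, refusal_end: float) -> str:
    """Interpret the docs-gap-finding trend using refusal precision as tie-breaker."""
    if gap_end > gap_start:
        return "rising: new gaps are surfacing as the asker base widens"
    if gap_end == gap_start:
        return "flat: the docs team is keeping pace with new gaps"
    # A falling rate is ambiguous on its own, so check refusal precision.
    if refusal_end >= refusal_start:
        return "falling, benign: docs are closing gaps faster than they appear"
    return "falling, suspect: the agent may be answering on weak retrieval"


# Day-1 and day-30 readings from the metrics table.
reading = interpret_gap_trend(0.22, 0.18, 0.88, 0.93)
```

The function only narrows the hypothesis. The team's follow-up with the docs team (38 actioned candidates) is what confirms the benign reading.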


The first month's Cat 1–7 categorization

The team rolls up the spec evolution log entries from the first 30 days of production. Twelve consequential failures, each traced and categorized:

| # | Failure | Cat | Fix locus | DevSquad phase |
| --- | --- | --- | --- | --- |
| 1 | Agent's answer cited a stale doc that was correct when written but had since been deprecated | Cat 1 | §11: composition rule prefers freshness signal | Phase 6 |
| 2 | Multi-doc question's answer was correct but the citations were in inconsistent formats (one URL, one Markdown link) | Cat 1 | Skill file: citation format standardization | Phase 6 |
| 3 | Agent answered a question about a recently-renamed service using the old name from a stale doc | Cat 1 | Retrieval: freshness re-rank weight | Phase 5 → Phase 6 |
| 4 | Refusal precision missed: agent answered confidently on a topic that was thinly documented | Cat 1 | §11: confidence-threshold edge cases | Phase 6 |
| 5 | The docs-gap-candidate feed produced duplicates when the same question was asked by different askers | Cat 2 | Tool: dedup on candidate emission | Phase 5 |
| 6 | Output Validation Gate let through a response missing the mode marker | Cat 4 | OVG: strengthen mode-marker check | Phase 7 |
| 7 | The agent's "no confident answer" responses occasionally pointed to a team that had been merged into another | Cat 1 | Skill file: team-routing source-of-truth | Phase 6 |
| 8 | A code-generation question slipped past the §3 filter because the question was phrased non-obviously ("how would I add caching to this snippet?") | Cat 1 | §3: broaden code-generation triage | Phase 6 |
| 9 | The agent's response was correct but used uncertainty language too aggressively (every claim hedged) | Cat 1 | §11: uncertainty-language calibration | Phase 6 |
| 10 | A retrieval miss on a question whose answer was in a Slack thread the curated archive didn't include | Cat 2 | Retrieval: extend Slack archive coverage | Phase 5 |
| 11 | An asker re-asked the same question 3 times within 2 hours; the docs-gap-candidate didn't escalate | Cat 4 | Routing: re-ask-aware escalation | Phase 6 |
| 12 | A Sonnet 4.6 fallback (low-confidence composition) was over-used; the team noticed the cost trended up before the §4 ceiling fired | Cat 1 | §4 Cost Posture: adjust fallback threshold | Phase 6 |

Eight Cat 1, two Cat 2, two Cat 4. Zero Cat 6 amendments — no model-level failures of consequence. Zero Cat 7 — the agent has no perception/action interface; Cat 7 doesn't apply, as the team noted in the Frame artifact.

The DevSquad-phase column is operationally useful: it tells the team which phase of the DevSquad cycle the failure surfaces in, which informs which agent's prompt or skill file gets amended.
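A roll-up over the twelve log entries by category and by DevSquad phase could be as simple as the sketch below. The tuple layout is an assumption about how the spec evolution log might be exported; the category and phase values are transcribed from the table above.

```python
from collections import Counter

# (failure #, category, DevSquad phase) for each first-month entry
failures = [
    (1, "Cat 1", "Phase 6"), (2, "Cat 1", "Phase 6"),
    (3, "Cat 1", "Phase 5 → Phase 6"), (4, "Cat 1", "Phase 6"),
    (5, "Cat 2", "Phase 5"), (6, "Cat 4", "Phase 7"),
    (7, "Cat 1", "Phase 6"), (8, "Cat 1", "Phase 6"),
    (9, "Cat 1", "Phase 6"), (10, "Cat 2", "Phase 5"),
    (11, "Cat 4", "Phase 6"), (12, "Cat 1", "Phase 6"),
]

by_category = Counter(cat for _, cat, _ in failures)
by_phase = Counter(phase for _, _, phase in failures)

# by_category: Cat 1 = 8, Cat 2 = 2, Cat 4 = 2
# by_phase: Phase 6 = 8, Phase 5 = 2, Phase 7 = 1, "Phase 5 → Phase 6" = 1
```

Grouping by phase points at where to amend: eight of the twelve entries surface in Phase 6, so most amendments land in the prompts and skill files used there.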

The per-sprint roll-up identifies the §11 cluster (5 of 8 Cat 1s). The team confirms the §11 structural rewrite is needed and schedules it for sprint 2, as the teams in Scenarios 1 and 2 also had to.


What the Validate phase produces

By the end of the first 30 days:

  • An eval suite that runs in CI on every spec amendment, with the DevSquad review agent providing independent-context judgment.
  • Pre-launch validation that found 27 Cat 1s before launch.
  • DevSquad-phase-aligned failure tracking that ties each failure to the cycle phase where it surfaced.
  • Five signal metrics emitting (the four standard plus the docs-gap-finding rate as a positive signal).
  • A spec evolution log with 12 categorized failures and their amendments.
  • A pattern finding (the §11 cluster) that drives a structural rewrite rather than a patch series.
  • Confirmation that the docs team is actively absorbing the docs-gap-candidate feed (38 actions in 30 days).
  • Evidence that the citation-grounding check is doing its job — no fabricated-citation incidents in production.

The Validate phase blends into Evolve from here. The same metrics, the same DevSquad phases, the same docs-gap-candidate feed carry forward. The activity changes from one-time launch validation to ongoing closed-loop discipline embedded in DevSquad's Refine continuously phase.


Reading path through this scenario

Conceptual chapters this scenario binds to