Production Telemetry

Part 6 — Operations

"You cannot manage what you cannot measure. For agent systems, what you measure is determined by what you instrument."

Where this sits in v2.0.0: this chapter is part of Part 6 — Operations. Production telemetry is the trace surface that the closed loop requires — without traces, you have an after-the-fact narrative; with them, you have evidence. The Distributed Trace, Cost Tracking, and Anomaly Baseline patterns this chapter integrates feed the Cat-by-Cat categorization and the per-mode failure observability that the running scenarios demonstrate.

Context

Evals measure offline against the spec. Red-team batteries measure offline against threats. Production telemetry is what you can see while the system is running on real traffic. This chapter is the integrated stack: what to capture, what to alert, what to retain, and how the signal feeds back into the spec gap log and the eval suite.

The individual patterns this chapter integrates live in patterns/observability/: Structured Execution Log, Cost Tracking per Spec, Distributed Trace, Anomaly Detection Baseline.

What to instrument

The minimum viable instrumentation. Skipping any row closes a category of post-mortem analysis.

Scope	Required fields
Per task	Correlation ID; spec version; agent version; model + model-version; start/end timestamps; total tokens in/out; cost; outcome (completed / surfaced / errored / timed-out); surface reason if applicable
Per step	Step type (prompt / tool call / tool result / final / surface); tokens in/out; latency (TTFT, generation time); model tier used
Per tool call	Tool name; arguments (sanitized); result schema and size; authorization-check outcome; latency; side-effect summary (the action, not the content)
Per session	Hashed user ID; session start/end; tasks per session; per-session cost roll-up
System	Active spec version; active model version; concurrent tasks; queue depth; cost per hour

What to NOT instrument

The rule: traces should be sufficient for post-mortem and insufficient for surveillance.

Full prompts and outputs — capture by reference (hash + retention bucket) for traces older than 7 days; full content lives in a controlled retention bucket (typically 30–90 days) with access logs.
PII — sanitize at ingestion: regex-redact emails, credit cards, SSNs, internal IDs that map to PII. The sanitization rules live in a constraint library entry inherited by all agents.
Credentials and secrets — never. A credential pattern reaching the trace store is a Cat 1 spec failure that should have been caught upstream.

Alerts vs. monitors

The distinction matters: alerts wake someone up; monitors populate dashboards.

Layer	Examples
Alert (real-time)	Cost spike >2× rolling 24h average; error rate spike >3×; surface rate exceeds spec-declared upper bound for >15 min; sustained tool-layer authorization-refusal spike (potential injection campaign); secret-pattern hit (immediate halt); single task >2× declared wall-clock budget
Monitor (retrospective)	Per-agent first-pass acceptance rate (rolling 7-day); per-tool latency p50/p95/p99; per-tier cost contribution; spec gap log entry rate; eval regression scores; pre/post-spec-change cohort comparisons; token cost decomposition (cached input / uncached input / output)

Waking someone for a dashboard metric is operating wrong; having dashboards but no real-time alerts misses real incidents.

OpenTelemetry GenAI semantic conventions

OpenTelemetry's GenAI semantic conventions (opentelemetry.io/docs/specs/semconv/gen-ai/) define vendor-neutral attribute names: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.finish_reasons, gen_ai.tool.name, gen_ai.tool.call.id. Emit OTel-compliant spans alongside any vendor SDK telemetry. Vendor SDKs change; OTel conventions outlive specific vendors. The cost is small; the benefit is portability.

Standard stack landscape

Choose one and implement it well. Building custom is rarely worth it.

Stack	Best fit
LangSmith	Already on LangChain / LangGraph; turnkey trace UI
Langfuse	Want OSS, self-hostable, framework-agnostic
Phoenix (Arize)	Already on Arize for ML observability; OpenInference
Helicone	Lowest-friction onramp; cost-analytics focus
Datadog LLM Observability	Already standardized on Datadog

The decision question: do you already have an observability backbone? If yes, integrate with it. If no, Langfuse (OSS) or LangSmith (turnkey) are the two starting points.

Connecting telemetry to the rest of the program

Telemetry is not a standalone activity. It feeds three consumers:

Spec gap log (The Living Spec) — anomalies, surfaces, and incident traces become candidate spec-gap entries.
Eval suite (Evals and Benchmarks Level 4) — production traces are the source for golden-set construction.
Red-team protocol (Red-Team Protocol) — the alert layer is what triggers anomaly-driven investigation.

A program with telemetry that doesn't feed those three consumers is collecting data; a program where it does is learning.

Therefore

Production telemetry is a designed system, not a default. Capture per-task essentials (correlation ID, versions, tokens, cost, outcome) and per-step details (tool calls, latency, model tier). Capture content by reference; sanitize PII; never log credentials. Adopt one standard stack rather than building custom. Emit OpenTelemetry GenAI spans for portability. Distinguish alerts (real-time) from monitors (retrospective). Wire the stream into the spec gap log, the eval suite, and the red-team protocol — without that loop, telemetry collects data; with it, the program improves.

References

OpenTelemetry. Semantic conventions for GenAI. opentelemetry.io/docs/specs/semconv/gen-ai.
LangSmith (docs.smith.langchain.com); Langfuse (langfuse.com); Phoenix (arize.com/docs/phoenix); Helicone (helicone.ai); Datadog LLM Observability (docs.datadoghq.com/llm_observability/).
OpenInference initiative. github.com/Arize-ai/openinference.

Connections

This pattern assumes:

This pattern enables:

Cost and Latency Engineering — telemetry is the input that makes cost engineering possible
Evals and Benchmarks — Level 4 production sampling consumes the trace stream
Red-Team Protocol — anomaly investigation depends on the alert layer
The Living Spec — the spec gap log consumes telemetry-driven anomalies as candidate entries

The Architecture of Intent