Health Check and Heartbeat

"Don't send work to a service that's already down."

Context

An agent depends on external services — MCP servers, APIs, databases, knowledge bases. These services may be unavailable, degraded, or slow. The agent discovers this only when it tries to use them, which may be mid-task.

Problem

Without health checks, the agent discovers service unavailability at the worst possible time — during execution, after partial work is already done. The failure interrupts the pipeline, may leave state in an inconsistent condition, and the diagnostic is "the tool didn't respond" rather than "the service was down before we started."

Concrete scenario: A code generation pipeline depends on three services: a code-analysis API, a database of design patterns, and a linter service. At 2:00 AM, the pattern database goes down for brief, unscheduled maintenance. A user triggers the pipeline at 2:01 AM. The orchestrator doesn't know the database is down. The pipeline fetches the code, analyzes it (successful), generates an initial design (successful; state is checkpointed), then tries to enhance the design with patterns and times out waiting for the database. The enhancement fails. The database comes back online at 2:05 AM. By then, the user has waited 4 minutes for a timeout, and the pipeline must be manually resumed. Had a health check run at 2:01, the pipeline would have reported "pattern database unavailable, waiting for recovery" and re-checked on an interval until the database returned at 2:05, then completed cleanly.

Forces

Need to know service status before executing vs. cost and latency of health checks (every check is a network call)
Need to act on degradation (route around it) vs. need to not overreact to transient failures (false positives cause thrashing)
Need long-running services to stay healthy (heartbeats) vs. heartbeat false positives (service stopped heartbeating because the heartbeat endpoint crashed, not the service)
Need to fail before execution vs. need to allow a few retries (some services recover within a second)
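
One common way to balance the second and fourth forces is to debounce probe results, so that a single failed check does not flip a dependency to unhealthy and a single success does not flip it back. The sketch below is illustrative only: the class name `DebouncedStatus` and both threshold values are assumptions, not part of the pattern as described.

```python
from dataclasses import dataclass, field


@dataclass
class DebouncedStatus:
    """Tracks one dependency; changes state only after N consecutive disagreeing probes."""
    failure_threshold: int = 3    # assumed value: failures in a row before marking unhealthy
    recovery_threshold: int = 2   # assumed value: successes in a row before marking healthy
    healthy: bool = True
    _streak: int = field(default=0, init=False, repr=False)

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result and return the debounced state."""
        if probe_ok == self.healthy:
            self._streak = 0      # probe agrees with the current state, so reset the counter
            return self.healthy
        self._streak += 1
        needed = self.recovery_threshold if probe_ok else self.failure_threshold
        if self._streak >= needed:
            self.healthy = probe_ok
            self._streak = 0
        return self.healthy


status = DebouncedStatus()
probes = [False, True, False, False, False, True, True]
results = [status.record(ok) for ok in probes]
# results == [True, True, True, True, False, False, True]
```

The isolated failure at the start is absorbed; only the run of three failures marks the dependency unhealthy, and two successes restore it. Higher thresholds reduce thrashing at the cost of slower detection.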

The Solution

Implement health checks for critical dependencies. Verify availability before dispatching work.

Each MCP server and critical API exposes a health endpoint. The endpoint returns current status: healthy, degraded, or unavailable.
The pipeline checks health before execution. If a critical dependency is unhealthy, the pipeline either waits, falls back, or fails explicitly — before investing in partial execution.
Long-running agents and services send heartbeats. One that hasn't sent a heartbeat within its declared interval is presumed degraded.
Health status feeds into routing. If the primary service is degraded, route to the fallback (if one exists) or queue the request for retry.
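
The heartbeat rule above can be sketched as a small monitor that records the last time each participant checked in. This is a minimal illustration under stated assumptions: `HeartbeatMonitor`, its method names, and the `grace` multiplier are hypothetical, and the clock is injected so the behavior can be exercised without waiting.

```python
import time
from typing import Callable, Dict


class HeartbeatMonitor:
    """Presumes a participant degraded once it is silent longer than its declared interval."""

    def __init__(self, clock: Callable[[], float] = time.monotonic, grace: float = 1.0):
        self._clock = clock
        self._grace = grace                      # assumed knob: values above 1.0 tolerate jitter
        self._interval: Dict[str, float] = {}
        self._last_seen: Dict[str, float] = {}

    def register(self, name: str, interval_seconds: float) -> None:
        """Declare the heartbeat interval; registration counts as the first heartbeat."""
        self._interval[name] = interval_seconds
        self._last_seen[name] = self._clock()

    def beat(self, name: str) -> None:
        """Record a heartbeat from a registered participant."""
        self._last_seen[name] = self._clock()

    def status(self, name: str) -> str:
        silence = self._clock() - self._last_seen[name]
        allowed = self._interval[name] * self._grace
        return "healthy" if silence <= allowed else "degraded"


# Usage with a fake clock ("pattern_worker" is a made-up agent name).
now = [0.0]
monitor = HeartbeatMonitor(clock=lambda: now[0])
monitor.register("pattern_worker", interval_seconds=30)
now[0] = 20.0
early = monitor.status("pattern_worker")   # "healthy": 20 s of silence, 30 s allowed
now[0] = 45.0
late = monitor.status("pattern_worker")    # "degraded": 45 s of silence
```

A missed heartbeat only says that the heartbeat did not arrive, so the sketch reports "degraded" rather than "unavailable", matching the presumption in the text.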

Example: The code generation pipeline. The spec declares:

dependencies:
  - name: "code_analyzer"
    type: "api"
    health_check:
      endpoint: "https://analyzer.internal/health"
      interval_seconds: 30
      timeout_seconds: 2
      required: true
  - name: "pattern_database"
    type: "database"
    health_check:
      endpoint: "pattern-db.internal:5432/health"
      interval_seconds: 30
      timeout_seconds: 3
      required: true
      fallback: "pattern_cache"
  - name: "linter"
    type: "service"
    health_check:
      endpoint: "https://linter.internal/health"
      interval_seconds: 60
      required: false
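
As a rough sketch of how one `health_check` entry from a spec like this might be probed, the function below issues a single HTTP GET bounded by `timeout_seconds`. The response format is an assumption: the text says only that the endpoint returns healthy, degraded, or unavailable, so the JSON body `{"status": "..."}` and the function name `probe` are hypothetical.

```python
import json
import urllib.request

VALID_STATUSES = {"healthy", "degraded", "unavailable"}


def probe(endpoint: str, timeout_seconds: float) -> str:
    """Return the dependency's reported status, or "unavailable" if it cannot be read."""
    try:
        with urllib.request.urlopen(endpoint, timeout=timeout_seconds) as resp:
            body = json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP error responses
        # (urllib.error.URLError and HTTPError); ValueError covers malformed
        # JSON and URLs that urllib cannot parse.
        return "unavailable"
    status = body.get("status") if isinstance(body, dict) else None
    # Assumed response shape: {"status": "healthy" | "degraded" | "unavailable"}
    return status if status in VALID_STATUSES else "unavailable"
```

Treating every unreadable response as "unavailable" is a conservative choice; a real orchestrator might distinguish a timeout from a malformed reply when reporting the root cause.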

At pipeline start (2:01 AM), the orchestrator checks health: code_analyzer ✓, pattern_database ✗ (timeout), linter ✓. Pattern database is required but has a fallback (pattern_cache). The pipeline proceeds using pattern_cache instead of pattern_database. When pattern_database recovers, the next pipeline execution (or a manual retry) uses it again. No timeout, no failure partway through.
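
The decision the orchestrator makes in this walk-through can be sketched as a pre-flight function that maps each declared dependency to the service the pipeline should actually use. The names `Dependency`, `preflight`, and `DependencyUnavailable` are hypothetical, the health probe is passed in as a plain function, and the sketch assumes the fallback is itself checked before being used, which the text does not state.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Dependency:
    name: str
    required: bool
    fallback: Optional[str] = None


class DependencyUnavailable(RuntimeError):
    """Raised before any work starts when a required dependency has no usable route."""


def preflight(deps: List[Dependency],
              is_healthy: Callable[[str], bool]) -> Dict[str, Optional[str]]:
    """Map each dependency to the service to use; None means the optional step is skipped."""
    routes: Dict[str, Optional[str]] = {}
    for dep in deps:
        if is_healthy(dep.name):
            routes[dep.name] = dep.name
        elif dep.fallback is not None and is_healthy(dep.fallback):
            routes[dep.name] = dep.fallback      # route around the unhealthy primary
        elif dep.required:
            raise DependencyUnavailable(
                f"{dep.name} is unavailable and has no healthy fallback")
        else:
            routes[dep.name] = None
    return routes


# The 2:01 AM situation from the example: only pattern_database is down.
deps = [
    Dependency("code_analyzer", required=True),
    Dependency("pattern_database", required=True, fallback="pattern_cache"),
    Dependency("linter", required=False),
]
observed = {"code_analyzer": True, "pattern_database": False,
            "pattern_cache": True, "linter": True}
routes = preflight(deps, lambda name: observed.get(name, False))
# routes["pattern_database"] == "pattern_cache"
```

Because the function raises before returning any routes, a missing required dependency stops the pipeline before the first step runs rather than partway through.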

Resulting Context

Explicit failure-before-execution: pipelines don't begin if a critical service is down and no fallback is available
Graceful degradation is possible: fallback services are used when primaries are unhealthy
Recovery is automatic: health checks run on a regular interval, so a recovered service is picked up again without manual intervention
Root cause is clear: "service was unavailable at execution start" vs. "service timed out during execution"

Therefore

Check critical service health before dispatching agent work. Prefer failing explicitly before execution over failing midway. Expose health endpoints on MCP servers and critical APIs. Route around degraded services when fallbacks exist.

Connections

The MCP Server — MCP servers should expose health endpoints
Graceful Degradation — health checks inform degradation decisions
Event-Driven Agent Activation — health checks prevent dispatching events to unhealthy agents
Sequential Pipeline — pre-execution health checks prevent partial pipeline failures

The Architecture of Intent