Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Designing Agency Responsibly

We are building systems that act. Not systems that respond, not systems that suggest — systems that do things in the world. They write code that runs in production. They send messages to customers. They make decisions that affect people’s work and lives. This is agency, and agency demands responsibility.

This chapter is not about AI ethics in the abstract. It is about the concrete design decisions that determine whether an agentic system is trustworthy.

What Agency Means

Agency is the capacity to act independently in pursuit of a goal. A thermostat has minimal agency — it acts (turns on the heater) independently (without human intervention) in pursuit of a goal (reaching the set temperature). An Agentic OS has significantly more: it interprets goals, plans multi-step strategies, chooses among alternatives, and adapts to feedback.

More agency means more capability. It also means more ways to cause harm.

The harm is rarely dramatic. Agentic systems are unlikely to “go rogue” in cinematic fashion. The real risks are mundane:

  • A system that optimizes for speed and deploys untested code.
  • A system that resolves support tickets by giving customers wrong information.
  • A system that automates away a decision that required human judgment.
  • A system that perpetuates a bias present in its training data or memory.
  • A system that leaks sensitive information by including it in a context window sent to a third-party model.

These are not hypothetical risks. They are engineering failures that happen when agency is granted without adequate design discipline.

Principles for Responsible Agency

1. Autonomy Must Be Earned, Not Assumed

No agentic system should start with full autonomy. The staged autonomy model (Chapter 24) is not just a safety mechanism — it is a trust-building protocol. The system begins with narrow autonomy, demonstrates reliability, and earns broader independence over time.

Trust escalation must be:

  • Observable: The operator can see what the system has done and how well.
  • Gradual: Autonomy increases in small, verifiable steps.
  • Revocable: Trust can be reduced at any time if the system’s performance degrades.
  • Scoped: Higher autonomy in one domain does not automatically grant it in another.

2. Actions Must Be Explainable

An agentic system that cannot explain why it did something is an opaque risk. Every significant action must be traceable to:

  • The intent that motivated it.
  • The plan that included it.
  • The policy that permitted it.
  • The evidence that justified it.

This does not mean the system must produce a philosophical justification for every file read. It means the audit trail must be sufficient for a human to reconstruct the reasoning chain after the fact.

Explainability has a design cost — logging, structured plans, decision metadata. This cost is non-negotiable. A system that is fast but unexplainable is a system that will be shut down after its first serious error.

3. Harm Boundaries Must Be Hard

Governance policies come in two categories: preferences and boundaries.

  • Preferences are guidelines the system should follow but may deviate from when justified. “Keep pull requests under 300 lines” is a preference.
  • Boundaries are rules the system must never violate, regardless of context. “Never expose customer personal data in logs” is a boundary.

Hard boundaries must be enforced architecturally, not just instructionally. Telling a language model “never do X” is not enforcement — it is a suggestion. Enforcement means the tool layer physically cannot perform the action, or the governance middleware blocks it before execution.

Design hard boundaries for:

  • Data privacy violations.
  • Unauthorized access to systems.
  • Actions that bypass approval workflows.
  • Modifications to safety-critical systems without verification.
  • Resource consumption beyond defined limits.

4. The Human Must Remain in Control

Agency is delegated, not transferred. The human operator is always the ultimate authority. This means:

  • Kill switch: The operator can halt all system activity immediately.
  • Override: The operator can override any system decision.
  • Audit: The operator can review any action the system has taken.
  • Reconfiguration: The operator can change the system’s autonomy levels, policies, and boundaries at any time.

“The human is in control” is easy to say and hard to implement at scale. When an Agentic OS has dozens of active workers processing requests in parallel, what does “control” look like? It means:

  • Progress dashboards that show what is happening now.
  • Alert systems that surface anomalies in real time.
  • Approval queues that present decisions cleanly and allow batch processing.
  • Configuration interfaces that make policy changes immediate and system-wide.

Control that requires the operator to monitor every action defeats the purpose of agency. Control must be structural — built into the architecture so that the system’s default behavior is safe, and the operator intervenes only for exceptions.

5. Failure Must Be Visible

An agentic system that fails silently is more dangerous than one that fails loudly. When something goes wrong, the system must:

  • Detect the failure (through checking, validation, monitoring).
  • Report the failure (to the operator, in the audit log, through alerting).
  • Contain the failure (through circuit breakers, transaction rollback, sandboxing).
  • Learn from the failure (through memory updates, policy refinements).

The worst failure mode is one where the system does the wrong thing and everyone believes it did the right thing. This is why post-action validation, result verification, and human review points exist — not because the system is unreliable, but because even reliable systems occasionally fail, and undetected failures compound.

Design Patterns for Responsibility

Graduated Response

Instead of binary decisions (act or don’t act), use graduated responses:

  • Confidence > 95%: Act autonomously.
  • Confidence 70-95%: Act but flag for review.
  • Confidence 40-70%: Propose action, wait for approval.
  • Confidence < 40%: Report uncertainty, ask for guidance.

Thresholds are calibrated per domain. Production database changes might require 99% confidence for autonomous action. Documentation updates might require only 60%.

Red Team Process

For high-stakes systems, include a dedicated adversarial worker that reviews plans before execution:

  • “What could go wrong with this plan?”
  • “What assumptions are we making?”
  • “What is the worst-case outcome?”

The red team worker does not block execution for routine tasks. It activates when the risk assessment exceeds a threshold.

For actions that affect people (sending messages, modifying accounts, making commitments), the system verifies that the operator has consented to the type of action, not just the specific instance:

  • “You have authorized this system to send customer communications. This action sends an email to 450 customers about a service disruption. Proceed?”

The verification message is specific enough for informed consent — it includes the scope, the audience, and the impact.

Impact Logging

Beyond audit trails for compliance, maintain impact logs that track the real-world consequences of actions:

  • “Deployed v3.2.1 → Error rate decreased from 2.1% to 0.3% → 4 related support cases resolved.”
  • “Sent pricing update email → 12 replies received, 3 negative, 9 neutral → No escalations.”

Impact logs close the feedback loop between actions and outcomes, enabling the system to calibrate its confidence and the operator to evaluate the system’s value.

The Ethics of Delegation

When you delegate work to an agentic system, you are not delegating responsibility. The human who authorizes the system to act remains responsible for its actions. This has implications:

  • Operators must understand what they are authorizing. The system must present its capabilities and limitations clearly. An operator who does not understand the system’s behavior cannot meaningfully authorize it.
  • Organizations must define accountability structures. Who is responsible when the system makes a mistake? The operator who authorized it? The intent engineer who designed it? The organization that deployed it? These questions must be answered before the system is deployed, not after an incident.
  • The system must not obscure its nature. When an agentic system communicates with humans (customers, colleagues, external parties), the recipients should know they are interacting with an automated system unless there is a compelling and disclosed reason otherwise.

Building Trust

Trust is the currency of agency. A system with high trust operates with high autonomy, delivering maximum value. A system with no trust operates with no autonomy, delivering no value. Everything in this book — the governance plane, the staged autonomy model, the audit trail, the explainability requirements — is in service of one goal: building and maintaining trust.

Trust is built slowly and lost quickly. One unexplained failure, one policy violation, one data leak can reset trust to zero. This asymmetry is why responsible design is not a feature to be added — it is the foundation on which everything else depends.

Design for trust. Build for transparency. Default to safety. Escalate to humans. And never, ever deploy an agentic system that you would not want to explain to the person affected by its actions.