
The Agentic OS

An Operating System Model for Governed Agency


The next abstraction in software is not the agent. It is the operating system around agency.


What This Book Is About

This book proposes a new abstraction for intelligent software: the Agentic OS.

Rather than treating agents as chatbots with tools, it frames them as governed operational systems composed of a cognitive kernel, isolated process fabric, layered memory, operator fabric, and a governance plane. From first principles to design patterns to reference architectures, the book explores how to build systems that solve problems through intention, reasoning, action, and adaptation — with performance, safety, extensibility, and reuse in mind.

Why This Book Exists

Software is entering a new phase: from executing instructions to operationalizing intent.

Traditional software assumes deterministic instructions. Agentic software must operate under partial information, evolving context, bounded trust, and real-world side effects. The chatbot mental model — a linear conversation with a model that has tools — does not scale. It does not compose. It does not govern.

We need a new operational model. Not just apps. Not just assistants. But governed systems of intention.

That is where the Agentic OS enters.

Who This Book Is For

This book is for software engineers, architects, and technical leaders who are building or evaluating agentic systems and want to move beyond ad-hoc prompt engineering toward principled, composable, and governed designs.

You will benefit most if you:

  • Have experience building software systems and understand why abstractions matter
  • Are working with or evaluating LLM-based agents and multi-agent workflows
  • Want a vocabulary and a set of patterns for designing agentic systems that are reliable, safe, and extensible

How This Book Is Organized

The book follows a deliberate arc:

  1. Theory and Foundations — Why software is changing, why the chatbot model breaks, and why operating systems provide the right conceptual framework.

  2. The Agentic OS Model — The core abstraction: cognitive kernel, process fabric, memory plane, operator fabric, and governance plane.

  3. Design Patterns — A pattern language organized by domain: kernel, process, memory, operator, governance, runtime, and evolution patterns.

  4. Solving Problems the Agentic Way — How to transform requests into intent, decompose problems, and execute with the plan-act-check-adapt loop.

  5. Building the System — Reference architecture, component boundaries, extensibility, and performance.

  6. Case Studies — Concrete applications: Coding OS, Research OS, Support OS, Knowledge OS, and multi-OS coordination.

  7. Toward a New Discipline — From software engineering to intent engineering, responsible agency, and the future of operational intelligence.

Core Thesis

This book stands on three pillars:

Systems thinking. Agentic systems are operational systems, not prompt tricks.

Pattern language. We need reusable abstractions, not ad-hoc workflows.

Governed agency. The future is not autonomous chaos, but structured, bounded, auditable agency.

Agentic systems require the same class of abstractions that made operating systems reliable: isolation, scheduling, memory discipline, permissions, observability, and extensibility. This book maps those abstractions to the world of intelligent software and formalizes them as design patterns you can use, compose, and extend.

Hands-On Implementations

This book comes with working reference implementations you can use today. Each case study in Part VI has a corresponding workspace with real VS Code Copilot agents, skills, MCP configurations, and tutorials:

github.com/marcelaldecoa/TheAgenticOS/implementations

Copy a .github/ folder into your project and the agents are live.

The Shift: From Code Execution to Intent Execution

We do not merely need smarter models. We need better operational systems for intelligence.

The Classical Model

For decades, software has operated on a simple contract: the developer writes explicit instructions, and the machine executes them deterministically. Input flows in, logic applies, output flows out. Every branch is anticipated. Every state is designed. The program does exactly what it is told — nothing more, nothing less.

This model gave us extraordinary things: databases, operating systems, web applications, distributed systems. It is the foundation of the modern digital world. And it rests on a deep assumption:

The developer knows, at design time, the full space of possible behaviors.

The Agentic Model

Agentic software breaks that assumption.

An agentic system does not receive explicit instructions for every scenario. It receives intent — a goal, a constraint, a desired outcome — and must figure out how to achieve it. It must:

  • Interpret ambiguous requests
  • Decompose complex goals into subtasks
  • Choose which tools and resources to use
  • Decide when to act, when to ask, and when to stop
  • Operate under partial information and evolving context
  • Handle real-world side effects with appropriate caution
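The contrast between the two contracts can be made concrete. Below is a minimal, illustrative sketch of the difference between receiving an instruction and receiving intent; all field names are invented for this example and are not part of any reference implementation:

```python
from dataclasses import dataclass, field

# Classical contract: explicit instruction in, deterministic output out.
def classical(x: int) -> int:
    return x * 2  # every behavior anticipated at design time

# Agentic contract: the system receives a goal with constraints,
# and must decide for itself how to achieve it.
@dataclass
class Intent:
    """A structured goal, in contrast to an explicit instruction."""
    goal: str                                              # desired outcome, possibly ambiguous
    constraints: list[str] = field(default_factory=list)   # e.g. "read-only access"
    success_criteria: list[str] = field(default_factory=list)
    risk_profile: str = "low"                              # low | medium | high

intent = Intent(
    goal="summarize last quarter's incidents",
    constraints=["read-only access"],
    success_criteria=["covers every sev-1 incident"],
)
```

The classical function's behavior is fully known at design time; the intent object, by contrast, is only the starting point for interpretation, decomposition, and action.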

This is not a minor evolution. It is a shift in the fundamental contract between human and software.

Why Prompt-Plus-Tools Is Not Enough

The first wave of agentic software bolted tools onto language models and called it a day. “Give the model a function list and let it decide.” This works for demos. It breaks for production.

Why?

  • No isolation. Every tool call shares the same context. One bad invocation poisons the rest.
  • No scheduling. There is no principled way to prioritize, sequence, or parallelize work.
  • No memory discipline. Context grows until it overflows. There is no compression, no tiering, no eviction.
  • No governance. The model decides what to call, with what arguments, without explicit permissions.
  • No observability. You cannot audit what happened, why, or whether it was correct.

Prompt-plus-tools is the equivalent of running all your code in a single process with root access, no filesystem isolation, and no logging. It works until it doesn’t. And when it doesn’t, you have no way to diagnose or recover.

Systems Thinking Is the Answer

The shift from code execution to intent execution demands a new kind of thinking. Not just better prompts. Not just more tools. But systems thinking — the discipline of designing structures that can handle complexity, uncertainty, and change.

Operating systems solved this for classical computing. They introduced processes, memory management, file systems, schedulers, permissions, and drivers — abstractions that made it possible to build reliable, scalable, composable software.

Agentic systems need the same class of abstractions. That is the thesis of this book.

From Programs to Operational Systems

Beyond Procedures

For most of computing history, software has been procedural at its core. Even object-oriented, functional, and event-driven paradigms ultimately compile down to sequences of instructions executed by a processor. The program is a procedure. The runtime is an executor.

Agentic software demands something different. The software is no longer only a procedure — it becomes an operational environment for reasoning and action. The system does not just run code; it interprets goals, retrieves context, selects strategies, delegates tasks, validates results, and adapts.

This is a qualitative shift. The program becomes a system.

What Makes It Operational

An operational system is defined by its ability to:

  1. Accept intent — not just input, but goals with constraints and context
  2. Maintain state — working memory, long-term memory, and operational metadata
  3. Coordinate work — plan, delegate, parallelize, sequence, and synchronize
  4. Govern action — enforce policies, permissions, and boundaries
  5. Adapt over time — learn from outcomes, compress experience, evolve strategies

No single component delivers all of this. It requires a system — a composition of interacting parts with clear responsibilities, boundaries, and contracts.
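The five abilities above can be read as an interface contract. Here is a toy sketch of a system that exhibits all five; the class and method names are illustrative, not a prescribed API:

```python
from typing import Any, Callable

class MinimalOperationalSystem:
    """Toy sketch of the five defining abilities: accept intent, maintain
    state, coordinate work, govern action, adapt over time."""

    def __init__(self, allowed_actions: set[str]):
        self.memory: dict[str, Any] = {}      # 2. maintain state
        self.allowed = set(allowed_actions)   # 4. govern action
        self.outcomes: list[bool] = []        # 5. adapt over time

    def accept_intent(self, goal: str, constraints: list[str]) -> dict:
        # 1. intent is a goal plus constraints, not a bare input
        return {"goal": goal, "constraints": constraints}

    def coordinate(self, steps: list[Callable[[], Any]]) -> list[Any]:
        # 3. sequence work; a real system would also delegate and parallelize
        return [step() for step in steps]

    def govern(self, action: str) -> bool:
        return action in self.allowed

    def adapt(self, succeeded: bool) -> None:
        self.outcomes.append(succeeded)  # track record feeds future decisions
```

Each method here is a single line; in a real system each is an entire subsystem. That gap is exactly why a composition of parts, not a single component, is required.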

The Progression

| Era | Unit of Software | Runtime Model |
|---|---|---|
| Procedural | Function | Execute instructions |
| Object-Oriented | Object | Send messages between objects |
| Service-Oriented | Service | Call APIs across boundaries |
| Agentic | Operational System | Interpret intent, coordinate action |

Each era introduced new abstractions to manage growing complexity. The agentic era requires its own.

Agentic Systems Need Runtime Structure

A language model with tool access is not an operational system. It is a powerful component without structure around it. Structure is what turns a component into a system:

  • A kernel that routes intent and coordinates reasoning
  • Processes that isolate work into bounded, manageable units
  • Memory services that tier, compress, and retrieve context efficiently
  • Operators that provide controlled access to external capabilities
  • Governance that enforces policy, permissions, and auditability

Without this structure, agentic software is fragile, opaque, and ungovernable. With it, agentic software becomes an operational system — composable, observable, and safe.

The Limits of the Chatbot Mental Model

The Default Mental Model

When most people think of AI agents, they think of chatbots. A conversation thread. A user types a message. The model responds. Maybe it calls a tool. The conversation continues. History accumulates.

This mental model is seductive because it is simple. But simplicity is not the same as sufficiency. The chatbot model breaks in fundamental ways when applied to real-world agentic work.

Where It Breaks

Linear History

Chatbots treat interaction as a flat sequence of messages. There is no branching, no parallel threads, no hierarchy of tasks and subtasks. Every exchange is appended to one growing conversation. This makes it impossible to represent the kind of structured, concurrent work that real problem-solving requires.

Poor Isolation

In a chatbot, everything happens in one shared context. A tool call to read a database, a reasoning step about strategy, and a response to the user all occur in the same space. There is no boundary between concerns. A failure in one area can corrupt the others.

No Explicit Permissions

Chatbots do not have a permission model. The model can call any tool it has access to, with any arguments, at any time. There is no concept of “this agent may read but not write,” or “this action requires approval.” The security model is all-or-nothing.

No Real Process Model

There is no concept of spawning a worker, scoping its task, limiting its context, and collecting its result. Every step happens in the main thread with the full context. This is like running an operating system with one process and no memory protection.

Weak Memory Discipline

Context windows fill up. Older messages get truncated or lost. There is no tiered memory. No compression. No selective retrieval. No distinction between working memory, episodic memory, and long-term knowledge. The system forgets randomly rather than strategically.

Low Auditability

When something goes wrong in a chatbot, the debugging experience is reading the transcript. There are no structured logs, no execution traces, no decision records, no policy evaluations. You cannot replay, inspect, or verify what happened and why.

The Cost of the Wrong Model

These are not minor inconveniences. They are architectural limitations that prevent agentic systems from being reliable, safe, and composable at scale. Building production agentic software on the chatbot model is like building a web application with global variables and no error handling. It works in demos. It fails in reality.

We Need a Better Abstraction

The chatbot model is a UI pattern, not an architectural pattern. It describes how humans interact with models, not how systems should be structured. To build agentic systems that work, we need to separate the interaction model from the operational model — and design the operational model with the rigor it deserves.

Why Operating Systems Are the Right Analogy

Agentic systems need the same class of abstractions that made operating systems scalable and reliable.

The Insight

Operating systems are humanity’s most successful answer to a specific class of problem: how do you take a powerful but dangerous resource (a processor, memory, I/O devices) and make it safe, shared, composable, and observable?

Agentic systems face the same class of problem with a different resource: intelligence. An LLM is powerful but dangerous. It can reason, but also hallucinate. It can act, but also cause harm. It can process vast context, but also lose track. It needs the same kind of operational structure that computing hardware needed fifty years ago.

The Mapping

```mermaid
flowchart LR
  subgraph OS["Traditional OS"]
    direction TB
    K1[Kernel]
    P1[Process]
    M1[Memory]
    F1[Filesystem]
    S1[Scheduler]
    C1[Capabilities]
    SC1[Syscalls]
    D1[Drivers]
    A1[Audit Log]
  end
  subgraph AOS["Agentic OS"]
    direction TB
    K2[Cognitive Kernel]
    P2[Subagent / Worker]
    M2[Layered Memory Plane]
    F2[Knowledge Store]
    S2[Task Planner]
    C2[Permission Model]
    SC2[Tool Invocations]
    D2[Operator Adapters]
    A2[Execution Journal]
  end

  K1 --> K2
  P1 --> P2
  M1 --> M2
  F1 --> F2
  S1 --> S2
  C1 --> C2
  SC1 --> SC2
  D1 --> D2
  A1 --> A2
```

| OS Concept | Agentic Equivalent | Purpose |
|---|---|---|
| Kernel | Cognitive Kernel | Routes intent, coordinates reasoning, enforces policy |
| Process | Subagent / Worker | Isolated unit of work with bounded context and lifecycle |
| Memory | Layered Memory Plane | Working, episodic, semantic memory with compression and retrieval |
| Filesystem | Knowledge Store | Persistent, structured access to documents, data, and artifacts |
| Scheduler | Task Planner | Prioritizes, sequences, and parallelizes work |
| Capabilities | Permission Model | Scoped access to tools, data, and actions |
| Syscalls | Tool Invocations | Controlled interface between agents and external capabilities |
| Drivers | Operator Adapters | Translate between the system and specific external services |
| Audit Log | Execution Journal | Records decisions, actions, and outcomes for inspection |

This is not a metaphor. It is a structural correspondence. The problems are isomorphic. The abstractions that solved them for computing solve them for agency.

Where the Analogy Breaks — and Why That Matters

The correspondence above is structural, but not complete. There is one fundamental difference between a traditional OS and an Agentic OS, and acknowledging it honestly is essential to building systems that work.

A traditional OS governs a deterministic substrate. Given the same instructions and state, the CPU produces the same result every time. Memory reads return exactly what was written. I/O follows defined protocols. The kernel can make scheduling and isolation decisions with confidence because the underlying hardware is regular and predictable.

An Agentic OS governs a stochastic substrate. The same prompt may produce different outputs across invocations. Model behavior varies with temperature, context ordering, version changes, and quantization. Tools have variable latency and failure modes. The “processor” at the heart of the system — the language model — is fundamentally nondeterministic.

This changes what governance means. In a traditional OS, the kernel enforces rules and the hardware obeys. In an Agentic OS, the kernel enforces rules and the model mostly complies — but may hallucinate, drift, or produce unexpected outputs despite correct instructions. Governance here is not about controlling a reliable machine. It is about containing an unreliable one.

This is why the analogy must be extended, not merely copied:

  • Scheduling cannot assume deterministic completion times. Task planning must be adaptive, with timeouts calibrated to observed latency distributions rather than fixed deadlines.
  • Isolation is not just about preventing resource conflicts — it prevents correlated failures, where one model hallucination cascades through shared state into other workers.
  • Permissions are necessary but insufficient. A traditional process that has write access will write what it is told. An agent with write access may write something no one asked for. Governance must therefore validate outputs, not just authorize actions.
  • Memory is not just an optimization concern. Because the model’s behavior is context-dependent and context itself is noisy, memory discipline — what enters working memory, what is compressed, what is evicted — directly affects the reliability of the system, not just its efficiency.

The Agentic OS responds to this reality with structures that have no counterpart in traditional operating systems: continuous policy evaluation (not one-time authorization), plan-act-check-adapt loops (not execute-and-return), reflective retry (not blind retry), and staged autonomy (trust earned through observed behavior, not declared in advance). These are not embellishments on the OS model. They are the necessary extensions that make the model work when the substrate is stochastic.
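One of these extensions, reflective retry, is easy to sketch. Blind retry repeats the same call and hopes for a different result; reflective retry folds the validator's critique back into the next attempt. The function names below are hypothetical, standing in for a model call and an output validator:

```python
from typing import Callable, Optional

def reflective_retry(
    act: Callable[[str], str],            # stochastic step, e.g. a model call
    check: Callable[[str], Optional[str]],  # returns a critique, or None if valid
    prompt: str,
    max_attempts: int = 3,
) -> str:
    """Retry with reflection: each failed attempt's critique is appended
    to the next prompt, instead of re-running the identical call."""
    feedback = ""
    for _ in range(max_attempts):
        result = act(prompt + feedback)
        critique = check(result)
        if critique is None:
            return result  # validated output
        feedback = f"\nPrevious attempt failed validation: {critique}"
    raise RuntimeError("exhausted attempts without a validated result")
```

The essential point is that the check phase produces information, not just a pass/fail bit, and that information changes the next action.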

The OS analogy holds — but it holds because we extend it to account for nondeterminism, not because we pretend nondeterminism does not exist.

What OS Abstractions Give Us

Isolation

Processes run in their own address space. Subagents operate in their own context sandbox. One failure does not cascade.

Scheduling

The OS decides which process runs when, based on priority and resources. The Agentic OS decides which task to execute, how to parallelize, and when to preempt.

Memory Discipline

Virtual memory, paging, caching, eviction — the OS manages memory as a tiered resource. The Agentic OS manages context the same way: working memory for immediate tasks, compressed episodic memory for recent history, semantic memory for long-term knowledge.

Permissions

The OS enforces who can read, write, and execute what. The Agentic OS enforces which agents can invoke which tools, access which data, and perform which actions.

Extensibility

Device drivers let the OS support new hardware without changing the kernel. Operator adapters let the Agentic OS support new tools and services without changing the core system.

Observability

System logs, process tables, resource monitors — the OS makes its internal state visible. The Agentic OS needs the same: execution journals, decision traces, policy evaluation records.
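An execution journal in this spirit might be a simple append-only record of decisions and policy evaluations. The sketch below is illustrative; field names and decision labels are invented for the example:

```python
import json
import time

class ExecutionJournal:
    """Append-only record of decisions, actions, and policy evaluations,
    analogous to a system log. Illustrative sketch only."""

    def __init__(self) -> None:
        self.entries: list[dict] = []

    def record(self, actor: str, action: str, decision: str, reason: str) -> None:
        self.entries.append({
            "ts": time.time(),
            "actor": actor,
            "action": action,
            "decision": decision,   # e.g. "allowed", "denied", "escalated"
            "reason": reason,
        })

    def replay(self, actor: str) -> list[dict]:
        # inspect everything one agent did, and why
        return [e for e in self.entries if e["actor"] == actor]

    def export(self) -> str:
        return json.dumps(self.entries, indent=2)
```

Because every entry carries the decision and its reason, the transcript-reading debugging experience of the chatbot model is replaced by structured inspection.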

The Thesis

The operating system abstraction is not one possible framing among many. It is the right framing — the one that naturally produces the properties we need:

  • Performance through scheduling, caching, and context management
  • Efficiency through isolation, bounded execution, and resource envelopes
  • Reusability through well-defined interfaces, operators, and skills
  • Extensibility through registries, adapters, and governed plugin models

The rest of this book develops this analogy into a full architectural framework: the Agentic OS.

Governing Intelligence

The First Deep Principle

Before we design architectures, define patterns, or write code, we must establish a foundational principle:

Intelligence without governance is unsafe.

A system that can reason, plan, and act is powerful. A system that can do all of that without boundaries is dangerous. The history of computing teaches us this lesson repeatedly: power without structure leads to fragility, unpredictability, and harm.

Why Governance Matters More Than Capability

The AI industry is obsessed with capability. Bigger models. More tools. Longer context windows. Faster inference. But capability without governance is like horsepower without brakes. It does not make you faster — it makes you reckless.

Governed intelligence means:

  • Every action has a policy. The system does not just decide what to do; it evaluates whether it is allowed to do it, whether the risk is acceptable, and whether approval is needed.
  • Every agent has boundaries. No agent operates with unlimited scope. Each one has a defined role, a limited context, and explicit permissions.
  • Every decision is auditable. It is not enough to produce the right output. The system must be able to explain how it arrived there and what policies it evaluated along the way.
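The three rules above suggest a shape for policy evaluation: every proposed action is checked against a policy and yields one of allow, escalate, or deny. A hedged sketch, with invented policy fields and risk levels:

```python
from dataclasses import dataclass

@dataclass
class Policy:
    action: str                     # e.g. "db.write"
    max_risk: str                   # highest risk this policy permits autonomously
    requires_approval: bool = False

RISK_ORDER = ["low", "medium", "high"]

def evaluate(policies: dict[str, Policy], action: str, risk: str) -> str:
    """Return 'allow', 'escalate', or 'deny' for a proposed action."""
    policy = policies.get(action)
    if policy is None:
        return "deny"               # no policy means no permission
    if policy.requires_approval:
        return "escalate"           # a human decides
    if RISK_ORDER.index(risk) > RISK_ORDER.index(policy.max_risk):
        return "escalate"           # risk exceeds the autonomous envelope
    return "allow"

policies = {
    "db.read": Policy("db.read", max_risk="medium"),
    "db.write": Policy("db.write", max_risk="low", requires_approval=True),
}
```

Note the default: an action with no policy is denied, not allowed. That inversion of the all-or-nothing chatbot model is the heart of governed intelligence.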

The Spectrum of Autonomy

Governance is not binary. It is a spectrum:

| Level | Description | Example |
|---|---|---|
| Full human control | Agent proposes, human decides | “Here is my plan. Shall I proceed?” |
| Gated autonomy | Agent acts within approved boundaries, escalates otherwise | “I can refactor this file, but renaming the public API requires approval.” |
| Staged autonomy | Agent earns broader permissions over time through track record | “This agent has completed 50 low-risk deployments without incident. Elevate to medium-risk.” |
| Bounded autonomy | Agent acts freely within a defined envelope | “Process up to 100 records per batch. Never delete production data.” |

The art of agentic system design is choosing the right level for each context — and making the transitions explicit, observable, and reversible.
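Making those transitions explicit can be as simple as a promotion rule that maps an agent's track record to a level on the spectrum. The thresholds below are invented for illustration; a real system would tune them per risk class:

```python
def autonomy_level(completed: int, incidents: int) -> str:
    """Map an agent's observed track record to an autonomy level.
    Thresholds are illustrative, not prescriptive."""
    if incidents > 0:
        return "full_human_control"   # any incident reverts to proposal-only
    if completed >= 50:
        return "bounded"              # free action inside a defined envelope
    if completed >= 10:
        return "staged"               # broader permissions, still monitored
    return "gated"                    # acts only within approved boundaries
```

Because the rule is a pure function of observable history, every transition is explicit, auditable, and trivially reversible: remove the history, remove the trust.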

Governance as Architecture

Governance is not a feature you add at the end. It is an architectural concern that shapes every layer of the system:

  • The kernel evaluates policies before routing intent
  • The process fabric enforces context boundaries and capability scoping
  • The memory plane controls what information flows where
  • The operator fabric gates access to tools and external services
  • The governance plane itself provides the rules, permissions, and audit infrastructure

When governance is an afterthought, the system is ungovernable. When governance is architectural, the system is safe by design.

The Principle

Agency requires boundaries. The most capable systems are not the ones with the most freedom, but the ones with the most principled constraints.

This principle will return throughout the book. It shapes every pattern, every architectural decision, and every case study. It is the foundation on which the Agentic OS is built.

What Is an Agentic OS?

An Agentic OS is a governed runtime that interprets intent, coordinates reasoning, manages layered memory, invokes operators safely, and acts within explicit policies.

Definition

An Agentic OS is not a chatbot with tools. It is not a wrapper around a language model. It is not an automation framework with AI steps.

An Agentic OS is a governed operational system designed to:

  1. Interpret intent — Transform ambiguous human requests into structured goals with constraints, resources, risk profiles, and success conditions
  2. Coordinate reasoning — Plan, decompose, delegate, and consolidate cognitive work across multiple specialized workers
  3. Manage memory — Maintain tiered, disciplined memory: working, episodic, semantic, and operational — with compression, retrieval, and eviction strategies
  4. Invoke operators — Access external capabilities (tools, APIs, services) through controlled, permissioned interfaces
  5. Govern action — Enforce policies, evaluate risk, gate permissions, and audit every decision and side effect

What Makes It an “Operating System”

The term is not metaphorical. An Agentic OS provides the same functional categories as a traditional OS, applied to intelligence rather than hardware:

  • Resource management — Context windows, token budgets, and memory tiers are finite resources that must be allocated, shared, and reclaimed
  • Process isolation — Workers must not corrupt each other’s context or assumptions
  • Scheduling — Work must be prioritized, sequenced, and parallelized
  • Permission enforcement — Not every agent should be able to do everything
  • Extensibility — New capabilities must be added without destabilizing the core system
  • Observability — The system’s behavior must be inspectable, auditable, and debuggable

What It Is Not

| It is not… | Because… |
|---|---|
| A prompt template | Templates are static; an OS is a dynamic runtime |
| A tool-calling framework | Frameworks provide plumbing; an OS provides governance |
| An agent orchestrator | Orchestrators coordinate; an OS also isolates, governs, and remembers |
| A chatbot with memory | Chatbots are interaction patterns; an OS is an architectural pattern |

The Central Proposition

If you accept that:

  • Agentic software must operate under uncertainty, partial information, and real-world side effects
  • The chatbot model cannot provide the structure needed for reliable, safe, composable agentic work
  • Operating systems solved an isomorphic problem for computing hardware

Then the Agentic OS is the natural next abstraction. This chapter names it. The following chapters define its layers, components, and patterns.

Core Layers of the Agentic OS

This is the architectural heart of the book. The Agentic OS is composed of distinct layers, each with clear responsibilities, boundaries, and interfaces.

The Stack

```mermaid
block-beta
  columns 1
  EL["Experience Layer\nHuman interface"]
  CK["Cognitive Kernel\nIntent routing, planning, coordination"]
  PF["Process Fabric\nIsolated workers, task lifecycle"]
  columns 2
  MP["Memory Plane\nState & knowledge"]
  OF["Operator Fabric\nTools & services"]
  columns 1
  GP["Governance Plane\nPolicies, permissions, audit"]
  EE["Execution Environment\nInfrastructure, models, runtimes"]

  style EL fill:#1a5740,stroke:#3aaf7a,color:#e0f5ec
  style CK fill:#134a36,stroke:#3aaf7a,color:#e0f5ec
  style PF fill:#134a36,stroke:#3aaf7a,color:#e0f5ec
  style MP fill:#0f3a2c,stroke:#2dd4bf,color:#e0f5ec
  style OF fill:#0f3a2c,stroke:#2dd4bf,color:#e0f5ec
  style GP fill:#2b1f4e,stroke:#a78bfa,color:#e0f5ec
  style EE fill:#0a2a1e,stroke:#3aaf7a,color:#7abfa8
```

Layer Responsibilities

Experience Layer

The boundary between humans and the system. It translates human communication into structured intent and presents system output as meaningful responses. It is a UI concern, not an intelligence concern.

Cognitive Kernel

The brain of the system. It receives interpreted intent, decides how to approach the problem, creates plans, delegates to workers, consolidates results, handles failures, and evaluates policies. It is the scheduler, the coordinator, and the decision-maker.

Process Fabric

The runtime for workers. Each task is executed by an isolated process (subagent) with its own context sandbox, scoped capabilities, and lifecycle. The process fabric manages spawning, monitoring, context boundaries, and result collection.

Memory Plane

The system’s memory infrastructure. It provides tiered storage — working memory for the current task, episodic memory for recent interactions, semantic memory for long-term knowledge, and operational state for system metadata. It handles compression, retrieval, contradiction pruning, and eviction.
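A minimal sketch of tiering with eviction makes the idea concrete. Here, working memory is a small LRU store; evicted entries are compressed into episodic memory. The truncation below is a crude stand-in for real summarization, and all names are illustrative:

```python
from collections import OrderedDict
from typing import Optional

class MemoryPlane:
    """Working memory is small and exact; evicted entries are compressed
    into episodic memory; semantic memory holds durable knowledge."""

    def __init__(self, working_capacity: int = 3):
        self.working: OrderedDict[str, str] = OrderedDict()
        self.episodic: dict[str, str] = {}
        self.semantic: dict[str, str] = {}
        self.capacity = working_capacity

    def remember(self, key: str, value: str) -> None:
        self.working[key] = value
        self.working.move_to_end(key)
        if len(self.working) > self.capacity:
            old_key, old_value = self.working.popitem(last=False)  # evict LRU
            self.episodic[old_key] = old_value[:40]  # stand-in for summarization

    def recall(self, key: str) -> Optional[str]:
        # retrieval falls through the tiers: working, then episodic, then semantic
        for tier in (self.working, self.episodic, self.semantic):
            if key in tier:
                return tier[key]
        return None
```

The point of the sketch is strategic forgetting: eviction is deliberate and lossy-by-design, rather than the random truncation of a chatbot context window.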

Operator Fabric

The system’s interface to the outside world. Tools, APIs, MCP servers, and external services are accessed through operators — controlled, typed, permissioned action surfaces. The operator fabric provides registration, discovery, composition, isolation, and fallback.
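Registration and permissioned invocation might look like the following sketch, where invoking an operator is gated by the caller's capability set the way a syscall is gated by process privileges. Operator names and signatures here are invented:

```python
from typing import Any, Callable

class OperatorFabric:
    """Registry of operators with capability-gated invocation.
    Illustrative sketch only."""

    def __init__(self) -> None:
        self._operators: dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._operators[name] = fn

    def invoke(self, caller_caps: set[str], name: str, *args: Any) -> Any:
        if name not in self._operators:
            raise KeyError(f"unknown operator: {name}")
        if name not in caller_caps:
            raise PermissionError(f"caller lacks capability: {name}")
        return self._operators[name](*args)

fabric = OperatorFabric()
fabric.register("search.web", lambda q: f"results for {q}")
fabric.register("fs.delete", lambda path: f"deleted {path}")
```

A worker holding only the `search.web` capability can search but cannot delete, regardless of what its model decides to attempt.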

Governance Plane

The system’s policy engine. It defines what agents can do, under what conditions, with what approval, and with what audit trail. It enforces capability-based permissions, evaluates risk, manages escalation, and maintains the execution journal.

Execution Environment

The infrastructure that runs everything: LLM providers, embedding models, vector stores, compute resources, network access. This layer is mostly invisible to the rest of the system, abstracted behind clean interfaces.

Design Principles Across Layers

  1. Clear boundaries. Each layer has a defined interface. No layer reaches into another’s internals.
  2. Governance throughout. The governance plane is not a top layer — it cuts across all layers. Every action, at every level, is subject to policy.
  3. Composability. Layers can be replaced, extended, or specialized independently.
  4. Observability. Every layer emits structured telemetry that feeds into the execution journal.

Why Layers Matter

Without layering, agentic systems become monoliths — everything tangled together, impossible to debug, impossible to extend, impossible to govern. Layering provides:

  • Separation of concerns — Each problem is solved in one place
  • Independent evolution — Memory strategies can improve without changing the kernel
  • Testability — Each layer can be tested in isolation
  • Reuse — Layers can be shared across different domain-specific Agentic OS implementations

The following chapters explore each layer in depth.

The Cognitive Kernel

The cognitive kernel is to the Agentic OS what the kernel is to a traditional operating system: the central coordinator of its most critical operations.

Responsibilities

Intent Routing

When a request enters the system, the kernel interprets it and decides how to handle it. This is not simple keyword matching — it involves understanding the goal, identifying constraints, assessing complexity, and choosing an execution strategy.

Decomposition

Complex requests are broken down into subtasks. The kernel decides:

  • What can be done in parallel vs. sequentially
  • What requires specialized workers
  • What depends on external data
  • What needs human approval before proceeding

Planning

The kernel creates an execution plan — a structured representation of what needs to happen, in what order, with what resources, and under what constraints. Plans are not rigid scripts; they are adaptive frameworks that can be revised as execution proceeds.
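A plan that is "an adaptive framework rather than a rigid script" implies a plan that exists as inspectable data. A sketch of one possible representation (field names invented):

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    id: str
    task: str
    depends_on: list[str] = field(default_factory=list)
    status: str = "pending"  # pending | running | done | failed

@dataclass
class Plan:
    """A plan is data, not an implicit chain of thought: it can be
    inspected, revised, and replayed."""
    goal: str
    steps: list[Step] = field(default_factory=list)

    def ready(self) -> list[Step]:
        # steps whose dependencies are all done; these may run in parallel
        done = {s.id for s in self.steps if s.status == "done"}
        return [s for s in self.steps
                if s.status == "pending" and all(d in done for d in s.depends_on)]

plan = Plan(goal="publish report", steps=[
    Step("gather", "collect incident data"),
    Step("draft", "write summary", depends_on=["gather"]),
    Step("review", "validate against criteria", depends_on=["draft"]),
])
```

Revision is then an ordinary data operation: the kernel can insert, reorder, or abandon steps mid-execution without losing the audit trail of what the plan looked like before.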

Delegation

The kernel spawns workers in the process fabric, scoping each one with:

  • A clear task definition
  • The minimum necessary context
  • Explicit capabilities (what tools it can use)
  • Success criteria and failure boundaries
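The scoping rules above amount to a worker specification. A sketch of what the kernel might hand to the process fabric, with a spawn step that enforces the minimum-context rule (all names are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorkerSpec:
    """Everything a spawned worker receives -- and nothing more."""
    task: str                                   # clear task definition
    context: tuple = ()                         # minimum necessary context keys
    capabilities: frozenset = frozenset()       # explicit tool permissions
    success_criteria: tuple = ()                # how the kernel validates output
    max_steps: int = 10                         # failure boundary

def spawn(spec: WorkerSpec, full_context: dict) -> dict:
    # the worker's sandbox holds only the context keys it was scoped to
    sandbox = {k: v for k, v in full_context.items() if k in spec.context}
    return {"task": spec.task, "sandbox": sandbox, "caps": set(spec.capabilities)}
```

The filter in `spawn` is the crucial line: information the worker was not scoped to see never enters its sandbox, so it cannot leak into prompts or outputs.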

Scheduling

When multiple tasks compete for resources (context budget, model access, tool throughput), the kernel prioritizes. It decides what runs now, what waits, and what gets preempted.

Result Consolidation

As workers complete their tasks, the kernel collects, validates, and synthesizes their outputs into a coherent result. It handles conflicts, gaps, and contradictions.

Policy Evaluation

Before every significant action, the kernel consults the governance plane: Is this action allowed? Does it need approval? What is the risk level? This evaluation is continuous, not one-time.

The Kernel Loop

The cognitive kernel operates in a continuous loop:

```mermaid
flowchart LR
  P[Perceive] --> I[Interpret] --> Pl[Plan] --> D[Delegate] --> M[Monitor] --> C[Consolidate] --> A[Adapt]
  A -.->|cycle| P
```

Each cycle may trigger new cycles as the plan evolves, workers report back, or conditions change. The kernel is the system’s executive function — it does not do the work itself, but it decides what work gets done, by whom, with what resources, and under what rules.
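The loop can be sketched as a driver over pluggable phase functions. The phases here are placeholders, not a prescribed interface; the point is the shape, a bounded cycle that ends when consolidation judges the intent satisfied:

```python
def kernel_loop(intent, interpret, plan, delegate, consolidate, adapt,
                max_cycles: int = 5):
    """Perceive/interpret -> plan -> delegate/monitor -> consolidate -> adapt,
    cycling until the result satisfies the intent or the budget runs out."""
    state = None
    for _ in range(max_cycles):
        goal = interpret(intent, state)          # perceive + interpret
        steps = plan(goal)                       # plan
        results = [delegate(s) for s in steps]   # delegate + monitor
        state, done = consolidate(results)       # consolidate (quality gate)
        if done:
            return state
        intent = adapt(intent, state)            # adapt, then cycle again
    raise RuntimeError("cycle budget exhausted without satisfying intent")
```

Two details mirror the text: consolidation, not delegation, decides success; and the cycle budget is a hard failure boundary rather than an open-ended retry.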

What the Kernel Is Not

The kernel is not the language model. The language model is a resource the kernel uses, just as an OS kernel uses the CPU. The kernel is the logic that decides how to use that resource — when to invoke it, with what context, for what purpose, and how to interpret the result.

Design Considerations

  • Keep the kernel lean. The kernel coordinates; it does not execute domain logic. Heavy reasoning is delegated to workers.
  • Make planning explicit. Plans should be inspectable data structures, not implicit chains of thought.
  • Evaluate policy continuously. Do not check permissions once at the start. Check at each decision point.
  • Support plan adaptation. The initial plan is a hypothesis. Workers may discover information that changes everything. The kernel must be able to revise, extend, or abandon the plan.

Governing What You Cannot Predict

A traditional OS kernel issues instructions to a CPU and gets deterministic results. The cognitive kernel has no such guarantee. The language model at the center of the system is stochastic: the same input may produce different outputs, quality varies across invocations, and failure modes are difficult to anticipate. This is the hardest structural problem in agentic system design.

The cognitive kernel addresses this through four mechanisms:

1. Treating every output as a hypothesis. The kernel does not trust the first result from a model or a worker. It validates outputs against success criteria, checks for internal consistency, and compares results against known constraints. The consolidation phase of the kernel loop is not a formality — it is the primary quality gate.

2. Making uncertainty visible. Where a traditional kernel tracks process state (running, ready, blocked), the cognitive kernel also tracks confidence — how reliable was this result? Was it validated? Did it require retries? This metadata flows through the system so that downstream decisions can account for upstream uncertainty. A plan built on high-confidence inputs can proceed autonomously; one built on low-confidence inputs triggers additional validation or human review.

3. Designing for variance, not just errors. Traditional error handling distinguishes success from failure. The cognitive kernel must handle a third category: plausible but wrong. A model may return well-formatted, syntactically valid output that is factually incorrect or subtly misaligned with the goal. The kernel loop’s check phase must evaluate not just “did it complete?” but “does the result actually satisfy the intent?” This requires richer validation — semantic checks, cross-referencing, and sometimes redundant execution with comparison.

4. Earning trust incrementally. The kernel does not grant workers or models a fixed level of trust. It calibrates autonomy based on observed performance. A worker that consistently produces validated results earns broader delegation. One that produces inconsistent results gets tighter oversight, smaller tasks, or a different model. This is staged autonomy applied at the kernel level — the kernel adapts its own coordination strategy based on the empirical reliability of the components it orchestrates.
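
Incremental trust can be sketched as a simple ledger of validated results. The thresholds and autonomy level names below are assumptions chosen for illustration, not a prescribed policy:

```python
from collections import defaultdict

class TrustLedger:
    """Calibrate a worker's autonomy from its observed reliability."""

    def __init__(self):
        self.passed = defaultdict(int)
        self.total = defaultdict(int)

    def record(self, worker: str, validated: bool) -> None:
        self.total[worker] += 1
        self.passed[worker] += int(validated)

    def autonomy(self, worker: str) -> str:
        if self.total[worker] < 5:            # too little evidence: tight oversight
            return "supervised"
        rate = self.passed[worker] / self.total[worker]
        if rate >= 0.9:
            return "broad"                    # consistently validated results
        if rate >= 0.6:
            return "scoped"                   # moderate reliability: bounded tasks
        return "supervised"                   # inconsistent: tighter oversight
```

The kernel consults the ledger at delegation time, so autonomy tracks empirical reliability rather than a fixed grant.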

These mechanisms are not defensive measures bolted onto an otherwise clean design. They are the core of what makes the cognitive kernel different from a traditional OS kernel. The substrate is nondeterministic, so the coordinator must be adaptive. Every plan is a hypothesis. Every output is provisional. Every cycle of the kernel loop is an opportunity to detect divergence, correct course, and maintain coherence in a system that cannot guarantee deterministic behavior from its most fundamental component.

Process Fabric

In an operating system, the process is the fundamental unit of execution: an isolated program running in its own address space, with its own resources, lifecycle, and permissions. The Agentic OS needs the same abstraction.

Subagents as Processes

Every unit of delegated work runs as a subagent — an isolated worker with:

  • Bounded context — Only the information it needs, not the entire conversation history
  • Scoped capabilities — Only the tools and permissions relevant to its task
  • A clear task contract — What it must accomplish, how to report results, and when to escalate
  • A defined lifecycle — It is spawned, runs, completes (or fails), and is terminated

This is fundamentally different from the chatbot model, where everything happens in one monolithic context.

Why Isolation Matters

Without isolation, agentic systems suffer from:

  • Context pollution — Irrelevant information from one task confuses another
  • Capability creep — A worker intended for research ends up with write access to production
  • Failure cascades — One broken subtask corrupts the state of the entire system
  • Debugging nightmares — When everything runs in one space, tracing a problem to its source is nearly impossible

Isolation provides the inverse: clarity, safety, containment, and debuggability.

Process Types

| Type | Purpose | Lifecycle |
|------|---------|-----------|
| Ephemeral Worker | One-shot task, discarded after completion | Short |
| Scoped Worker | Sustained task with defined boundaries | Medium |
| Specialist | Domain expert invoked for specific capabilities | On-demand |
| Reviewer | Validates output of other workers | After primary work |
| Recovery | Handles failures and retries | Triggered by failure |

The Task Contract

Every process operates under a task contract:

flowchart LR
  subgraph Contract["Task Contract"]
    direction TB
    T["Task: What to accomplish"]
    Ctx["Context: Information provided"]
    Cap["Capabilities: Tools & resources available"]
    Con["Constraints: What is not allowed"]
    Suc["Success: How to determine completion"]
    Fail["Failure: Error handling & escalation"]
    Out["Output: What to return to kernel"]
  end
  K[Kernel] -->|defines| Contract
  Contract -->|governs| W[Worker Process]

This contract is the interface between the kernel and the process fabric. It makes delegation explicit, inspectable, and governable.
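
A minimal encoding of the contract mirrors the fields in the diagram. The types and the `permits` helper are illustrative, assuming capabilities are named strings:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskContract:
    task: str                        # what to accomplish
    context: dict                    # information provided
    capabilities: frozenset          # tools & resources available
    constraints: tuple               # what is not allowed
    success: str                     # how to determine completion
    on_failure: str = "escalate"     # error handling & escalation

def permits(contract: TaskContract, operator: str) -> bool:
    """A worker may only invoke operators named in its contract."""
    return operator in contract.capabilities
```

Because the contract is immutable data, the kernel can log it, diff it, and evaluate policy against it before the worker ever runs.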

Lifecycle Management

flowchart LR
  Spawn[Spawn\nCreate worker] --> Monitor[Monitor\nTrack progress & resources]
  Monitor -->|healthy| Monitor
  Monitor -->|completed| Complete[Complete\nCollect results & cleanup]
  Monitor -->|failed| Fail[Failure Handling\nDiagnose & recover]
  Monitor -->|over budget| Term[Terminate\nShut down worker]
  Fail -->|recoverable| Spawn
  Fail -->|escalate| Esc[Escalate to\nkernel or human]
  Complete --> Done[Return results\nto kernel]

The process fabric manages:

  1. Spawning — Creating a new worker with its context and capabilities
  2. Monitoring — Tracking progress, resource usage, and health
  3. Completion — Collecting results and cleaning up resources
  4. Failure handling — Detecting failures, triggering recovery processes, or escalating
  5. Termination — Shutting down workers that exceed their boundaries or budgets

Parallelism

When tasks are independent, the process fabric can run them in parallel. This is one of the key performance advantages of the OS model — tasks that do not depend on each other should not wait for each other.

The cognitive kernel identifies parallelizable tasks during planning. The process fabric executes them concurrently and synchronizes their results.
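
A sketch of that fan-out using `asyncio`, with a stand-in worker; the task names and delays are placeholders:

```python
import asyncio

async def run_worker(name: str, delay: float) -> str:
    # Stand-in for a subagent doing independent work.
    await asyncio.sleep(delay)
    return f"{name}:done"

async def fan_out(tasks: dict) -> dict:
    """Run independent tasks concurrently and synchronize their results.

    asyncio.gather preserves the order of the awaitables, so results can
    be zipped back to task names.
    """
    results = await asyncio.gather(*(run_worker(n, d) for n, d in tasks.items()))
    return dict(zip(tasks, results))

out = asyncio.run(fan_out({"research": 0.01, "summarize": 0.02}))
```

In a real process fabric, each awaitable would be an isolated subagent with its own contract, and the synchronization point would feed the consolidation phase.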

Memory Plane

Memory is one of the hardest problems in agentic systems, and one of the most poorly solved. The chatbot model treats memory as a growing list of messages that eventually overflows. The Agentic OS treats memory as a managed, tiered, disciplined resource.

The Problem

Language models have a fixed context window. Everything the model can “think about” must fit in that window. Without memory discipline, agentic systems face a brutal tradeoff: either keep everything (and overflow) or discard things (and lose critical context).

Operating systems solved this problem decades ago with virtual memory, paging, caching, and tiered storage. The Agentic OS applies the same principles to cognitive context.

Memory Tiers

block-beta
  columns 1
  WM["Working Memory\n(Hot — active context, ephemeral)"]
  EM["Episodic Memory\n(Warm — interaction histories, decision records)"]
  SM["Semantic Memory\n(Cool — domain knowledge, learned patterns)"]
  OS["Operational State\n(System metadata, process states, configs)"]
  AH["Audit History\n(Cold — immutable action log, for humans)"]

  style WM fill:#c0392b,stroke:#e74c3c,color:#fff
  style EM fill:#d35400,stroke:#e67e22,color:#fff
  style SM fill:#2980b9,stroke:#3498db,color:#fff
  style OS fill:#27ae60,stroke:#2ecc71,color:#fff
  style AH fill:#2c3e50,stroke:#34495e,color:#e8f0fe

Working Memory

The immediate context of the current task. This is what the active process can “see” right now. It is small, focused, and ephemeral.

  • Current task definition
  • Relevant retrieved context
  • Intermediate reasoning results
  • Active plan state

Episodic Memory

Records of what has happened. Structured summaries of past interactions, decisions, and outcomes. Episodic memory answers: “What did we do before?”

  • Compressed interaction histories
  • Decision records
  • Outcome summaries
  • Failure logs

Semantic Memory

Long-term knowledge that is not tied to specific interactions. Facts, concepts, domain knowledge, learned patterns. Semantic memory answers: “What do we know?”

  • Domain knowledge bases
  • Learned patterns and heuristics
  • Entity relationships
  • Organizational knowledge

Operational State

System-level metadata about the current state of the Agentic OS itself. Not task content, but system health and status.

  • Active processes and their states
  • Resource utilization
  • Policy configurations
  • Pending approvals

Audit History

An immutable record of every action, decision, and policy evaluation. This is not for the model to reason over — it is for humans to inspect, debug, and verify.

Memory Operations

| Operation | Purpose |
|-----------|---------|
| Store | Write information to the appropriate tier |
| Retrieve | Pull relevant information into working memory |
| Compress | Summarize detailed records into compact representations |
| Evict | Remove information that is no longer needed |
| Reconcile | Resolve contradictions between memory tiers |
| Prune | Remove outdated or contradicted information |

Memory Discipline

The key insight is that memory must be managed, not just accumulated. This means:

  • Selective retrieval — Only pull into working memory what the current task needs
  • Strategic compression — Summarize rather than discard; preserve the signal, reduce the noise
  • Contradiction detection — When new information conflicts with stored memory, resolve it explicitly
  • Budget enforcement — Each process has a context budget; the memory plane ensures it is respected
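
Selective retrieval under a budget can be sketched as a greedy pack: items carry a relevance score and a token cost, and the memory plane admits the most relevant items that fit. The scoring and the greedy strategy are illustrative assumptions; real systems would also compress, not just select:

```python
def pack_context(items: list, budget: int) -> list:
    """Greedy selective retrieval.

    `items` is a list of (text, relevance, tokens); returns the texts
    admitted into working memory without exceeding the token budget.
    """
    chosen, used = [], 0
    for text, _, tokens in sorted(items, key=lambda i: -i[1]):
        if used + tokens <= budget:     # admit only if it fits the envelope
            chosen.append(text)
            used += tokens
    return chosen
```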

The OS Parallel

Just as an OS pages memory to disk when RAM is full and pages it back when needed, the memory plane moves context between tiers:

  • Hot context lives in working memory (fast, small)
  • Warm context lives in episodic memory (summarized, retrievable)
  • Cold context lives in semantic memory (indexed, searchable)
  • Frozen context lives in audit history (immutable, archival)

This tiering is what makes agentic systems efficient. Without it, context windows overflow and performance collapses.

Operator Fabric

An agentic system that cannot act on the world is just a thinking machine. The operator fabric is how the Agentic OS interacts with external capabilities — tools, APIs, databases, file systems, MCP servers, and any other service that produces side effects.

Operators as Controlled Action Surfaces

In an operating system, programs interact with hardware through syscalls — controlled, typed, permissioned interfaces. The program never touches the hardware directly. The kernel mediates every interaction.

The Agentic OS applies the same principle. Agents do not call tools directly. They invoke operators — controlled action surfaces that:

  • Have defined inputs and outputs
  • Are registered and discoverable
  • Are subject to governance (permissions, policies, approval flows)
  • Provide isolation (failures in one operator do not crash the system)
  • Are composable (operators can be chained and combined)

Tools as Semantic Syscalls

A tool invocation in an agentic system is structurally equivalent to a syscall:

flowchart LR
  subgraph OS["Traditional OS"]
    direction LR
    A1["write(fd, buffer, count)"] --> B1["Controlled I/O"]
  end
  subgraph AO["Agentic OS"]
    direction LR
    A2["search(query, max_results)"] --> B2["Controlled Action"]
  end

Both follow the same pattern: a typed interface, mediated by the kernel, subject to permissions, with structured results.

MCP as Peripheral Subsystems

Model Context Protocol (MCP) servers are like peripheral subsystems in an OS — external devices with their own drivers and protocols. The operator fabric provides the adapter layer that:

  • Discovers available MCP servers
  • Translates between the kernel’s internal representations and MCP protocols
  • Handles connection lifecycle, errors, and retries
  • Enforces governance on MCP interactions

The Operator Registry

All available operators are registered in a central registry:

classDiagram
  class file_read {
    +Input: path: string
    +Output: content: string
    +Permissions: read
    +Risk: low
  }
  class database_write {
    +Input: table: string, record: object
    +Output: id: string, success: boolean
    +Permissions: write
    +Risk: medium
    +Approval: required for production
  }
  class OperatorRegistry {
    +discover(capability) operators[]
    +register(operator) void
    +getPermissions(operator) capability[]
  }
  OperatorRegistry --> file_read
  OperatorRegistry --> database_write

The registry enables discovery, documentation, and governance. The kernel consults it when deciding which operators a worker can access.

Composition Over Improvisation

One of the most powerful aspects of the operator fabric is composition. Rather than asking the model to improvise complex multi-step workflows, the system composes operators:

  • Operator chains — Sequential pipelines (fetch → transform → store)
  • Operator sets — Groups of related operators exposed as a unit
  • Skills — Higher-level recipes that combine multiple operators with logic

Composition is more reliable than improvisation because each step is typed, tested, and governed.
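
An operator chain is just function composition over typed steps. The three operators below are hypothetical stand-ins for the fetch → transform → store pipeline:

```python
from functools import reduce

def chain(*operators):
    """Compose operators into a sequential pipeline."""
    def pipeline(value):
        # Each operator receives the previous operator's output.
        return reduce(lambda acc, op: op(acc), operators, value)
    return pipeline

# Hypothetical operators; names and shapes are illustrative.
fetch = lambda url: {"url": url, "body": "raw"}
transform = lambda doc: {**doc, "body": doc["body"].upper()}
store = lambda doc: f"stored:{doc['url']}"

run = chain(fetch, transform, store)
```

Because each step has a defined input and output, the chain can be validated and governed step by step instead of improvised by the model.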

Operator Isolation

When an operator fails, the failure is contained:

  • The operator’s error is captured and returned as a result
  • The calling process decides how to handle it (retry, fallback, escalate)
  • Other operators and processes are unaffected
  • The failure is logged in the execution journal

This is the same fault isolation that OS drivers provide — a bad driver should not crash the kernel.

Governance Plane

The governance plane is arguably the most important layer of the Agentic OS. Without it, every other layer is unsafe. With it, the system becomes trustworthy.

Why Governance Is a Plane, Not a Layer

Governance is not a single tier in the stack. It cuts across all layers. The kernel evaluates policy. The process fabric enforces capability scoping. The memory plane controls information flow. The operator fabric gates tool access. Governance is pervasive — it is the circulatory system, not a single organ.

We call it a “plane” because it exists in a different dimension than the vertical stack. Every component interacts with it. Every action passes through it.

Core Components

Capability-Based Permissions

Inspired by capability-based security in operating systems, each agent or worker receives an explicit set of capabilities — tokens that grant specific, scoped permissions.

flowchart LR
  subgraph Granted["Granted"]
    R["file.read\nscope: src/**"]
    D["git.diff\nscope: current-branch"]
    C["comment.create\nscope: pull-request"]
  end
  subgraph Denied["Denied"]
    W["file.write"]
    P["git.push"]
    Dep["deploy.*"]
  end
  Worker["Worker: code-reviewer"] --> Granted
  Worker -.->|blocked| Denied

Capabilities are granted, not assumed. No agent starts with full access.

Permission Gates

Before any action with side effects, the system evaluates a permission gate:

  1. Does the agent have the required capability?
  2. Does the action comply with the current policy?
  3. Does the risk level require human approval?
  4. Is the action within the resource envelope?

If any gate fails, the action is blocked and the decision is logged.
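
The four gates can be evaluated in order, blocking on the first failure. The risk levels and the approval rule below are illustrative assumptions:

```python
def permission_gate(capabilities: set, action: str, risk: str,
                    approved: bool, cost: int, budget: int):
    """Evaluate the four gates in order; return (allowed, reason)."""
    if action not in capabilities:           # 1. capability check
        return False, "missing capability"
    if risk == "forbidden":                  # 2. policy compliance
        return False, "policy violation"
    if risk == "high" and not approved:      # 3. human approval for risky actions
        return False, "needs human approval"
    if cost > budget:                        # 4. resource envelope
        return False, "over resource envelope"
    return True, "allowed"
```

The returned reason is what gets written to the execution journal when an action is blocked.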

Approval Flows

Some actions are too important for automated approval. The governance plane supports approval flows:

  • Synchronous approval — The system pauses and waits for human confirmation
  • Asynchronous approval — The system queues the action and continues with other work
  • Delegated approval — A higher-privilege agent or role approves on behalf of the human

Trust Boundaries

The system defines explicit trust boundaries:

  • Internal boundary — Between core system components (kernel, memory, etc.)
  • Process boundary — Between the kernel and spawned workers
  • External boundary — Between the system and external tools/services
  • Human boundary — Between the system and human users/operators

Data and actions crossing boundaries are validated, sanitized, and logged.

Auditability

Every action carried out by the system is recorded in the execution journal:

  • What was done
  • Why (which intent, which plan step)
  • By whom (which agent, which process)
  • Under what policy
  • With what result
  • What side effects occurred

The execution journal is immutable. It is the system’s black box — the authoritative record of everything that happened.

Escalation

When the system encounters a situation it cannot resolve within its current permissions or confidence:

  1. It identifies the escalation trigger (policy violation, confidence threshold, risk level)
  2. It packages the context for escalation (what happened, what it wanted to do, why it is escalating)
  3. It routes the escalation to the appropriate authority (human, higher-privilege agent, policy engine)
  4. It waits for resolution before proceeding

Escalation is not a failure. It is a design feature. The best agentic systems escalate well.

The Governance Loop

flowchart LR
  Intent --> PC[Policy Check]
  PC -->|allowed| Action --> SE[Side Effect] --> Audit --> Feedback
  PC -->|denied| EB[Escalate or Block]

This loop runs for every action, at every level. It is the heartbeat of a trustworthy system.

Kernel Patterns

These patterns govern how the cognitive kernel interprets intent, plans work, and coordinates execution.


Intent Router

Intent

Route incoming requests to the appropriate execution strategy based on their nature, complexity, and requirements.

Context

When the kernel receives a new request, it must decide: Is this a simple task or a complex one? Does it need decomposition? Which specialists are required? What governance applies?

Forces

  • Requests vary enormously in complexity and type
  • Misrouting wastes resources or produces poor results
  • Over-analysis of simple requests adds unnecessary latency

Structure

The intent router classifies requests along dimensions: complexity (simple → compound → complex), domain (code, research, operations), risk level, and required capabilities. Each classification maps to an execution strategy.

flowchart TD
  R[Request] --> C{Classify}
  C -->|Simple| D[Direct Execution]
  C -->|Compound| P[Decompose & Parallel]
  C -->|Complex| F[Full Planning]
  D --> O[Output]
  P --> Cons[Consolidate] --> O
  F --> Plan[Plan-Act-Check-Adapt] --> Cons

Dynamics

Simple requests are executed directly. Compound requests are decomposed into independent subtasks. Complex requests trigger full planning with iterative refinement.
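
A routing sketch over the three branches in the diagram. The classification heuristics (step count, independence flag) are assumptions for illustration; a real router would typically use the model itself to classify:

```python
def route(request: dict) -> str:
    """Map a classified request to an execution strategy."""
    steps = request.get("estimated_steps", 1)
    if steps <= 1:
        return "direct"                  # Simple: execute immediately
    if request.get("independent_subtasks", False):
        return "decompose-parallel"      # Compound: fan out and consolidate
    return "full-planning"               # Complex: plan-act-check-adapt
```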

Benefits

Efficient resource usage. Simple things stay simple. Complex things get the structure they need.

Tradeoffs

Routing logic itself becomes a point of failure. Misclassification leads to under- or over-engineering a response.

Failure Modes

A complex request classified as simple produces shallow results. A simple request classified as complex wastes resources and adds latency.

Planner-Executor Split, Policy-Aware Scheduler


Planner-Executor Split

Intent

Separate the decision of what to do from the execution of how to do it.

Context

When a task requires multiple steps, reasoning about the plan and executing the steps are different cognitive activities. Mixing them leads to plans that drift and executions that lack coherence.

Forces

  • Planning requires broad context and strategic thinking
  • Execution requires focused context and precision
  • Tight coupling between planning and execution makes adaptation difficult

Structure

The planner creates a structured plan: a sequence (or graph) of steps with dependencies, required capabilities, and success criteria. The executor takes each step and carries it out within a focused worker. The planner reviews results and adapts the plan.

Dynamics

Plan → Execute step 1 → Review → Adapt plan → Execute step 2 → … → Consolidate

Benefits

Plans are inspectable and adjustable. Execution is focused. Adaptation is explicit.

Tradeoffs

The overhead of maintaining a plan is not justified for trivial tasks. The planner must be invoked again after each step, adding latency.

Failure Modes

The planner produces an overly detailed plan that constrains the executor unnecessarily. The executor deviates from the plan without reporting back, causing the planner to lose coherence. The plan-review cycle becomes a bottleneck when every step requires a full replanning pass.

Intent Router, Execution Loop Supervisor, Reflective Retry


Policy-Aware Scheduler

Intent

Prioritize and sequence work based on both task requirements and governance policies.

Context

When the kernel has multiple tasks to execute, it must decide what runs first, what runs in parallel, and what is blocked by policy (e.g., requires human approval).

Forces

  • Some tasks are urgent but risky
  • Some tasks are safe but low priority
  • Policy constraints can block otherwise-ready work

Structure

The scheduler maintains a priority queue of tasks, each annotated with risk level, dependencies, resource requirements, and policy status. It selects the next task to execute based on a scoring function that balances urgency, readiness, and governance compliance.

flowchart TD
  Q[Task Queue] --> Score[Score: urgency × readiness × policy]
  Score --> Check{Policy Gate}
  Check -->|Cleared| Exec[Execute]
  Check -->|Blocked| Wait[Wait for Approval]
  Check -->|Denied| Skip[Skip / Escalate]
  Wait -->|Approved| Exec
  Exec --> Next[Next Task]
  Skip --> Next

Dynamics

The scheduler continuously re-evaluates the queue as new tasks arrive, existing tasks complete, and policy decisions are returned. A task that was blocked may become ready when approval arrives. A task that was ready may be preempted by a higher-priority arrival. The scoring function runs on every cycle, not once at submission.
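
A minimal selection function over the annotated queue. The scoring (plain urgency among policy-cleared, ready tasks) is one illustrative choice; real scoring functions also weigh readiness, age, and resource cost:

```python
def next_task(queue: list):
    """Pick the highest-urgency task that is ready and cleared by policy.

    Blocked or policy-gated tasks stay in the queue; they become eligible
    when approval arrives and the queue is re-evaluated.
    """
    runnable = [t for t in queue if t["policy"] == "cleared" and t["ready"]]
    if not runnable:
        return None
    return max(runnable, key=lambda t: t["urgency"])
```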

Benefits

High-priority safe work proceeds immediately. Risky work waits for appropriate approval without blocking the rest.

Tradeoffs

Scheduling logic adds complexity. Poor priority functions lead to starvation of important work.

Failure Modes

Priority inversion — a low-risk, low-priority task runs while a high-priority but policy-gated task starves waiting for approval. The scoring function over-weights urgency, causing risky work to bypass governance. Scheduling overhead dominates when the task queue is small and contention is low.

Permission Gate, Staged Autonomy


Result Consolidator

Intent

Synthesize outputs from multiple workers into a coherent, unified result.

Context

When a task is decomposed and delegated to multiple workers, their individual outputs must be combined. Different workers may produce overlapping, complementary, or conflicting results.

Forces

  • Workers operate independently with partial views
  • Results may conflict
  • Simple concatenation is rarely sufficient

Structure

The consolidator collects worker outputs, identifies overlaps and conflicts, resolves contradictions (or flags them for escalation), and synthesizes a unified result that addresses the original intent.

flowchart LR
  W1[Worker 1\nResult] --> Col[Consolidator]
  W2[Worker 2\nResult] --> Col
  W3[Worker 3\nResult] --> Col
  Col --> Ov{Overlaps?}
  Ov -->|Yes| Merge[Merge & Deduplicate]
  Ov -->|No| Combine[Combine]
  Col --> Conf{Conflicts?}
  Conf -->|Yes| Resolve[Resolve or Escalate]
  Conf -->|No| Pass[Pass Through]
  Merge --> Synth[Synthesize]
  Combine --> Synth
  Resolve --> Synth
  Pass --> Synth
  Synth --> Out[Unified Result]

Dynamics

Consolidation begins when all expected workers complete or when a timeout forces partial consolidation. The consolidator first aligns outputs to the original intent, then performs deduplication, conflict detection, and synthesis in a single pass. When conflicts are irreconcilable, the consolidator produces a result with explicit caveats rather than silently choosing a side.

Benefits

Coherent output despite distributed execution. Contradictions are surfaced, not hidden.

Tradeoffs

Consolidation itself requires model invocations and context. Complex consolidation can be as expensive as the original work.

Failure Modes

The consolidator silently drops a worker’s output because it does not fit the expected format. Contradictions are resolved by choosing the last result rather than the best, hiding minority viewpoints. Partial consolidation under timeout produces an incomplete result that appears complete.

Subagent as Process, Reviewer Process


Reflective Retry

Intent

When an action fails, analyze the failure before retrying — do not simply retry blindly.

Context

Failures in agentic systems are common: tools return errors, models produce invalid output, context is insufficient. Blind retries waste resources and often repeat the same failure.

Forces

  • Some failures are transient (network errors) — simple retry works
  • Some failures are structural (wrong approach) — retrying the same way is futile
  • Distinguishing between them requires reasoning

Structure

On failure, the kernel (or worker) analyzes the error: What went wrong? Is it transient or structural? If transient, retry with backoff. If structural, modify the approach (different tool, different decomposition, more context) and try again.

flowchart TD
  Fail[Action Failed] --> Analyze{Analyze Failure}
  Analyze -->|Transient| Retry[Retry with Backoff]
  Analyze -->|Structural| Modify[Modify Approach]
  Analyze -->|Unknown| Escalate[Escalate]
  Retry --> Result{Success?}
  Modify --> Result
  Result -->|Yes| Done[Continue]
  Result -->|No| Count{Retry Budget?}
  Count -->|Remaining| Analyze
  Count -->|Exhausted| Escalate

Dynamics

The first failure triggers analysis, not retry. The analysis classifies the failure and selects a strategy. Each subsequent failure feeds back into analysis with accumulated failure history, enabling the system to detect patterns (e.g., three different transient errors suggest a systemic issue, not a transient one). The retry budget decreases with each attempt, and the analysis becomes more conservative as attempts are consumed.
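
The classification step can be sketched as a small function over the error and the accumulated history. The error names and the repetition threshold are illustrative assumptions:

```python
TRANSIENT = {"timeout", "rate_limited", "connection_reset"}

def classify_failure(error: str, history: list) -> str:
    """Classify a failure before deciding how to retry."""
    if history.count(error) >= 2:
        return "structural"      # third occurrence: treat as systemic, not transient
    if error in TRANSIENT:
        return "transient"       # retry with backoff
    if error in {"invalid_input", "wrong_tool"}:
        return "structural"      # modify the approach
    return "unknown"             # escalate
```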

Benefits

Higher success rate with fewer wasted invocations. Structural problems are addressed, not repeated.

Tradeoffs

Reflection adds latency. The analysis itself can be wrong.

Failure Modes

The analysis misclassifies a structural failure as transient, wasting retry budget on an approach that will never succeed. The system modifies its approach so aggressively that the new approach is worse than the original. Reflection consumes significant context budget, leaving less room for the actual retry.

Recovery Process, Failure Containment


Execution Loop Supervisor

Intent

Monitor the kernel’s execution loop to prevent infinite cycles, resource exhaustion, and unproductive repetition.

Context

The kernel operates in a loop: plan, delegate, consolidate, adapt. Without supervision, this loop can run indefinitely — especially when the system retries failed tasks or refines plans endlessly.

Forces

  • Some tasks genuinely require many iterations
  • Some loops are pathological (no progress despite effort)
  • Arbitrary iteration limits are crude but necessary

Structure

The supervisor tracks loop metrics: iteration count, progress indicators, resource consumption, time elapsed. It triggers alerts or termination when:

  • Iteration count exceeds a threshold
  • No measurable progress is made across N iterations
  • Resource budget is exhausted

flowchart TD
  Loop[Kernel Loop Cycle] --> Track[Track Metrics]
  Track --> P{Progress?}
  P -->|Yes| Budget{Budget OK?}
  P -->|No| Stall[Stall Counter + 1]
  Budget -->|Yes| Continue[Continue Loop]
  Budget -->|No| Terminate[Terminate + Log]
  Stall --> Thresh{Stall > N?}
  Thresh -->|No| Continue
  Thresh -->|Yes| Terminate

Dynamics

The supervisor runs as a parallel observer, not as a gate within the loop. It samples progress at each cycle boundary — comparing the current state board to the previous one. Progress is measured by concrete indicators: tasks completed, outputs produced, state changes recorded. If the loop is making progress (even slowly), the supervisor permits continuation. If the loop is cycling without state change, the stall counter increments. Termination includes a diagnostic snapshot: last N iterations, resource consumption, and the state at which progress stalled.
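
The stall-detection core of that observer fits in a few lines. The `state` argument can be any comparable snapshot of the task board; the thresholds are illustrative:

```python
class LoopSupervisor:
    """Samples the state at each cycle boundary and detects stalls."""

    def __init__(self, max_iterations: int = 50, max_stalls: int = 3):
        self.max_iterations = max_iterations
        self.max_stalls = max_stalls
        self.iterations = 0
        self.stalls = 0
        self.last_state = None

    def observe(self, state) -> str:
        self.iterations += 1
        if state == self.last_state:
            self.stalls += 1         # no state change: the loop is cycling
        else:
            self.stalls = 0          # progress resets the stall counter
        self.last_state = state
        if self.iterations > self.max_iterations or self.stalls >= self.max_stalls:
            return "terminate"
        return "continue"
```

The coarseness warning from the failure modes applies directly here: if `state` ignores a dimension along which the loop oscillates, the supervisor will see spurious progress.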

Benefits

Prevents runaway execution. Provides diagnostic data for post-mortem analysis.

Tradeoffs

A strict supervisor may kill useful work that is simply slow. A lenient supervisor may waste resources.

Failure Modes

Progress indicators are too coarse — the loop appears to make progress (new outputs generated) but is actually oscillating between equivalent states. The supervisor terminates a task that was one iteration away from completion. The stall threshold is set uniformly when different task types have fundamentally different iteration patterns.

Resource Envelope, Context Budget Enforcement


Applicability Guide

Not every system needs every kernel pattern. Use the simplest approach that works.

Decision Matrix

| Pattern | Apply When | Do Not Apply When |
|---------|-----------|-------------------|
| Intent Router | Requests vary in complexity; you need different execution strategies for different task types | All requests are uniform — a single execution path handles everything |
| Planner-Executor Split | Tasks require 3+ steps with dependencies; plan quality matters and plans need to be inspectable | Tasks are single-step or follow a fixed script; the overhead of a separate planning phase exceeds the value |
| Policy-Aware Scheduler | Multiple tasks compete for resources; some tasks have governance constraints that block execution | Tasks are processed sequentially with no contention; all tasks have identical governance profiles |
| Result Consolidator | Multiple workers produce partial results that must be synthesized into a coherent whole | Each worker produces a complete, standalone result; simple concatenation is sufficient |
| Reflective Retry | Failures contain diagnostic information that can inform a different approach; the task is worth retrying | Failures are deterministic (same input always fails); the task is cheap to abandon and restart |
| Execution Loop Supervisor | Tasks involve open-ended iteration (research, optimization); runaway loops are a real risk | Tasks have a fixed number of steps with deterministic termination |

Start Simple

For a new system, start with only the Intent Router (to distinguish simple from complex requests) and the Planner-Executor Split (for complex requests). Add the other patterns as you observe specific failure modes:

  • Seeing resource contention? Add the Policy-Aware Scheduler.
  • Multi-worker results are incoherent? Add the Result Consolidator.
  • Workers fail and retry the same wrong approach? Add Reflective Retry.
  • Workers loop indefinitely? Add the Execution Loop Supervisor.

The cost of adding a pattern prematurely is unnecessary complexity. The cost of adding it too late is usually acceptable — integrate it when the need becomes clear.

Process Patterns

These patterns govern how workers are created, isolated, managed, and coordinated.


Subagent as Process

Intent

Model every unit of delegated work as an isolated process with its own context, capabilities, and lifecycle.

Context

When the kernel delegates work, the worker must operate independently without polluting the kernel’s context or other workers. This is the foundational pattern of the process fabric.

Forces

  • Workers need enough context to do their job
  • Workers must not see more than they need
  • The kernel must be able to inspect, manage, and terminate workers
  • Worker failures must not cascade

Structure

Each worker is spawned as an isolated process with:

  • A context sandbox (only relevant information)
  • A capability set (only authorized tools)
  • A task contract (inputs, constraints, outputs)
  • A lifecycle (creation, execution, completion/failure, termination)

flowchart LR
  K[Kernel] -->|spawn| W[Worker Process]
  subgraph W[Worker Process]
    Ctx[Context Sandbox]
    Cap[Capability Set]
    Con[Task Contract]
  end
  W -->|result| K
  W -->|failure| K
  W -.->|escalate| K

Dynamics

The kernel spawns the worker with a scoped context and capability set. The worker executes independently — it cannot access the kernel’s state or other workers’ contexts. On completion, the worker returns its result and is terminated. On failure, the worker returns a structured error. If the worker exceeds its resource envelope or timeout, the kernel terminates it forcefully and logs the event.
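A minimal sketch of this spawn-execute-terminate lifecycle in Python. The names here (TaskContract, Worker, spawn) are illustrative, not an API the book defines:

```python
from dataclasses import dataclass

@dataclass
class TaskContract:
    task: str
    inputs: dict
    constraints: list

@dataclass
class Worker:
    contract: TaskContract
    context: dict        # context sandbox: only curated information
    capabilities: set    # capability set: only authorized tools
    state: str = "created"

    def run(self, handler):
        self.state = "running"
        try:
            result = handler(self.contract, self.context, self.capabilities)
            self.state = "completed"
            return {"ok": True, "result": result}
        except Exception as exc:
            self.state = "failed"                     # failure does not cascade
            return {"ok": False, "error": str(exc)}   # structured error to kernel

def spawn(contract, context, capabilities):
    # the kernel copies the context so the worker cannot mutate kernel state
    return Worker(contract, dict(context), set(capabilities))

worker = spawn(TaskContract("summarize", {"doc_id": "pr-123"}, ["<= 500 words"]),
               {"diff_excerpt": "relevant lines only"}, {"file.read"})
outcome = worker.run(lambda c, ctx, caps: f"summary of {c.inputs['doc_id']}")
```

A real kernel would also enforce the resource envelope and timeout around `run`, terminating the worker forcefully when exceeded.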

Benefits

Clean isolation. Predictable resource usage. Observable execution. Governed capabilities.

Tradeoffs

Isolation adds overhead (context copying, capability scoping). Very simple tasks may not justify the overhead.

Failure Modes

Over-isolation — the worker lacks critical context because the sandbox was too restrictive, producing low-quality results without signaling that it was under-informed. Under-isolation — shared mutable state between workers causes one worker’s output to corrupt another’s reasoning.

Related Patterns

Context Sandbox, Scoped Worker Contract, Ephemeral Worker


Context Sandbox

Intent

Provide each worker with exactly the context it needs — no more, no less.

Context

Context windows are a finite resource. Giving a worker the entire conversation history wastes tokens and introduces noise. Giving it too little context leads to poor results.

Forces

  • More context → better understanding but more cost and potential confusion
  • Less context → faster and cheaper but higher risk of misunderstanding
  • The right amount depends on the task
  • Sensitive information may leak if context boundaries are not enforced

Structure

The kernel curates a context package for each worker: task definition, relevant retrieved information, relevant prior results, and any required constraints. Irrelevant history, other workers’ contexts, and system internals are excluded.

Dynamics

Context curation happens at spawn time. The kernel selects relevant memory from the memory plane, attaches the task contract, and excludes everything else. During execution, the worker may request additional context through memory-on-demand, which is fulfilled by the memory plane — not by direct access to the kernel’s state. The sandbox boundary is enforced, not advisory.
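A sketch of spawn-time curation in Python. The keyword-matching rule is a deliberately naive stand-in for real relevance selection, and every name here is hypothetical:

```python
def curate_context(task, memory, constraints,
                   exclude_keys=("history", "kernel_state")):
    """Build a context package at spawn time: task, relevant memory, constraints."""
    relevant = {k: v for k, v in memory.items()
                if k not in exclude_keys               # system internals excluded
                and task["topic"] in str(v).lower()}   # naive relevance filter
    return {"task": task, "memory": relevant, "constraints": constraints}

memory = {
    "history": "full conversation log ...",        # excluded: irrelevant noise
    "kernel_state": {"internal": True},            # excluded: system internals
    "pr_123": "billing refactor touching invoices",
    "pr_097": "docs typo fix",
}
package = curate_context({"topic": "billing"}, memory, ["read-only"])
```

The worker receives only `package`; anything it still lacks must come through memory-on-demand rather than direct kernel access.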

Benefits

Focused workers produce better results. Token budgets are respected. Information leakage between tasks is prevented.

Tradeoffs

Context curation requires effort and judgment. Mistakes in curation (missing critical information) lead to poor worker performance.

Failure Modes

The sandbox includes irrelevant context that distracts the worker (e.g., full conversation history injected “just in case”). Critical context is excluded because the curation logic lacks domain awareness. The worker infers missing information incorrectly rather than requesting it, producing confident but wrong results.

Related Patterns

Subagent as Process, Memory on Demand


Ephemeral Worker

Intent

For one-shot tasks, spawn a worker that is fully discarded after completion.

Context

Many subtasks are atomic: summarize this document, extract these fields, classify this ticket. They do not need persistent state or long-running context.

Forces

  • Retaining worker state between tasks adds complexity and memory pressure
  • Some tasks benefit from a clean slate — no contamination from prior work
  • Worker creation overhead must be low enough to justify per-task spawning

Structure

The worker is created, given its task and context, produces its output, and is immediately terminated. No state is preserved from the worker itself — though the kernel may store the result in memory.

Dynamics

Create → Execute → Return result → Terminate. The entire lifecycle is a single pass with no state carried forward. If the same type of work recurs, a new ephemeral worker is spawned from scratch. The kernel’s memory plane provides any continuity needed — the worker itself is stateless.

Benefits

Minimal resource usage. No stale state. Clean lifecycle.

Tradeoffs

If the same type of work recurs, the lack of persistent state means the worker cannot learn from previous invocations. The kernel’s memory must compensate.

Failure Modes

Worker creation overhead becomes a bottleneck when hundreds of ephemeral workers are spawned per second. Results are lost if the kernel fails to capture them before the worker terminates. Ephemeral workers are used for tasks that actually need accumulated state, leading to repeated mistakes.

Related Patterns

Context Sandbox, Subagent as Process


Scoped Worker Contract

Intent

Define a formal contract between the kernel and each worker specifying inputs, constraints, capabilities, outputs, and failure handling.

Context

Without a clear contract, workers operate on implicit assumptions. They may exceed their scope, use unauthorized tools, or produce output in unexpected formats.

Forces

  • Implicit contracts are flexible but lead to unpredictable behavior
  • Overly rigid contracts constrain workers that need adaptive reasoning
  • The contract must be machine-readable for enforcement, not just human-readable for documentation

Structure

flowchart TD
  subgraph Contract["Worker Contract"]
    direction TB
    Task["Task: Summarize PR changes"]
    Inputs["Inputs: diff, files_changed"]
    Cap["Capabilities: file.read, git.diff"]
    Con["Constraints: No file modifications, \u2264 500 words"]
    Out["Output: summary, risk_assessment"]
    Fail["On Failure: Return partial + explanation"]
    Time["Timeout: 30 seconds"]
  end
  K[Kernel] -->|defines| Contract
  Contract -->|governs| W[Worker]
  W -->|validated output| K

Dynamics

The kernel constructs the contract at delegation time. The worker receives it as part of its context sandbox. Upon completion, the kernel validates the output against the contract — checking type conformance, constraint compliance, and capability usage. Violations are logged and may trigger re-execution or escalation.
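A sketch of contract construction and kernel-side validation at completion time. The field names and checks are illustrative assumptions, not a schema from the book:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Contract:
    task: str
    capabilities: frozenset
    max_words: int
    timeout_s: float
    output_fields: tuple

def validate(contract, output, used_capabilities):
    """Check type conformance, constraint compliance, and capability usage."""
    violations = []
    missing = [f for f in contract.output_fields if f not in output]
    if missing:
        violations.append(f"missing fields: {missing}")
    if len(str(output.get("summary", "")).split()) > contract.max_words:
        violations.append(f"summary exceeds {contract.max_words} words")
    unauthorized = used_capabilities - contract.capabilities
    if unauthorized:
        violations.append(f"unauthorized capabilities: {sorted(unauthorized)}")
    return violations

c = Contract("Summarize PR changes", frozenset({"file.read", "git.diff"}),
             500, 30.0, ("summary", "risk_assessment"))
ok = validate(c, {"summary": "touches auth", "risk_assessment": "low"},
              {"file.read"})
bad = validate(c, {"summary": "touches auth"}, {"file.write"})
```

Non-empty violation lists would be logged and could trigger re-execution or escalation.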

Benefits

Explicit expectations. Verifiable compliance. Clear failure semantics.

Tradeoffs

Contract authoring adds overhead. Contracts that are too specific prevent the worker from exercising useful judgment. Contracts that are too vague provide little enforcement value.

Failure Modes

The contract specifies the output format but not the quality criteria, so the worker returns well-formatted but useless output. The timeout is too aggressive for the task’s actual complexity, causing premature termination. Capabilities are copy-pasted from a template rather than scoped to the specific task.

Related Patterns

Subagent as Process, Permission Gate


Parallel Specialist Swarm

Intent

Execute independent subtasks concurrently using specialized workers.

Context

When a complex request decomposes into independent subtasks (e.g., analyze code quality, check security vulnerabilities, review documentation), running them sequentially wastes time.

Forces

  • Parallelism reduces latency but increases peak resource consumption
  • Not all subtasks are truly independent — hidden dependencies cause failures
  • Consolidating parallel results is harder than consolidating sequential results

Structure

The kernel identifies independent subtasks, spawns a specialist worker for each, and runs them in parallel. Results are collected and consolidated when all workers complete (or when a timeout is reached).

flowchart TD
  K[Kernel] -->|spawn| W1[Security Specialist]
  K -->|spawn| W2[Quality Specialist]
  K -->|spawn| W3[Documentation Specialist]
  W1 -->|result| Col[Consolidator]
  W2 -->|result| Col
  W3 -->|result| Col
  Col --> Out[Unified Result]

Dynamics

Workers start simultaneously and execute independently. The kernel monitors each worker’s progress. As workers complete, their results are staged for consolidation. If one worker is significantly slower, the kernel may decide to consolidate partial results rather than waiting indefinitely. Failed workers trigger fallback strategies without blocking the other specialists.
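The spawn-monitor-consolidate flow can be sketched with Python's asyncio. The specialists and delays here are invented for illustration; the point is that a slow worker is dropped at the deadline rather than blocking the swarm:

```python
import asyncio

async def specialist(name, delay, finding):
    await asyncio.sleep(delay)          # stands in for real specialist work
    return {"specialist": name, "finding": finding}

async def run_swarm(specs, timeout=0.2):
    """Spawn specialists concurrently; consolidate whatever completes in time."""
    tasks = [asyncio.create_task(specialist(*s)) for s in specs]
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for t in pending:                   # consolidate partial results instead
        t.cancel()                      # of waiting indefinitely
    return [t.result() for t in done]

results = asyncio.run(run_swarm([
    ("security", 0.01, "no known CVEs"),
    ("quality",  0.01, "2 lint warnings"),
    ("docs",     1.0,  "README outdated"),   # too slow: dropped at the deadline
]))
```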

Benefits

Dramatic latency reduction for decomposable tasks. Each specialist is optimized for its domain.

Tradeoffs

Parallel execution increases peak resource usage. Consolidation of parallel results adds complexity. Not all tasks are truly independent.

Failure Modes

Subtasks assumed to be independent are actually coupled — a security specialist’s findings depend on the quality specialist’s code analysis, but both run without the other’s output. One slow specialist blocks the entire swarm because the consolidator waits for all results. The result consolidator cannot reconcile contradictory specialist opinions.

Related Patterns

Result Consolidator, Planner-Executor Split


Reviewer Process

Intent

Validate the output of a primary worker before returning it to the kernel.

Context

Worker output may be incorrect, incomplete, or inconsistent. A separate reviewer with different perspective or criteria can catch errors.

Forces

  • Primary workers are optimized for production, not self-critique
  • A separate reviewer brings fresh perspective but adds cost
  • Reviewing everything is wasteful; reviewing nothing is reckless

Structure

After a primary worker produces output, a reviewer process is spawned with: the original task, the output, and review criteria. The reviewer validates, critiques, and either approves or sends back for revision.

flowchart LR
  W[Primary Worker] -->|output| Rev[Reviewer]
  Rev -->|approved| K[Kernel]
  Rev -->|revision needed| W
  Rev -->|rejected| K

Dynamics

The reviewer operates on a different context than the primary worker — it sees the output against the original intent and review criteria, not the intermediate reasoning. This fresh perspective is the source of its value. Revision loops are bounded (typically 1–2 passes) to prevent infinite cycling. The reviewer’s critique is structured, not free-form, so the primary worker can act on specific feedback.
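A bounded revision loop might look like this sketch, where the `produce` and `review` stand-ins simulate a primary worker and a reviewer exchanging structured critique:

```python
def review_loop(produce, review, max_passes=2):
    """Bounded produce -> review cycle; structured critique feeds each revision."""
    feedback = None
    for _ in range(max_passes + 1):
        output = produce(feedback)
        critique = review(output)        # reviewer sees output vs. intent only
        if critique["approved"]:
            return {"status": "approved", "output": output}
        feedback = critique["issues"]    # structured, actionable feedback
    return {"status": "rejected", "output": output, "issues": feedback}

def produce(feedback):
    # a revision that acts on the reviewer's specific feedback
    return "summary v2 (with risks)" if feedback else "summary v1"

def review(output):
    if "risks" in output:
        return {"approved": True, "issues": []}
    return {"approved": False, "issues": ["missing risk assessment"]}

result = review_loop(produce, review)
```

The bound on passes is what prevents the oscillation failure mode described below: after `max_passes` revisions, disagreement is surfaced as a rejection rather than cycled forever.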

Benefits

Higher quality output. Catches errors that the primary worker is blind to.

Tradeoffs

Adds latency and cost. The reviewer itself can be wrong. Reviewing every task is overkill — use selectively for high-risk or high-value outputs.

Failure Modes

The reviewer rubber-stamps output without meaningful analysis because the review criteria are too vague. The revision loop oscillates — the reviewer and worker disagree on approach and trade revisions indefinitely. The reviewer applies different quality standards than the task actually requires, blocking acceptable output.

Related Patterns

Result Consolidator, Recovery Process


Recovery Process

Intent

Handle failed tasks with a specialized recovery strategy rather than simple retries.

Context

When a worker fails, the failure may require diagnosis, alternative approaches, or human escalation — not just repeating the same work.

Forces

  • Simple retry works for transient failures but wastes resources on structural ones
  • Recovery requires understanding why the failure occurred, not just that it occurred
  • Multiple recovery attempts consume resources and add latency

Structure

A recovery process is spawned with: the original task, the failure details, and previous attempts. It analyzes the failure, determines a recovery strategy (retry with modifications, use alternative approach, request more context, escalate), and executes it.

flowchart TD
  Fail[Worker Failed] --> Rec[Recovery Process]
  Rec --> Diag[Diagnose Failure]
  Diag -->|Transient| Retry[Retry with Modifications]
  Diag -->|Structural| Alt[Alternative Approach]
  Diag -->|Missing Info| Req[Request More Context]
  Diag -->|Beyond Scope| Esc[Escalate to Human]
  Retry --> Result{Success?}
  Alt --> Result
  Req --> Result
  Result -->|Yes| Done[Return Result]
  Result -->|No| Budget{Recovery Budget?}
  Budget -->|Remaining| Rec
  Budget -->|Exhausted| Esc

Dynamics

The recovery process receives the full failure context: the original task, the worker’s output (if any), error messages, and the number of prior attempts. It diagnoses the failure class and selects a strategy. Each recovery attempt is itself bounded by a contract and timeout. Failed recoveries feed back into the diagnosis with accumulated evidence. The recovery budget decreases with each attempt, biasing toward escalation as options narrow.
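A sketch of the diagnose-select-execute cycle with a decreasing budget. The failure classes and diagnosis heuristics are illustrative, and the simulated outcomes are hard-coded for the example:

```python
def diagnose(error):
    """Map an error message to a failure class (simplified heuristic)."""
    if "timeout" in error or "connection" in error:
        return "transient"
    if "missing" in error:
        return "missing_info"
    return "structural"

def recover(task, error, budget=3):
    """Pick a strategy per failure class; escalate as the budget runs out."""
    attempts = []
    while budget > 0:
        strategy = {"transient": "retry_modified",
                    "missing_info": "request_context",
                    "structural": "alternative_approach"}[diagnose(error)]
        attempts.append(strategy)
        budget -= 1
        if strategy == "alternative_approach":  # assume the alternative succeeds
            return {"status": "recovered", "attempts": attempts}
        error = "structural mismatch"           # simulate: the retry fails too
    return {"status": "escalated", "attempts": attempts}

out = recover("summarize", "connection timeout")
```

Each failed attempt feeds a new error back into diagnosis, so the strategy shifts as evidence accumulates.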

Benefits

Intelligent failure recovery. Avoids wasting resources on repeated failures.

Tradeoffs

Recovery logic is itself fallible. Multiple recovery attempts can consume significant resources.

Failure Modes

The recovery process misdiagnoses the failure and applies the wrong strategy repeatedly. Recovery consumes more resources than the original task was worth. The recovery process itself fails, creating a recursive failure that the supervisor must catch.

Related Patterns

Reflective Retry, Failure Containment, Human Escalation


Applicability Guide

Process patterns govern how workers are created, scoped, and managed. The right combination depends on your concurrency needs and failure tolerance.

Decision Matrix

Pattern | Apply When | Do Not Apply When
Subagent as Process | You need multiple workers with independent context and lifecycle | A single agent handles all work serially without context conflicts
Context Sandbox | Workers must be isolated from data and tools they should not access; security boundaries matter | You have a single trusted worker operating on a single task in a controlled environment
Ephemeral Worker | Tasks are stateless and independent; you want clean context for each task | Workers need to accumulate state across tasks (e.g., long-running monitoring); creation overhead is prohibitive
Scoped Worker Contract | Work is delegated with clear success criteria; you need verifiable outputs | The kernel executes everything directly; there is no delegation
Parallel Specialist Swarm | A task decomposes into independent, parallelizable subtasks assigned to different specialists | Tasks are inherently sequential; parallelism adds coordination cost without latency benefit
Reviewer Process | Output quality is critical and benefits from a separate review pass; adversarial checking adds value | The output has automated validation (tests, type-checks) that is sufficient; the review cost exceeds the quality gain
Recovery Process | Failures require diagnosis and alternative strategies, not just retries | Failures are transient (network timeouts) and simple retry is sufficient; or the task is cheap enough to abandon

The Minimum Viable Process Fabric

Start with Ephemeral Workers and Scoped Worker Contracts. These two patterns give you isolated execution with clear interfaces. Add the others as complexity demands:

  • Need security isolation? Add Context Sandbox.
  • Multiple independent subtasks? Add Parallel Specialist Swarm.
  • Quality-critical output? Add Reviewer Process.
  • Complex failure scenarios? Add Recovery Process.

Memory Patterns

These patterns govern how information is stored, retrieved, compressed, and managed across the memory plane.


Layered Memory

Intent

Organize memory into tiers with different characteristics: speed, capacity, retention, and purpose.

Context

A single flat memory (like a conversation history) cannot serve all needs. Current task context, recent interaction summaries, long-term knowledge, and audit records have fundamentally different access patterns and lifecycle requirements.

Forces

  • A single memory tier forces a tradeoff between size and speed
  • Different information has different lifetimes — task context is ephemeral, knowledge is persistent
  • Context windows are finite, so not everything can be loaded at once
  • Governance requires immutable audit records that must not be compressed or evicted

Structure

  • Working memory — Small, fast, ephemeral. Active task context.
  • Episodic memory — Medium, summarized. What happened recently.
  • Semantic memory — Large, indexed. What the system knows.
  • Audit memory — Append-only, immutable. What happened and why.

Each tier has its own storage, retrieval, and eviction strategies.

flowchart TD
  subgraph Working["Working Memory"]
    W["Fast · Small · Ephemeral"]
  end
  subgraph Episodic["Episodic Memory"]
    E["Medium · Summarized · Recent"]
  end
  subgraph Semantic["Semantic Memory"]
    S["Large · Indexed · Persistent"]
  end
  subgraph Audit["Audit Memory"]
    A["Append-only · Immutable"]
  end
  Working -->|compress| Episodic
  Episodic -->|extract| Semantic
  Working -->|log| Audit
  Episodic -->|log| Audit
  Semantic -.->|retrieve| Working
  Episodic -.->|retrieve| Working

Dynamics

During active execution, workers read from and write to working memory. At task boundaries, the kernel triggers compression: working memory is summarized into episodic memory. Over time, episodic entries are further distilled into semantic memory. Audit memory is written continuously and never compressed. Retrieval flows upward: when a worker needs historical context, the memory plane retrieves from episodic or semantic tiers and injects into working memory.
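The task-boundary compression flow might be sketched like this; the tier representations are deliberately simplified, and the names are illustrative:

```python
class LayeredMemory:
    """Four tiers with different retention; audit is append-only."""
    def __init__(self):
        self.working = {}    # fast, small, ephemeral
        self.episodic = []   # medium, summarized, recent
        self.semantic = {}   # large, indexed, persistent
        self.audit = []      # append-only, immutable

    def log(self, event):
        self.audit.append(event)       # never compressed or evicted

    def compress_working(self, task_id, summarize):
        """At a task boundary: summarize working memory into episodic memory."""
        summary = summarize(self.working)
        self.episodic.append({"task": task_id, "summary": summary})
        self.log({"op": "compress", "task": task_id})
        self.working.clear()           # working memory is ephemeral

mem = LayeredMemory()
mem.working.update({"step1": "read diff", "step2": "found auth change"})
mem.compress_working("pr-123", lambda w: "; ".join(w.values()))
```

A fuller implementation would also distill episodic entries into semantic memory over time and retrieve upward into working memory on demand.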

Benefits

Efficient context usage. Appropriate retention. Clear information lifecycle.

Tradeoffs

Tier boundaries add complexity. Information may be in the wrong tier at the wrong time — too soon for long-term storage, too late for working memory. Compression across tiers is lossy by design.

Failure Modes

All memory is treated as working memory, exhausting the context budget. Compression is too aggressive, discarding details that matter later. Tier boundaries are poorly defined, so information is duplicated across tiers without clear ownership.

Related Patterns

Memory on Demand, Compression Pipeline, Pointer Memory


Pointer Memory

Intent

Instead of inserting large content into context, store a pointer (reference) that can be resolved on demand.

Context

Context windows are finite. Embedding full documents, code files, or data sets consumes budget that could be used for reasoning. Often, only a small portion of a large artifact is relevant.

Forces

  • Large artifacts must be accessible but cannot fit in context
  • Embedding full content wastes tokens on irrelevant sections
  • Indirection adds latency but dramatically reduces context pressure
  • Pointers can become stale if the underlying content changes

Structure

Store metadata and a reference (file path, document ID, chunk identifier) in the context. When the worker needs the actual content, it retrieves just the relevant portion through the memory plane.

Dynamics

At context curation time, large artifacts are replaced with pointers: a brief description, metadata (size, type, last modified), and a resolution mechanism. The worker reasons over the pointer metadata to decide whether it needs the full content. If it does, it issues a retrieval request. The memory plane resolves the pointer, retrieves the relevant portion, and injects it into working memory. Multiple pointers may be resolved selectively — the worker does not need to resolve all of them.
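A sketch of pointer creation and selective resolution. The backing store, the `Pointer` type, and the line-level retrieval are assumptions made for illustration:

```python
from dataclasses import dataclass

# stands in for the memory plane's backing store of large artifacts
STORE = {"doc-42": "a very long design document\nSection 3: billing invariants hold per-account"}

@dataclass
class Pointer:
    ref: str             # resolution mechanism (here, a store key)
    description: str     # metadata the worker reasons over
    size_bytes: int

    def resolve(self, query=None):
        """Fetch only the relevant portion from the backing store on demand."""
        content = STORE[self.ref]
        if query:
            return [line for line in content.splitlines() if query in line]
        return content

ptr = Pointer("doc-42", "design doc for the billing service",
              len(STORE["doc-42"]))
# the worker reasons over the metadata first, then resolves selectively
relevant = ptr.resolve(query="billing")
```

Only `ptr`'s few metadata fields occupy the context until the worker decides the full content is needed.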

Benefits

Dramatically reduces context consumption. Enables work with artifacts much larger than the context window.

Tradeoffs

Retrieval adds latency. The pointer may become stale if the underlying content changes.

Failure Modes

The pointer’s metadata is insufficient for the worker to decide whether to resolve it, causing either unnecessary retrieval or missed critical content. The underlying content changes between pointer creation and resolution, producing inconsistent results. Workers resolve every pointer preemptively, negating the pattern’s benefits.

Related Patterns

Memory on Demand, Context Sandbox


Memory on Demand

Intent

Retrieve context from memory only when a worker actually needs it, not preemptively.

Context

Preloading all potentially relevant context into every worker wastes tokens and introduces noise. Many workers need only a small subset of available knowledge.

Forces

  • Preloading maximizes information but wastes context budget
  • On-demand retrieval minimizes waste but requires the worker to know what it needs
  • Retrieval round-trips add latency to the execution loop
  • The worker may not realize it is missing critical information

Structure

Workers are given their task and minimal context. When they identify a need for additional information, they issue a memory retrieval request. The memory plane fulfills the request and injects the relevant context.

Dynamics

The worker begins execution with a lean context. As it reasons about the task, it encounters knowledge gaps: unfamiliar terms, missing reference data, or insufficient background. It formulates a retrieval query and issues it to the memory plane. The memory plane searches across the appropriate tiers (semantic memory for knowledge, episodic memory for recent events) and returns the most relevant results. The worker integrates the retrieved context and continues. Multiple retrieval rounds may occur within a single task.
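The lean-start, retrieve-on-gap loop can be sketched as follows; the gap detection here is a naive stand-in for the worker's own reasoning about what it does not know:

```python
def execute_with_retrieval(task, context, retrieve, max_rounds=3):
    """Start with a lean context; retrieve from the memory plane only on a gap."""
    for _ in range(max_rounds):
        gap = next((term for term in task["needs"] if term not in context), None)
        if gap is None:
            return {"done": True, "context": context}
        context[gap] = retrieve(gap)     # one retrieval round-trip
    return {"done": False, "context": context}

# stands in for the memory plane's search across tiers
memory_plane = {"billing_schema": "invoice(id, amount, currency)",
                "refund_policy": "refunds within 30 days"}
out = execute_with_retrieval(
    {"needs": ["billing_schema", "refund_policy"]},
    context={},                          # minimal upfront context
    retrieve=memory_plane.__getitem__)
```

The `max_rounds` bound keeps retrieval round-trips from consuming more budget than preloading would have.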

Benefits

Minimal upfront context cost. Workers self-select what they need. Relevant information arrives when it is needed.

Tradeoffs

Workers must be able to recognize what they do not know. Multiple retrieval round-trips add latency.

Failure Modes

The worker does not realize it needs information and produces results based on incomplete knowledge. Retrieval returns irrelevant results because the query was poorly formed. Multiple retrieval rounds consume more total context than preloading would have — the cure is worse than the disease.

Related Patterns

Pointer Memory, Layered Memory


Operational State Board

Intent

Maintain a shared, structured view of the current operational state of the system.

Context

As work progresses, the system accumulates state: which tasks are complete, which are in progress, what results have been collected, what decisions have been made. Without a central state board, this information is scattered and easily lost.

Forces

  • Multiple workers and the kernel need a consistent view of progress
  • State scattered across worker contexts is inaccessible and unsynchronized
  • The state board must be structured enough to be machine-readable but flexible enough to accommodate different task types

Structure

A structured state object that tracks:

  • Active plan and its status
  • Completed tasks and their results
  • Pending tasks and their dependencies
  • Open questions and blockers
  • Resource usage

The kernel updates the state board after each step. Workers can read relevant portions.

Dynamics

The kernel writes to the state board at each cycle boundary: updating task status, recording results, flagging blockers. Workers receive a read-only view of the portions relevant to their task. The state board is the kernel’s primary source of truth for planning decisions — it answers questions like “what is done?”, “what is blocked?”, and “what has changed?” Adaptation decisions (replanning, escalation, termination) are driven by state board inspection.
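A sketch of a state board that can answer "what is done?" and "what is ready to run?"; the schema is an illustrative assumption:

```python
class StateBoard:
    def __init__(self, plan):
        self.plan = plan
        self.status = {t: "pending" for t in plan}
        self.results = {}
        self.blockers = []

    def complete(self, task, result):
        self.status[task] = "done"
        self.results[task] = result

    def block(self, task, reason):
        self.status[task] = "blocked"
        self.blockers.append((task, reason))

    def ready(self, deps):
        """Tasks whose dependencies are all done; drives the kernel's planning."""
        return [t for t in self.plan
                if self.status[t] == "pending"
                and all(self.status[d] == "done" for d in deps.get(t, []))]

board = StateBoard(["fetch", "analyze", "report"])
deps = {"analyze": ["fetch"], "report": ["analyze"]}
board.complete("fetch", {"rows": 120})
```

Workers would receive a read-only projection of this board scoped to their task, not the whole object.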

Benefits

Single source of truth for system state. Enables informed planning and adaptation.

Tradeoffs

Maintaining the state board adds overhead to every kernel cycle. A stale state board is worse than no state board — it produces confident but wrong planning decisions.

Failure Modes

The state board becomes a bottleneck when many workers attempt concurrent updates. Stale entries mislead the kernel into replanning based on outdated information. The board grows unboundedly when completed task entries are never archived.

Related Patterns

Active Plan Board, Execution Journal


Memory Reconciliation

Intent

When information from different sources or tiers conflicts, resolve the contradictions explicitly.

Context

Workers may produce conflicting results. Semantic memory may contain outdated facts. Episodic memory may record a decision that was later reversed. These contradictions must be caught and resolved.

Forces

  • Multiple independent workers may write conflicting information without awareness of each other
  • Newer information is usually but not always more accurate
  • Silent contradiction degrades downstream reasoning without visible symptoms

Structure

The memory plane detects contradictions (through embeddings, timestamps, or explicit flags). It presents the conflict to the kernel with context for each side. The kernel (or a specialist worker) resolves the contradiction and updates memory accordingly.

Dynamics

Contradiction detection runs at write time: when new information enters a memory tier, it is checked against existing entries. If a conflict is found, the write is held and the conflict is surfaced. The kernel evaluates the conflicting entries based on recency, source authority, and supporting evidence. Resolution may produce a merged entry, a replacement, or a flagged disagreement that is preserved for future reference. The resolution decision is recorded in audit memory.
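Write-time reconciliation might be sketched like this. The resolution heuristic (source authority, then recency) is one possible policy, not the book's prescription:

```python
def write_with_reconciliation(store, audit, key, entry, resolve):
    """Write-time check: conflicting entries are held and resolved explicitly."""
    existing = store.get(key)
    if existing and existing["value"] != entry["value"]:
        winner = resolve(existing, entry)       # kernel or specialist decides
        loser = existing if winner is entry else entry
        audit.append({"op": "reconcile", "key": key,
                      "kept": winner["value"], "dropped": loser["value"]})
        store[key] = winner
    else:
        store[key] = entry

def prefer_recent_authoritative(a, b):
    # resolution heuristic: higher source authority wins, then recency
    return max((a, b), key=lambda e: (e["authority"], e["ts"]))

store, audit = {}, []
write_with_reconciliation(store, audit, "api.version",
                          {"value": "v2", "ts": 1, "authority": 1},
                          prefer_recent_authoritative)
write_with_reconciliation(store, audit, "api.version",
                          {"value": "v3", "ts": 2, "authority": 2},
                          prefer_recent_authoritative)
```

Detection here is an exact value comparison; real systems would also catch semantic contradictions, e.g., via embeddings.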

Benefits

Memory stays consistent. Contradictions are surfaced rather than silently degrading quality.

Tradeoffs

Detection is imperfect. Resolution requires reasoning and costs resources.

Failure Modes

The detection logic has too narrow a scope, catching only exact duplicates while missing semantic contradictions. Resolution always favors the most recent entry, discarding earlier information that was actually correct. False positives in conflict detection create a flood of reconciliation tasks that overwhelm the kernel.

Related Patterns

Contradiction Pruning, Compression Pipeline


Compression Pipeline

Intent

Reduce the size of stored memories while preserving their essential information.

Context

Over time, episodic memory accumulates detailed records that are too large to fit into working memory. Raw records must be compressed into summaries that preserve key insights.

Forces

  • Uncompressed memory grows without bound, eventually exhausting storage and retrieval budgets
  • Compression is inherently lossy — some information will be discarded
  • Different types of information tolerate different levels of compression
  • Audit records must never be compressed

Structure

A pipeline that processes memories through stages:

  1. Filter — Remove noise and irrelevant details
  2. Summarize — Compress the remaining content into key points
  3. Index — Create searchable metadata for retrieval
  4. Store — Write the compressed memory to the appropriate tier
flowchart LR
  Raw[Raw Memory] --> Filter[Filter Noise]
  Filter --> Sum[Summarize]
  Sum --> Index[Index & Tag]
  Index --> Store[Store in Tier]

Dynamics

Compression runs asynchronously at tier boundaries: working memory is compressed into episodic memory after a task completes, and episodic memory is further compressed into semantic memory on a schedule or when a size threshold is reached. Each stage applies domain-aware logic — code summaries preserve function signatures and key changes, research summaries preserve findings and sources. The original raw memory may be retained briefly for recovery, then discarded.
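The four stages can be sketched as a single pass. The noise filter, summarizer, and tagger here are hypothetical stand-ins for domain-aware logic:

```python
def compress(records, is_noise, summarize, tags):
    """Filter -> Summarize -> Index -> (caller) Store, as one pipeline pass."""
    kept = [r for r in records if not is_noise(r)]       # 1. filter noise
    summary = summarize(kept)                            # 2. summarize (lossy)
    entry = {"summary": summary, "tags": tags(kept),     # 3. index for retrieval
             "source_count": len(records)}
    return entry                                         # 4. store in the tier

records = [
    "DEBUG heartbeat ok",
    "changed refund() to handle partial refunds",
    "DEBUG heartbeat ok",
    "added test for negative amounts",
]
entry = compress(
    records,
    is_noise=lambda r: r.startswith("DEBUG"),
    summarize=lambda rs: f"{len(rs)} changes: " + "; ".join(rs),
    tags=lambda rs: sorted({w for r in rs for w in r.split()
                            if w in {"refund()", "test"}}))
```

A code-aware summarizer would preserve function signatures and key changes, as the text above notes; the lambda here is only a placeholder.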

Benefits

Memory stays manageable. Historical context is preserved at appropriate granularity.

Tradeoffs

Compression is lossy. Important details may be discarded if the compression logic is poor.

Failure Modes

The summarizer discards critical details (e.g., a specific error message that would diagnose a recurring bug). Compression runs too infrequently, causing memory to grow until retrieval performance degrades. Compression runs too aggressively, losing detail before it has been fully utilized.

Related Patterns

Layered Memory, Memory Reconciliation


Contradiction Pruning

Intent

Proactively identify and remove contradicted or outdated information from memory.

Context

As the system operates, earlier beliefs or facts may be superseded by newer, more accurate information. Keeping contradicted information degrades reasoning quality.

Forces

  • Outdated information actively degrades reasoning when retrieved alongside current facts
  • Aggressive pruning risks discarding information that is still valid or useful for context
  • Pruning decisions should be auditable — knowing what was removed and why

Structure

Periodically (or on trigger), scan memory for entries that are contradicted by newer entries. Mark or remove the outdated entries. Record the pruning decision in audit memory.

Dynamics

Pruning runs as a background maintenance task. It scans semantic and episodic memory for entries that have been superseded: a newer entry with the same subject and higher confidence, an explicit correction recorded by a worker, or a time-based expiry. Contradicted entries are not deleted immediately — they are first marked as deprecated and excluded from retrieval. After a retention period, deprecated entries are archived or deleted. Every pruning decision is logged with the contradicting evidence.
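A sketch of the deprecate-then-archive flow; the supersession input and the retention window are illustrative assumptions:

```python
def prune(store, supersedes, now, retention_s=86_400):
    """Mark superseded entries deprecated; archive them after a retention period."""
    log = []
    for key, entry in store.items():
        newer = supersedes.get(key)
        if newer and entry["state"] == "active":
            entry["state"] = "deprecated"        # excluded from retrieval ...
            entry["deprecated_at"] = now
            log.append({"key": key, "evidence": newer})   # auditable decision
    expired = [k for k, e in store.items()
               if e["state"] == "deprecated"
               and now - e["deprecated_at"] > retention_s]
    for k in expired:
        store[k]["state"] = "archived"           # ... and archived/deleted later
    return log

store = {"rate_limit": {"state": "active", "deprecated_at": None}}
log = prune(store, {"rate_limit": "newer entry: limit raised to 1000 rps"}, now=0)
prune(store, {}, now=100_000)    # retention elapsed: the entry is archived
```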

Benefits

Cleaner memory. Better reasoning. Reduced confusion.

Tradeoffs

Aggressive pruning may remove information that turns out to be relevant later. The pruning logic itself must be reliable.

Failure Modes

The pruning logic removes a fact that was correct, based on a newer entry that was actually wrong (e.g., a hallucinated correction). Pruning runs so infrequently that contradicted information is retrieved many times before being cleaned. The pruning log grows large but is never reviewed, hiding systematic errors in the pruning logic.

Related Patterns

Memory Reconciliation, Compression Pipeline


Applicability Guide

Memory patterns range from essential (every system needs some form of layered memory) to advanced (contradiction pruning is for mature systems with rich accumulated knowledge).

Decision Matrix

Pattern | Apply When | Do Not Apply When
Layered Memory | The system operates across sessions; past context improves future performance | The system is stateless by design — each request starts from zero and that is acceptable
Pointer Memory | Full artifacts are too large for context windows; you need references that resolve on demand | All relevant data fits comfortably in the context window; indirection adds complexity without benefit
Memory on Demand | Context windows are precious; loading all potentially relevant memory upfront wastes tokens | The system has abundant context capacity or the memory corpus is small enough to include entirely
Operational State Board | Multiple workers need shared visibility into a task’s progress, findings, and decisions | A single worker handles the entire task; there is no shared state to coordinate
Memory Reconciliation | Multiple workers produce memories that may overlap or conflict; the system accumulates knowledge over time | Memories are append-only and never revised; or each memory source is authoritative in its domain with no overlaps
Compression Pipeline | The memory store grows unboundedly; older memories need summarization to remain useful without consuming storage | The memory corpus is bounded and manageable; or every detail must be preserved at full fidelity (audit requirements)
Contradiction Pruning | The system has long-lived memory where earlier facts are frequently superseded | The system’s domain has few contradictions; or memories are versioned and consumers handle contradiction themselves

Progressive Memory Architecture

Phase 1 (MVP): Layered Memory with working memory (per-task context) and a simple semantic store (embeddings over documents). This is sufficient for most single-session and short-lived multi-session systems.

Phase 2 (Growth): Add Memory on Demand and Pointer Memory as the corpus grows beyond what fits in context. Add Operational State Board when you introduce multi-worker coordination.

Phase 3 (Maturity): Add Compression Pipeline, Memory Reconciliation, and Contradiction Pruning as the system accumulates months of operational knowledge and stale information begins to degrade quality.

Operator Patterns

These patterns govern how the Agentic OS interacts with external tools, APIs, and services.


Tool as Operator

Intent

Wrap every external tool as a typed, permissioned, observable operator.

Context

Raw tool access — “give the model a function and let it call it” — lacks governance, typing, and observability. Wrapping tools as operators provides the control surface needed for production systems.

Forces

  • Direct tool access is fast and simple but ungoverned
  • Governance and observability add overhead that must be justified
  • Different tools have wildly different interfaces, reliability, and risk profiles
  • The system must treat all tools uniformly while respecting their differences

Structure

An operator wraps a tool with:

  • Type signature — Inputs and outputs are explicitly typed
  • Permission requirements — What capabilities are needed to invoke it
  • Risk classification — Low, medium, high
  • Observability — Invocations are logged with inputs, outputs, latency, and errors
  • Error handling — Failures are captured and returned as structured results
flowchart LR
  W[Worker] -->|invoke| Auth
  subgraph Op[Operator]
    Auth[Check Permissions] --> Log1[Log Request]
    Log1 --> Tool[External Tool]
    Tool --> Log2[Log Response]
    Tool -->|error| Err[Error Handler]
  end
  Log2 -->|result| W
  Err -->|error result| W

Dynamics

When a worker invokes an operator, the wrapper first validates permissions against the worker’s capability set. If authorized, it logs the request, invokes the underlying tool, logs the response (or error), and returns a typed result. The invocation is fully observable — latency, success/failure, input/output hashes — without exposing sensitive content.
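This sequence can be sketched in a few lines of Python. The names (`Operator`, `OperatorResult`) are illustrative, not from any specific framework:

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

@dataclass
class OperatorResult:
    ok: bool
    value: Any = None
    error: Optional[str] = None
    latency_ms: float = 0.0

@dataclass
class Operator:
    name: str
    required_capability: str     # e.g. "invoke:operator:search"
    risk: str                    # "low" | "medium" | "high"
    fn: Callable[..., Any]       # the wrapped external tool
    journal: list = field(default_factory=list)

    def invoke(self, caps: set, **inputs) -> OperatorResult:
        # 1. Validate permissions against the worker's capability set.
        if self.required_capability not in caps:
            self.journal.append((self.name, False, 0.0))
            return OperatorResult(ok=False, error="permission denied")
        # 2. Invoke the tool; failures become structured results, not exceptions.
        start = time.perf_counter()
        try:
            result = OperatorResult(ok=True, value=self.fn(**inputs))
        except Exception as exc:
            result = OperatorResult(ok=False, error=str(exc))
        result.latency_ms = (time.perf_counter() - start) * 1000
        # 3. Log outcome and latency, not raw content.
        self.journal.append((self.name, result.ok, round(result.latency_ms, 2)))
        return result
```

Note that the journal records only the outcome and latency; logging input/output hashes rather than content is one way to keep the trace observable without persisting sensitive data.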

Benefits

Governed, observable, reliable tool access. Consistent interface regardless of underlying tool implementation.

Tradeoffs

Wrapper overhead adds latency to every tool invocation. Maintaining operator wrappers requires effort when underlying tools change their interfaces.

Failure Modes

The wrapper obscures tool-specific error details behind a generic error type, making diagnosis difficult. Permission checks are either too coarse, blocking legitimate use, or too permissive, allowing unauthorized access. Observability logging captures sensitive data that should not be persisted.

Operator Registry, Operator Isolation, Skill over Operators


Operator Registry

Intent

Maintain a central catalog of all available operators with their metadata, permissions, and status.

Context

As the number of available tools grows, the system needs a way to discover, select, and manage them. Without a registry, tool selection is ad-hoc and ungoverned.

Forces

  • Tool sprawl — dozens of tools with overlapping capabilities
  • Workers need to discover available tools dynamically, not through hardcoded references
  • The registry must be the single source of truth for tool availability and governance

Structure

A registry that stores for each operator:

  • Name and description
  • Type signature
  • Permission requirements
  • Risk classification
  • Status (active, deprecated, disabled)
  • Usage metrics

The kernel consults the registry when deciding which operators to make available to a worker.

Dynamics

At worker spawn time, the kernel queries the registry for operators that match the worker’s task requirements and capability set. The registry returns only active, authorized operators. At runtime, operators may be added, deprecated, or disabled without restarting the system. Usage metrics (invocation count, error rate, latency) are updated after each invocation and inform future selection decisions.
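A minimal registry sketch, assuming the entry fields listed above (the class and method names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class OperatorEntry:
    name: str
    description: str
    required_capability: str
    risk: str
    status: str = "active"       # active | deprecated | disabled
    invocations: int = 0
    errors: int = 0

class OperatorRegistry:
    def __init__(self):
        self._entries = {}

    def register(self, entry: OperatorEntry) -> None:
        self._entries[entry.name] = entry

    def set_status(self, name: str, status: str) -> None:
        # Operators can be deprecated or disabled without a restart.
        self._entries[name].status = status

    def available_for(self, caps: set) -> list:
        # Called by the kernel at worker spawn time: only active,
        # authorized operators are exposed to the worker.
        return sorted(e.name for e in self._entries.values()
                      if e.status == "active" and e.required_capability in caps)

    def record_usage(self, name: str, ok: bool) -> None:
        entry = self._entries[name]
        entry.invocations += 1
        entry.errors += 0 if ok else 1
```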

Benefits

Central governance point. Dynamic capability management. Clear documentation.

Tradeoffs

The registry becomes a single point of failure for tool discovery. Registry maintenance requires discipline — stale entries mislead workers.

Failure Modes

The registry contains stale entries for tools that no longer exist, causing invocation failures. Deprecated operators are still selected because the replacement was not registered. The registry grows to include hundreds of operators, making tool selection noisy and imprecise.

Tool as Operator, Capability-Based Access


Skill over Operators

Intent

Compose multiple operators into a higher-level reusable recipe called a skill.

Context

Many real-world tasks require a specific sequence of tool invocations with logic connecting them. Rather than having the model improvise this sequence every time, encode it as a skill.

Structure

A skill is a named, tested recipe that:

  • Combines specific operators in a defined sequence or graph
  • Includes logic for handling intermediate results
  • Has its own input/output contract
  • Is registered and versioned

Example:

flowchart LR
  subgraph Skill["Skill: Code Review"]
    direction TB
    S1["1. git.diff\nGet changes"]
    S2["2. file.read\nRead changed files"]
    S3["3. analyze\nAssess quality & risks"]
    S4["4. comment.create\nPost review feedback"]
    S1 --> S2 --> S3 --> S4
  end
  W[Worker] -->|invoke| Skill
  Skill -->|result| W

Dynamics

When a worker invokes a skill, the skill engine steps through the defined sequence, invoking each operator and passing results forward. At each step, the engine evaluates success criteria before proceeding. If a step fails, the skill’s error handling logic determines the recovery strategy (retry, skip, abort). Skills are versioned — updating a skill creates a new version while preserving the previous one. Workers always invoke a specific skill version or the latest.
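A stripped-down skill engine might look like the following sketch (retry-then-abort is only one of the recovery strategies mentioned above; `run_skill` and `SkillError` are invented names):

```python
class SkillError(Exception):
    pass

def run_skill(steps, initial, max_retries=1):
    """Execute a skill: a named, ordered sequence of steps, passing
    each step's result forward as the next step's input.

    steps: list of (name, callable) pairs.
    """
    value = initial
    for name, step in steps:
        attempt = 0
        while True:
            try:
                value = step(value)
                break
            except Exception as exc:
                attempt += 1
                if attempt > max_retries:
                    # Abort with the failing step named, so the caller
                    # knows exactly where the recipe broke.
                    raise SkillError(f"step '{name}' failed: {exc}") from exc
    return value
```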

Benefits

Consistency and reliability. Tested workflows. Reusable across contexts.

Tradeoffs

Skills are less flexible than improvised workflows. They must be maintained as tools and APIs evolve.

Failure Modes

A skill encodes an operator sequence that worked at creation time but breaks after an operator’s interface changes. Skills are too rigid, forcing workers through unnecessary steps. Skills are too numerous and overlapping, making it unclear which skill to use for a given task.

Composable Operator Chain, Patternized Skills


Composable Operator Chain

Intent

Allow operators to be chained into pipelines where the output of one becomes the input of the next.

Context

Many tasks are naturally pipelines: fetch → transform → validate → store. Expressing these as chains makes them composable and reusable.

Forces

  • Multi-step operations need clear data flow between stages
  • Tight coupling between stages prevents reuse
  • Each stage in a chain may fail independently

Structure

Operators expose typed inputs and outputs. The system matches output types to input types, forming a pipeline. Each step in the chain is independently observable and governable.

flowchart LR
  A[Fetch] -->|data| B[Transform]
  B -->|transformed| C[Validate]
  C -->|valid| D[Store]
  C -->|invalid| E[Error Handler]

Dynamics

The chain executes sequentially. Each operator receives the previous operator’s output as its input. Type checking at chain boundaries catches mismatches before invocation. If any operator fails, the chain stops and reports which stage failed, with the partial results collected so far. Chains can be defined declaratively and stored as reusable recipes.
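The key moves — type checking at definition time, partial results on mid-chain failure — can be illustrated with a small sketch (all names here are hypothetical):

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ChainOp:
    name: str
    in_type: type
    out_type: type
    fn: Callable[[Any], Any]

def build_chain(ops):
    # Type-check stage boundaries at definition time, not at invocation.
    for a, b in zip(ops, ops[1:]):
        if a.out_type is not b.in_type:
            raise TypeError(f"{a.name} -> {b.name}: "
                            f"{a.out_type.__name__} != {b.in_type.__name__}")

    def run(value):
        partial = []
        for op in ops:
            try:
                value = op.fn(value)
            except Exception as exc:
                # Report which stage failed, with results collected so far.
                return {"ok": False, "failed_at": op.name,
                        "error": str(exc), "partial": partial}
            partial.append((op.name, value))
        return {"ok": True, "value": value, "partial": partial}
    return run
```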

Benefits

Clean data flow. Each operator is testable in isolation. Chains are composable — new pipelines from existing operators.

Tradeoffs

Chains are rigid — branching logic requires escaping the pipeline model. Long chains amplify latency from sequential execution.

Failure Modes

A type mismatch between stages causes a runtime error that should have been caught at chain definition time. A mid-chain failure loses the partial results from earlier stages. The chain abstraction is applied to operations that are not truly pipelines, forcing awkward data transformations between stages.

Skill over Operators, Tool as Operator


Operator Isolation

Intent

Ensure that a failure in one operator does not crash the system or corrupt other operators.

Context

External tools fail. APIs time out. Services return errors. These failures must be contained.

Forces

  • External tools are outside the system’s control — they can fail in unexpected ways
  • A single tool failure should not propagate to other tools or workers
  • Isolation must not add so much overhead that tool invocation becomes impractical

Structure

Each operator invocation runs in its own error boundary. Failures are captured as structured results (not exceptions) and returned to the caller. The caller (worker or kernel) decides how to handle the failure.

Dynamics

The operator wrapper intercepts all failure modes: exceptions, timeouts, malformed responses, and resource exhaustion. Each failure is converted to a structured error result with a failure category, message, and the partial output (if any). The wrapper enforces a timeout: if the underlying tool does not respond within the configured window, the invocation is terminated and a timeout error is returned. The worker receives the error result and decides: retry, fall back, or escalate.
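A thread-based sketch of this error boundary (a real boundary would also enforce memory limits and use process isolation so a hung tool does not pin a thread; `ToolError` and `invoke_isolated` are illustrative names):

```python
import concurrent.futures
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ToolError:
    category: str      # "timeout" | "exception"
    message: str
    partial: Any = None

def invoke_isolated(fn: Callable[[], Any], timeout_s: float):
    """Run a tool call inside an error boundary: exceptions and
    timeouts come back as structured data, never as raised errors."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            return ToolError("timeout", f"no response within {timeout_s}s")
        except Exception as exc:
            return ToolError("exception", str(exc))
```

The caller receives either the tool's value or a `ToolError` and decides whether to retry, fall back, or escalate.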

Benefits

System stability. Graceful degradation. Clear error handling paths.

Tradeoffs

Isolation overhead adds latency. Structured error conversion may lose tool-specific diagnostic details.

Failure Modes

The isolation boundary leaks — a tool that consumes excessive memory affects other operators sharing the same process. Timeout values are set too aggressively, killing tools that are slow but would eventually succeed. The structured error result lacks enough detail for the worker to choose the right recovery strategy.

Operator Fallback, Failure Containment


Operator Fallback

Intent

When a primary operator fails, automatically attempt an alternative operator that can fulfill the same need.

Context

External services are unreliable. Having a fallback operator reduces the impact of failures.

Structure

The operator registry associates fallback operators with primary operators:

flowchart LR
  W[Worker] -->|invoke| P["Primary: search_web\n(API A)"]
  P -->|success| R[Result]
  P -->|timeout / 5xx| F["Fallback: search_web_alt\n(API B)"]
  F -->|success| R
  F -->|failure| Err[Combined Error]

Dynamics

When the primary operator returns a failure matching the fallback condition, the system automatically invokes the fallback operator with the same inputs. The fallback result is returned to the caller transparently. If the fallback also fails, the combined failure is reported. Fallback invocations are flagged in the execution journal so that persistent primary failures trigger operational alerts.
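The core of this dynamic fits in one function. A hedged sketch (the journal here is a plain list standing in for the execution journal):

```python
def invoke_with_fallback(primary, fallback, journal, **inputs):
    """Invoke the primary operator; on failure, retry the same inputs
    on the fallback. Fallback use is flagged in the journal so that
    persistent primary failures surface in operational monitoring."""
    try:
        return primary(**inputs)
    except Exception as primary_exc:
        journal.append({"event": "fallback_used", "reason": str(primary_exc)})
        try:
            return fallback(**inputs)
        except Exception as fallback_exc:
            # Both failed: report the combined failure, not just the last.
            raise RuntimeError(
                f"primary failed: {primary_exc}; fallback failed: {fallback_exc}"
            ) from fallback_exc
```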

Benefits

Higher reliability. Transparent to the caller.

Tradeoffs

Fallback operators may have different characteristics (latency, result quality). Managing fallback chains adds complexity.

Failure Modes

The fallback operator has subtly different semantics than the primary, producing results that appear correct but differ in important ways. Both primary and fallback fail, but the combined error message is confusing. Fallback masks a systemic primary failure, delaying investigation.

Operator Isolation, Tool as Operator


Resource-Aware Invocation

Intent

Consider resource costs (latency, tokens, API limits) when deciding which operator to invoke and how.

Context

Operators have costs: API rate limits, token consumption, latency, monetary cost. Ignoring these leads to budget exhaustion, throttling, or excessive latency.

Forces

  • Different operators have vastly different cost profiles
  • Budget constraints are real — rate limits, token budgets, monetary budgets
  • A cheaper operator may produce acceptable results for low-stakes tasks
  • Cost information must be available at decision time, not discovered after invocation

Structure

The kernel tracks resource budgets and operator costs. Before invoking an operator, it checks:

  • Is the budget sufficient?
  • Is the operator within rate limits?
  • Is a cheaper alternative available?
  • Should this invocation be batched or deferred?

Dynamics

The kernel maintains a real-time resource ledger: tokens consumed, API calls made, cost incurred. Before each operator invocation, the scheduler consults the ledger and the operator’s cost profile. If the budget is sufficient, the invocation proceeds and the ledger is updated. If the budget is low, the scheduler may select a cheaper alternative, batch the invocation with others, or defer it to a lower-priority queue. Rate-limited operators include a backoff window in their cost profile.
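A minimal ledger-plus-selection sketch, assuming cost estimates are available per operator (names are illustrative; note the lock, which keeps parallel workers from collectively exceeding the budget):

```python
import threading
from dataclasses import dataclass, field

@dataclass
class ResourceLedger:
    budget: float
    spent: float = 0.0
    lock: threading.Lock = field(default_factory=threading.Lock)

    def try_spend(self, cost: float) -> bool:
        # Check-and-debit atomically so concurrent invocations
        # cannot race past the budget.
        with self.lock:
            if self.spent + cost > self.budget:
                return False
            self.spent += cost
            return True

def choose_operator(ledger, candidates):
    """candidates: (name, estimated_cost) pairs, ordered by preference.
    Returns the first affordable operator, or None to defer."""
    for name, cost in candidates:
        if ledger.try_spend(cost):
            return name
    return None
```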

Benefits

Sustainable execution. Cost control. Graceful behavior under resource pressure.

Tradeoffs

Cost tracking adds overhead. Cost estimates may be inaccurate, leading to either over-cautious scheduling or budget overruns.

Failure Modes

Cost profiles are stale — the operator’s actual cost has changed but the registry has not been updated. The system defers critical invocations to save budget, degrading result quality. Resource accounting is not thread-safe, allowing parallel workers to collectively exceed the budget.

Resource Envelope, Context Budget Enforcement


Applicability Guide

Operator patterns structure how the system interacts with the external world. Start with the minimum tooling surface and expand deliberately.

Decision Matrix

| Pattern | Apply When | Do Not Apply When |
| --- | --- | --- |
| Tool as Operator | The system needs to interact with external services, files, or APIs through a uniform interface | The system is purely reasoning-based with no external side effects |
| Operator Registry | You have 5+ tools; workers need to discover tools dynamically; governance must scope tool access | You have 1-2 hardcoded tools that every worker always uses |
| Skill over Operators | Recurring multi-step workflows benefit from packaged instructions, strategies, and tool selections | Every task is novel; no workflow repeats enough to justify packaging |
| Composable Operator Chain | Multi-stage operations where one tool’s output feeds the next (e.g., search → fetch → extract) | Each tool invocation is independent; composition adds indirection without value |
| Operator Isolation | Tool failures should not crash the worker; untrusted tools need sandboxing | All tools are trusted, well-tested, and fast; isolation overhead is not justified |
| Operator Fallback | Primary tools have reliability issues; alternative providers exist | Each tool is unique with no equivalent alternative; or reliability is already sufficient |
| Resource-Aware Invocation | Tools have rate limits, costs, or latency constraints that require budgeting | Tools are free, unlimited, and fast; cost tracking adds overhead without benefit |

Start Here

Every system needs Tool as Operator (a structured interface to external capabilities). Add the Operator Registry once you have more than a handful of tools. Add Skill over Operators when you notice teams repeatedly assembling the same tool combinations for similar tasks. The other patterns respond to operational pressures — add them when you observe the specific problem they solve.

Governance Patterns

These patterns establish how authority, permissions, and accountability flow through an Agentic OS. Without governance, agentic systems become unpredictable liabilities. With too much governance, they become paralyzed. These patterns find the balance.


Capability-Based Access

Intent

Grant agents only the specific capabilities they need, rather than broad role-based permissions.

Context

When agents invoke operators, access memory, or spawn subprocesses, they require permissions. Traditional role-based access is too coarse — an agent with the “developer” role might have access to delete production databases when it only needs to read a configuration file.

Forces

  • Broad permissions increase blast radius of failures
  • Fine-grained permissions increase configuration complexity
  • Agents may need different capabilities for different tasks

Structure

Each agent or process receives a capability set — a list of specific, granular permissions such as read:file:/config/*, write:memory:working, invoke:operator:search. Capabilities are scoped by resource, action, and optionally by time or invocation count.

Dynamics

At process creation, the kernel assigns a capability set based on the task contract. Operators check capabilities before execution. Capabilities can be narrowed when delegating to subprocesses but never widened without explicit escalation.
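The two essential operations — matching a request against a capability set, and narrowing (never widening) on delegation — can be sketched with glob-style patterns like the `read:file:/config/*` example above:

```python
from fnmatch import fnmatch

def allows(capabilities: set, request: str) -> bool:
    """A capability such as 'read:file:/config/*' grants any request
    matching its pattern."""
    return any(fnmatch(request, cap) for cap in capabilities)

def narrow(parent: set, child: set) -> set:
    """Delegation may narrow a capability set, never widen it:
    every capability in the child set must already be granted
    (directly or by pattern) in the parent set."""
    widened = {c for c in child if not allows(parent, c)}
    if widened:
        raise PermissionError(f"delegation would widen capabilities: {widened}")
    return child
```

Time- or count-scoped capabilities would add an expiry field to each entry; the matching logic stays the same.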

Benefits

Minimal blast radius. Clear auditability. Composable security.

Tradeoffs

More configuration surface. Capability sets must be maintained and versioned.

Failure Modes

Over-permissive defaults defeat the pattern. Under-permissive defaults cause legitimate work to fail silently.

Least Privilege Agent, Permission Gate, Operator Isolation


Least Privilege Agent

Intent

Ensure every agent operates with the minimum permissions necessary to accomplish its assigned task.

Context

When spawning workers or delegating tasks, the kernel must decide how much access to grant. The default instinct is to give generous access to avoid failures. This instinct is dangerous.

Forces

  • Minimal permissions reduce risk but increase the chance of permission-related failures
  • Generous permissions simplify development but increase attack surface
  • Permission requirements may not be fully known in advance

Structure

Each task contract specifies required capabilities. The kernel grants exactly those capabilities and nothing more. If a worker discovers it needs additional access, it must request escalation through a permission gate rather than failing silently or improvising.

Dynamics

Start restricted. Escalate explicitly. Log every escalation. Review escalation patterns to refine default capability sets over time.

Benefits

Contained failure. Reduced risk. Escalation patterns reveal design gaps.

Tradeoffs

Initial development is slower. Workers may fail on edge cases that require unexpected permissions.

Failure Modes

Workers that silently degrade rather than requesting escalation. Blanket escalation approvals that defeat the pattern.

Capability-Based Access, Permission Gate, Human Escalation


Permission Gate

Intent

Create explicit checkpoints where execution pauses for authorization before crossing a trust boundary.

Context

Some actions carry consequences that are difficult or impossible to reverse — deploying code, sending emails, modifying financial records, deleting data. These actions should not flow automatically through the system.

Forces

  • Ungated actions are fast but risky
  • Too many gates create friction and fatigue
  • Gate placement requires understanding of consequence severity

Structure

A permission gate is a policy-defined checkpoint in the execution path. When an agent reaches a gate, execution suspends. The gate presents the proposed action, context, and risk assessment to the appropriate authority — which may be a human, a policy engine, or a higher-privilege agent. Execution resumes only upon approval.

Dynamics

Agent proposes action → Gate intercepts → Authority reviews → Approve or deny → Execution resumes or aborts. Denied actions are logged with reasoning. Approved actions are logged with the approver’s identity.
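A sketch of the gate itself, with risk-based routing as in the diagram below (the authority is just a callable here — a human prompt, policy engine, or higher-privilege agent in practice):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ProposedAction:
    name: str
    risk: str            # "low" | "high"
    context: str

def permission_gate(action: ProposedAction,
                    authority: Callable[[ProposedAction], bool],
                    log: list) -> bool:
    """Low-risk actions auto-approve with a log entry; high-risk
    actions suspend until the authority responds."""
    if action.risk == "low":
        log.append((action.name, "auto-approved", None))
        return True
    approved = authority(action)   # execution is suspended until this returns
    log.append((action.name, "approved" if approved else "denied", "authority"))
    return approved
```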

flowchart LR
  A[Agent Proposes Action] --> G{Permission Gate}
  G -->|Low Risk| Auto[Auto-Approve + Log]
  G -->|High Risk| Rev[Authority Reviews]
  Rev -->|Approved| Exec[Execute + Log]
  Rev -->|Denied| Block[Block + Log Reason]
  Auto --> Exec

Benefits

Reversible control over irreversible actions. Clear accountability chain. Adjustable friction based on risk.

Tradeoffs

Latency increases at gates. Approval fatigue can lead to rubber-stamping.

Failure Modes

Gates on trivial actions desensitize approvers. Missing gates on critical actions eliminate the safety net.

Capability-Based Access, Human Escalation, Risk-Tiered Execution


Human Escalation

Intent

Provide a structured mechanism for agents to transfer decision authority to a human when they encounter situations beyond their competence or permission.

Context

No agent is competent for all situations. Ambiguous intent, conflicting constraints, ethical dilemmas, high-stakes consequences, and novel situations all exceed what an agent should handle autonomously.

Forces

  • Escalating too often makes the system useless
  • Escalating too rarely makes the system dangerous
  • Escalation must preserve enough context for the human to decide effectively

Structure

The escalation package includes: the original intent, the decision point reached, the options considered, the agent’s recommendation (if any), relevant memory and context, and the specific question for the human. The human can approve, deny, redirect, or provide additional guidance.

Dynamics

Agent detects boundary → Packages context → Suspends execution → Human reviews → Human responds → Agent incorporates response and resumes. The agent should not silently proceed with a guess when escalation is warranted.

Benefits

Appropriate human authority. Preserved context. Explicit decision trail.

Tradeoffs

Requires responsive human operators. Escalation packaging consumes resources.

Failure Modes

Poor escalation packaging forces humans to reconstruct context. Unreachable humans block execution indefinitely.

Permission Gate, Staged Autonomy, Risk-Tiered Execution


Auditable Action

Intent

Ensure every significant action taken by the system is recorded with sufficient context to reconstruct the reasoning chain.

Context

In governed systems, it is not enough to know what happened. We must know why it happened, who authorized it, what information was available, and what alternatives were considered.

Forces

  • Comprehensive logging increases storage and processing costs
  • Insufficient logging makes incidents uninvestigatable
  • Log granularity must balance detail with noise

Structure

Every action record includes: timestamp, acting agent identity, the intent being served, the specific action taken, inputs consumed, outputs produced, capabilities exercised, approvals obtained, and the outcome. Actions are immutable once recorded.

Dynamics

Before execution: log intent and plan. During execution: log operator invocations and results. After execution: log outcome, duration, and resource consumption. On failure: log error context and recovery actions.

Benefits

Full reconstruction of decision chains. Compliance support. Debugging foundation. Trust building.

Tradeoffs

Storage costs. Performance overhead. Risk of logging sensitive data.

Failure Modes

Logging failures that silently drop records. Over-logging that makes meaningful patterns invisible.

Execution Journal, Signed Intent, Checkpoints and Rollback


Signed Intent

Intent

Attach cryptographic or structural proof of origin and authorization to every intent flowing through the system.

Context

As intents are decomposed, delegated, and executed across multiple agents, it becomes unclear who originally authorized the work and whether the current execution still aligns with the original request.

Forces

  • Unsigned intents can be forged or corrupted during delegation
  • Verification overhead increases with delegation depth
  • Deep delegation chains may mutate intent beyond recognition

Structure

The original intent receives a signature from its source (user, API caller, or orchestrator). As the kernel decomposes the intent into sub-intents, each sub-intent carries a reference to its parent and the original signed intent. Any agent can verify that its current work traces back to a legitimately authorized request.

Dynamics

User issues intent → Kernel signs and decomposes → Sub-intents carry parent chain → Workers verify chain before executing → Audit trail links every action to its origin.
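As a concrete (and deliberately simplified) sketch, HMAC signatures over canonicalized intents give the tamper-evident chain described above. A production system would use managed keys and asymmetric signatures; the shared key below is purely illustrative:

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"kernel-signing-key"   # illustrative only

def sign(intent: dict) -> dict:
    payload = json.dumps(intent, sort_keys=True).encode()
    sig = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {**intent, "sig": sig}

def derive(parent: dict, sub_goal: str) -> dict:
    # Sub-intents carry the parent's signature, so any worker can
    # trace its work back to the originally authorized request.
    return sign({"goal": sub_goal, "parent_sig": parent["sig"]})

def verify(signed: dict) -> bool:
    body = {k: v for k, v in signed.items() if k != "sig"}
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["sig"])
```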

Benefits

Tamper-evident intent chains. Origin traceability. Prevention of unauthorized intent injection.

Tradeoffs

Signature management complexity. Performance cost of verification.

Failure Modes

Broken signature chains when sub-intents are reconstructed. Performance degradation from redundant verification.

Auditable Action, Capability-Based Access


Risk-Tiered Execution

Intent

Apply different levels of scrutiny, approval, and safeguards based on the assessed risk of an action.

Context

Not all actions carry equal consequences. Reading a file is low-risk. Deploying to production is high-risk. Sending a financial transaction is critical-risk. Treating all actions identically — either with maximum scrutiny or minimum — is wasteful or dangerous.

Forces

  • Uniform high scrutiny creates unacceptable latency for routine work
  • Uniform low scrutiny creates unacceptable risk for critical work
  • Risk assessment itself can be wrong

Structure

Define risk tiers (e.g., routine, elevated, critical). Classify actions into tiers based on: reversibility, scope of impact, data sensitivity, and cost. Each tier maps to a governance response: routine actions proceed with logging only, elevated actions require policy check, critical actions require human approval.

Dynamics

Agent proposes action → Risk classifier assigns tier → Governance pipeline applies tier-appropriate controls → Action proceeds or gates. Risk classifications are refined over time based on incident analysis.
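A toy classifier makes the tier-to-control mapping concrete. The rules and thresholds below are illustrative; real classifiers are domain-specific and refined from incident analysis:

```python
def classify_risk(action: dict) -> str:
    """Tier by reversibility, scope, sensitivity, and cost."""
    if not action.get("reversible", True) or action.get("cost", 0) > 1000:
        return "critical"
    if action.get("scope") == "external" or action.get("sensitive", False):
        return "elevated"
    return "routine"

# Each tier maps to a governance response.
TIER_CONTROLS = {
    "routine": "log_only",
    "elevated": "policy_check",
    "critical": "human_approval",
}
```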

flowchart TD
  Act[Proposed Action] --> Class{Risk Classifier}
  Class -->|Routine| Log[Log Only]
  Class -->|Elevated| Pol[Policy Check]
  Class -->|Critical| Human[Human Approval]
  Log --> Exec[Execute]
  Pol -->|Pass| Exec
  Pol -->|Fail| Block[Block]
  Human -->|Approved| Exec
  Human -->|Denied| Block

Benefits

Appropriate friction. Fast routine operations. Protected critical operations.

Tradeoffs

Risk misclassification can under- or over-protect. Tier boundaries require ongoing calibration.

Failure Modes

Novel actions that don’t fit existing tiers default to the wrong level. Tier inflation where everything becomes “critical.”

Permission Gate, Human Escalation, Staged Autonomy


Applicability Guide

Governance patterns are not optional — but the depth of governance should match the risk profile of your domain. Over-governing a low-risk system wastes resources. Under-governing a high-risk system invites incidents.

Decision Matrix

| Pattern | Apply When | Do Not Apply When |
| --- | --- | --- |
| Capability-Based Access | Workers have different trust levels; principle of least privilege matters; you need to scope tool access per worker | All workers are fully trusted and run in a controlled environment with no sensitive resources |
| Least Privilege Agent | Workers should only access what they need; over-provisioning creates security risk | The system has a single worker operating in a sandbox where all resources are pre-scoped |
| Permission Gate | Some actions require human or policy approval before execution; irreversible actions need checkpoints | All actions are reversible, low-risk, and pre-approved; gates would add latency with no safety benefit |
| Human Escalation | The system encounters situations beyond its confidence or authority; a human needs to make the call | The system operates autonomously by design with no human available; or all decisions are within the system’s authority |
| Auditable Action | Compliance requires a record of what happened and why; debugging requires post-hoc analysis | The system is a throwaway prototype with no accountability requirements |
| Signed Intent | Multi-party authorization matters; you need to prove that a specific operator authorized a specific action | Single-user system with simple authentication; provenance is not a concern |
| Risk-Tiered Execution | Actions vary widely in risk (from reading a file to deleting a database); governance overhead should match risk | All actions have the same risk profile; a uniform policy is simpler and sufficient |

The Non-Negotiables

For any production system: Capability-Based Access, Auditable Action, and Human Escalation are non-negotiable. They cost little to implement and prevent the most common and harmful failure modes.

Permission Gate and Risk-Tiered Execution should be added as soon as the system performs any irreversible or externally visible actions.

Signed Intent and Least Privilege Agent are important for multi-user, multi-team, or regulated environments.

Runtime Patterns

These patterns govern how an Agentic OS manages execution state, tracks progress, handles failures, and controls resource consumption during live operation.


Active Plan Board

Intent

Maintain a shared, inspectable representation of the current execution plan that all participants can read and the kernel can update.

Context

When work is decomposed into multiple steps executed by different workers, no single agent holds the complete picture. Without a shared plan, workers operate blindly, the kernel cannot assess overall progress, and adaptation becomes guesswork.

Forces

  • A shared plan enables coordination but introduces a synchronization point
  • Plan detail must be sufficient for tracking but not so detailed that it becomes stale instantly
  • Plans change during execution — the board must support mutation

Structure

The active plan board is a structured document in working memory containing: the goal, current phase, ordered steps with status (pending, in-progress, completed, failed, skipped), dependencies between steps, assigned workers, and intermediate results. The kernel updates the board as work progresses.

Dynamics

Kernel creates initial plan board → Workers report progress → Kernel updates board → Board reflects reality → Kernel uses board to decide next actions and detect stalls. The board is the single source of truth for execution state.
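A minimal board sketch, assuming the fields listed under Structure (`PlanBoard.ready` is the query the kernel uses to decide what to run next; all names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    name: str
    status: str = "pending"   # pending | in-progress | completed | failed | skipped
    depends_on: list = field(default_factory=list)

@dataclass
class PlanBoard:
    goal: str
    steps: dict = field(default_factory=dict)

    def add(self, name: str, depends_on=()):
        self.steps[name] = PlanStep(name, depends_on=list(depends_on))

    def update(self, name: str, status: str) -> None:
        # Workers report progress here; the board stays the single
        # source of truth for execution state.
        self.steps[name].status = status

    def ready(self) -> list:
        """Pending steps whose dependencies have all completed."""
        return sorted(s.name for s in self.steps.values()
                      if s.status == "pending"
                      and all(self.steps[d].status == "completed"
                              for d in s.depends_on))
```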

Benefits

Transparency. Coordination without tight coupling. Debuggable execution. Human-inspectable progress.

Tradeoffs

Board maintenance overhead. Stale boards if updates are delayed.

Failure Modes

Workers that don’t report status, leaving the board inaccurate. Overly granular boards that consume more attention than the work itself.

Execution Journal, Planner-Executor Split


Execution Journal

Intent

Maintain a chronological, append-only record of all significant events during execution for debugging, auditing, and learning.

Context

When problems occur — a wrong result, a timeout, an unexpected failure — the team needs to reconstruct what happened. In traditional software, we have logs. In agentic systems, we need richer records that capture not just what happened but the reasoning behind decisions.

Forces

  • Comprehensive journals are expensive in storage and attention
  • Sparse journals leave gaps that make debugging impossible
  • Journal format must balance human readability with machine parseability

Structure

Each journal entry contains: timestamp, event type (decision, action, observation, error, escalation), acting agent, context summary, inputs, outputs, duration, and links to related entries. Entries are immutable and append-only.

Dynamics

Every significant event is journaled as it occurs. The kernel can query the journal to detect patterns (repeated failures, escalation spirals, resource exhaustion). Post-execution analysis uses the journal to extract lessons.
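An in-memory sketch of such a journal (a real one would persist entries durably; serializing on write is one simple way to keep entries immutable):

```python
import json
import time

class ExecutionJournal:
    """Append-only journal; entries are serialized at write time so
    they cannot be mutated after the fact."""

    def __init__(self):
        self._entries = []

    def append(self, event_type: str, agent: str, **fields) -> None:
        entry = {"ts": time.time(), "type": event_type, "agent": agent, **fields}
        self._entries.append(json.dumps(entry, sort_keys=True))

    def query(self, event_type: str) -> list:
        # The kernel queries by event type to detect patterns such as
        # repeated failures or escalation spirals.
        return [e for e in map(json.loads, self._entries)
                if e["type"] == event_type]
```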

Benefits

Full execution reconstruction. Pattern detection. Learning foundation. Compliance evidence.

Tradeoffs

Storage growth. Write overhead. Risk of sensitive data in journals.

Failure Modes

Journal writes that block execution. Journals too verbose to be useful.

Auditable Action, Active Plan Board, Checkpoints and Rollback


Checkpoints and Rollback

Intent

Create saveable execution states at strategic points so the system can recover from failures by reverting to a known-good state rather than restarting from scratch.

Context

Long-running agentic tasks — multi-step refactors, research synthesis, complex deployments — accumulate significant work before completion. A failure near the end should not require repeating everything from the beginning.

Forces

  • Frequent checkpoints increase resilience but add overhead
  • Checkpoint restoration must include memory, plan state, and intermediate results
  • Not all state is easily serializable (e.g., external side effects)

Structure

A checkpoint captures: the current plan board state, working memory contents, completed step results, and any relevant context. Checkpoints are stored with identifiers and timestamps. On failure, the kernel can load a checkpoint and resume from that point.

Dynamics

Kernel reaches checkpoint trigger (step completion, time interval, risk boundary) → Serialize state → Store checkpoint → Continue execution. On failure: load most recent valid checkpoint → Adjust plan to skip completed steps → Resume.

flowchart TD
  Exec[Executing] --> Trig{Checkpoint Trigger?}
  Trig -->|Yes| Save[Save State]
  Save --> Continue[Continue]
  Trig -->|No| Continue
  Continue --> Fail{Failure?}
  Fail -->|No| Exec
  Fail -->|Yes| Load[Load Last Checkpoint]
  Load --> Adjust[Adjust Plan]
  Adjust --> Exec
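The loop above can be sketched as follows; the checkpoint store, the resume budget, and the simulated flaky step are assumptions for illustration, and real external side effects would of course not be covered by the snapshot:

```python
import copy

class CheckpointStore:
    """Keeps deep-copied snapshots keyed by the number of completed steps."""
    def __init__(self):
        self._snapshots = {}

    def save(self, completed, state):
        self._snapshots[completed] = copy.deepcopy(state)   # snapshot, not a live reference

    def latest(self):
        if not self._snapshots:
            return 0, None
        completed = max(self._snapshots)
        return completed, copy.deepcopy(self._snapshots[completed])

def run_plan(steps, execute, store, max_resumes=3):
    """Execute steps in order, checkpointing after each one; on failure, reload
    the last checkpoint and skip the steps it already covers."""
    state, start = {"results": {}}, 0
    for _ in range(max_resumes + 1):
        try:
            for i in range(start, len(steps)):
                state["results"][steps[i]] = execute(steps[i])
                store.save(i + 1, state)        # checkpoint trigger: step completion
            return state
        except RuntimeError:
            start, state = store.latest()        # load last valid checkpoint
            if state is None:
                raise                            # nothing to roll back to
    raise RuntimeError("resume budget exhausted")

attempts = {"deploy": 0}
def flaky(step):
    """Simulated worker: 'deploy' fails once, then succeeds."""
    if step == "deploy":
        attempts["deploy"] += 1
        if attempts["deploy"] == 1:
            raise RuntimeError("transient failure")
    return step + ": ok"

store = CheckpointStore()
result = run_plan(["build", "test", "deploy"], flaky, store)
```

Here the failure in "deploy" costs only a retry of that step; "build" and "test" are restored from the checkpoint rather than re-executed.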

Benefits

Resilient long-running tasks. Reduced waste on failure. Enables experimentation with rollback safety.

Tradeoffs

Checkpoint serialization cost. Storage requirements. External side effects cannot be rolled back.

Failure Modes

Checkpoints that capture inconsistent state. Over-reliance on rollback instead of fixing root causes.

Active Plan Board, Failure Containment, Execution Journal


Failure Containment

Intent

Prevent a failure in one part of the system from cascading into other parts.

Context

In a system with multiple concurrent workers, shared memory, and interconnected plans, a single failure can propagate rapidly. A stuck worker can exhaust resources. A corrupted memory entry can mislead other workers. A failed operator can trigger retry storms.

Forces

  • Isolation prevents cascading failure but limits beneficial interaction
  • Detection must be fast enough to contain before spread
  • Recovery must not introduce new failure modes

Structure

Each worker process operates within an isolation boundary — its own memory scope, capability set, and resource budget. Failures are detected through timeouts, output validation, and health checks. When a failure is detected, the failing process is terminated or suspended without affecting peer processes.

Dynamics

Worker fails → Isolation boundary contains the failure → Kernel detects via health check → Kernel decides: retry, replace, skip, or escalate → Sibling processes continue unaffected.
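A minimal sketch of a containment boundary, assuming each worker runs as an in-process callable; a real system would use separate processes or sandboxes, but the kernel-side logic is the same:

```python
def run_workers(workers):
    """Run each worker inside its own containment boundary; a failure in one is
    caught and recorded, and sibling workers still run (illustrative sketch)."""
    outcomes = {}
    for name, work in workers.items():
        try:
            outcomes[name] = ("ok", work())
        except Exception as exc:                 # detection at the boundary
            outcomes[name] = ("failed", str(exc))  # kernel may retry, replace, skip, or escalate
    return outcomes

def summarize():
    return "summary done"

def lint():
    raise ValueError("bad config")               # simulated worker failure

outcomes = run_workers({"summarizer": summarize, "linter": lint, "tester": summarize})
```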

Benefits

System resilience. Predictable failure scope. Continued operation after partial failure.

Tradeoffs

Isolation boundaries add overhead. Over-isolation prevents useful cooperation.

Failure Modes

Failures in shared resources (memory plane, operator fabric) that bypass process boundaries. Detection delays that allow cascading before containment.

Subagent as Process, Context Sandbox, Resource Envelope


Staged Autonomy

Intent

Incrementally expand an agent’s autonomous authority as it demonstrates competence, rather than granting full autonomy from the start.

Context

New agentic workflows, new operators, and new domains carry uncertainty. We do not know in advance whether the system will behave correctly. Granting full autonomy in uncertain conditions is reckless, but requiring human approval for every action is impractical.

Forces

  • Full autonomy is efficient but risky in untested scenarios
  • Full supervision is safe but defeats the purpose of automation
  • Trust should be earned through demonstrated competence

Structure

Define autonomy stages: supervised (human approves every action), guided (human approves only high-risk actions), autonomous (system acts freely within policy), adaptive (system proposes policy changes). Workflows start at supervised and advance based on success metrics.

Dynamics

New workflow begins at supervised → Track success rate, error rate, escalation rate → When metrics meet threshold, propose promotion → Human approves stage advancement → Repeat. Failures can trigger demotion.

flowchart LR
  S[Supervised] -->|metrics met| G[Guided]
  G -->|metrics met| A[Autonomous]
  A -->|metrics met| Ad[Adaptive]
  A -->|incident| G
  G -->|incident| S
  Ad -->|incident| A
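The ladder can be sketched as a small state machine; the stage names follow the structure above, while the thresholds and the automatic promotion are simplifying assumptions (the pattern calls for human approval of each advancement):

```python
STAGES = ["supervised", "guided", "autonomous", "adaptive"]

class AutonomyLadder:
    """Promote after sustained success, demote on incident (illustrative
    thresholds; real promotion would also require human approval)."""
    def __init__(self, promote_after=5, max_error_rate=0.1):
        self.stage_index = 0                 # every new workflow starts supervised
        self.successes = self.errors = 0
        self.promote_after = promote_after
        self.max_error_rate = max_error_rate

    @property
    def stage(self):
        return STAGES[self.stage_index]

    def record_success(self):
        self.successes += 1
        error_rate = self.errors / (self.successes + self.errors)
        if (self.successes >= self.promote_after
                and error_rate <= self.max_error_rate
                and self.stage_index < len(STAGES) - 1):
            self.stage_index += 1            # metrics met: advance one stage
            self.successes = self.errors = 0

    def record_incident(self):
        if self.stage_index > 0:
            self.stage_index -= 1            # incident: demote one stage
        self.successes = self.errors = 0

ladder = AutonomyLadder(promote_after=3)
for _ in range(3):
    ladder.record_success()                  # enough evidence to leave supervised
```

Resetting the counters on each transition forces the workflow to re-earn trust at every stage, which matches the evidence-based framing of the pattern.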

Benefits

Controlled risk exposure. Trust-building through evidence. Gradual optimization.

Tradeoffs

Slower initial deployment. Metrics design is critical and non-trivial.

Failure Modes

Premature promotion. Metrics that don’t capture actual risk. Demotion that fails to trigger on real incidents.

Risk-Tiered Execution, Human Escalation, Permission Gate


Resource Envelope

Intent

Define hard boundaries on the resources any single execution can consume — time, tokens, memory, operator invocations — to prevent runaway processes.

Context

Agentic systems can enter loops, recursive decompositions, or retry spirals that consume unbounded resources. Unlike traditional programs with fixed execution paths, agents can generate arbitrary amounts of work.

Forces

  • Unbounded execution risks cost explosions and system starvation
  • Tight boundaries may terminate legitimate complex work prematurely
  • Different tasks have legitimately different resource needs

Structure

Each process or execution receives a resource envelope: maximum tokens (input + output), maximum wall-clock time, maximum operator invocations, maximum subprocess spawns. The kernel enforces these boundaries. When a boundary is approached, the agent receives a warning. When exceeded, execution is suspended.

Dynamics

Kernel sets envelope at process creation → Worker executes within envelope → Budget warnings at 80% → Hard stop at 100% → Worker must checkpoint or summarize before termination → Kernel decides whether to allocate additional budget or terminate.
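A minimal sketch of envelope enforcement with the 80% warning and hard stop described above; the resource names and limits are illustrative:

```python
class BudgetExceeded(Exception):
    pass

class ResourceEnvelope:
    """Track consumption against hard limits; warn at 80%, stop at 100% (sketch)."""
    WARN_FRACTION = 0.8

    def __init__(self, **limits):              # e.g. tokens=100, operator_calls=2
        self.limits = limits
        self.used = {name: 0 for name in limits}
        self.warnings = []

    def charge(self, resource, amount):
        self.used[resource] += amount
        used, limit = self.used[resource], self.limits[resource]
        if used > limit:
            raise BudgetExceeded(f"{resource}: {used}/{limit}")  # hard stop
        if used >= self.WARN_FRACTION * limit:
            self.warnings.append(resource)                       # soft warning to the worker

env = ResourceEnvelope(tokens=100, operator_calls=2)
env.charge("operator_calls", 1)
env.charge("tokens", 50)
env.charge("tokens", 35)        # 85/100: the worker is warned
stopped = False
try:
    env.charge("tokens", 30)    # 115/100: execution is suspended
except BudgetExceeded:
    stopped = True              # kernel now decides: extend the budget or terminate
```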

Benefits

Predictable costs. System stability. Prevention of runaway loops.

Tradeoffs

Legitimate complex tasks may hit boundaries. Envelope sizing requires experience.

Failure Modes

Envelopes set too tight, killing valid work. Envelopes set too loose, providing no real protection.

Context Budget Enforcement, Failure Containment, Subagent as Process


Context Budget Enforcement

Intent

Actively manage the amount of context (tokens, memory entries, retrieved documents) consumed by any single reasoning step to maintain quality and control costs.

Context

Language models degrade in quality when context windows are saturated with irrelevant information. More context does not always mean better results — it often means more noise, higher costs, and slower processing.

Forces

  • Agents tend to accumulate context without pruning
  • Relevant context is essential for quality; irrelevant context degrades it
  • Context costs scale linearly with token count

Structure

Each reasoning step has a context budget: maximum tokens from memory retrieval, maximum documents from search, maximum history entries. The memory plane enforces these budgets through relevance scoring and truncation. Retrieved context is ranked and only the top entries within budget are provided.

Dynamics

Agent requests context → Memory plane retrieves candidates → Relevance scoring ranks candidates → Budget enforcement truncates to top-k within token limit → Agent receives curated, high-relevance context.
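The rank-then-truncate step above can be sketched as follows; the candidate fields and relevance scores are invented for illustration:

```python
def enforce_context_budget(candidates, max_tokens):
    """Rank retrieved candidates by relevance, then keep only the top entries
    that fit within the token budget (illustrative sketch)."""
    ranked = sorted(candidates, key=lambda c: c["relevance"], reverse=True)
    selected, spent = [], 0
    for candidate in ranked:
        if spent + candidate["tokens"] <= max_tokens:
            selected.append(candidate)
            spent += candidate["tokens"]
    return selected

candidates = [
    {"id": "style-guide", "relevance": 0.9, "tokens": 400},
    {"id": "old-chat", "relevance": 0.2, "tokens": 300},
    {"id": "recent-diff", "relevance": 0.8, "tokens": 500},
    {"id": "readme", "relevance": 0.6, "tokens": 300},
]
curated = enforce_context_budget(candidates, max_tokens=1000)
```

The agent never sees the low-relevance entries; the budget is enforced by the memory plane, not left to the agent's discretion.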

Benefits

Higher reasoning quality. Predictable costs. Faster processing. Reduced noise.

Tradeoffs

Aggressive budgeting may exclude relevant information. Relevance scoring must be calibrated.

Failure Modes

Budget enforcement that drops critical context. Relevance scoring that favors recency over importance.

Resource Envelope, Memory on Demand, Compression Pipeline


Applicability Guide

Runtime patterns manage execution state, failure recovery, and resource consumption. They become critical as task complexity and autonomy increase.

Decision Matrix

| Pattern | Apply When | Do Not Apply When |
| --- | --- | --- |
| Active Plan Board | Plans are multi-step and need to be inspectable, modifiable, and trackable during execution | Tasks are single-step or follow a fixed script; plan visibility adds no value |
| Execution Journal | You need a detailed record of what the system did and why — for debugging, audit, or learning | The system is a stateless query-response service with no need for execution history |
| Checkpoints and Rollback | Long-running tasks risk partial failure; you need the ability to resume from a known-good state | Tasks are short enough that restarting from scratch is cheaper than maintaining checkpoint infrastructure |
| Failure Containment | Worker failures must not cascade to other workers or corrupt shared state | A single worker handles everything; there is nothing for failures to cascade to |
| Staged Autonomy | The system’s autonomy should increase over time as it demonstrates reliability; trust must be earned | The system has a fixed, well-understood autonomy level that does not need to evolve |
| Resource Envelope | Workers can consume unbounded resources (tokens, time, API calls); budget enforcement is needed | Resource consumption is naturally bounded by the task structure; enforcement adds overhead without protection |
| Context Budget Enforcement | Context windows are limited; memory retrieval must be selective; relevance ranking matters | All relevant context fits in the window; or the system does not use retrieved memory |

Build Order

  1. Execution Journal — start logging from day one. It is cheap and invaluable for debugging.
  2. Resource Envelope — add budget limits before your first large-scale test. Without them, a single runaway task can exhaust your API budget.
  3. Context Budget Enforcement — add as soon as memory retrieval is part of your system. Quality degrades rapidly with irrelevant context.
  4. Active Plan Board — add when you implement multi-step planning.
  5. Failure Containment and Checkpoints — add when tasks become long-running (minutes, not seconds).
  6. Staged Autonomy — add when you have enough operational history to calibrate trust levels.

Evolution Patterns

These patterns address how an Agentic OS grows, adapts, and extends over time without collapsing into chaos. A system that cannot evolve is dead. A system that evolves without discipline is dangerous.


Patternized Skills

Intent

Capture proven sequences of reasoning, operator usage, and decision-making as reusable, versioned skills that agents can invoke.

Context

Agents repeatedly solve similar classes of problems — refactoring code, summarizing research, triaging tickets. Each time, the agent reasons from scratch, sometimes well, sometimes poorly. This is wasteful and inconsistent.

Forces

  • Ad hoc reasoning is flexible but inconsistent
  • Rigid templates are consistent but brittle
  • Skills must evolve as domains and tools change

Structure

A skill is a structured artifact containing: the problem class it addresses, the recommended decomposition strategy, the operators typically needed, the memory patterns to apply, the governance constraints, and example execution traces. Skills are versioned and stored in a skill registry.

Dynamics

Agent encounters a problem → Skill registry matches problem class → Agent loads relevant skill → Skill guides decomposition and operator selection → Agent adapts skill to specific context → Execution results feed back into skill refinement.
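A skill artifact and registry might look like the following sketch; the fields are a subset of the structure above, and the newest-version matching rule is an assumption:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    """Versioned, reusable skill artifact (illustrative subset of fields)."""
    name: str
    version: str
    problem_class: str
    decomposition: list    # recommended decomposition strategy
    operators: list        # operators typically needed

class SkillRegistry:
    def __init__(self):
        self._skills = {}

    def publish(self, skill):
        self._skills.setdefault(skill.problem_class, []).append(skill)

    def match(self, problem_class):
        """Return the newest version registered for a problem class, if any."""
        versions = self._skills.get(problem_class, [])
        return max(versions, key=lambda s: s.version, default=None)

registry = SkillRegistry()
registry.publish(Skill("triage-ticket", "1.0", "ticket_triage",
                       ["classify", "prioritize", "route"], ["crm.lookup"]))
registry.publish(Skill("triage-ticket", "1.1", "ticket_triage",
                       ["classify", "dedupe", "prioritize", "route"], ["crm.lookup"]))
skill = registry.match("ticket_triage")
```

Execution results would feed back as new versions, keeping the registry a record of what currently works rather than a frozen template library.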

Benefits

Consistency. Faster execution. Knowledge preservation. Onboarding acceleration.

Tradeoffs

Skill maintenance burden. Risk of applying stale skills to changed domains.

Failure Modes

Skills that become gospel instead of guidance. Skill registries that grow without curation.

Reusable Worker Archetypes, Operator Adapters, Governed Extensibility


Reusable Worker Archetypes

Intent

Define standard worker templates — archetypes — that can be instantiated for common roles: researcher, reviewer, coder, analyst, summarizer.

Context

Many agentic workflows use workers with similar configurations: a code reviewer always needs source access, linting tools, and a quality rubric. Configuring these from scratch each time is error-prone and slow.

Forces

  • Custom workers are flexible but expensive to configure
  • Standard archetypes are efficient but may not fit every situation
  • Archetypes must be customizable without losing their core character

Structure

A worker archetype defines: the worker’s role, default capability set, standard operators, memory access patterns, quality criteria, and typical interaction patterns. Archetypes are parameterized — the “code reviewer” archetype can be instantiated with different language contexts, style guides, and risk tolerance.

Dynamics

Kernel needs a worker for a specific role → Selects matching archetype → Instantiates with task-specific parameters → Worker operates with archetype defaults plus customizations → Post-task review identifies archetype improvements.
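Instantiation can be sketched as defaults plus overrides; the archetype fields and parameter names here are assumptions for illustration:

```python
# Archetype: the organization's captured knowledge about an effective configuration.
ARCHETYPES = {
    "code_reviewer": {
        "role": "code_reviewer",
        "capabilities": ["read_source", "run_linter"],
        "operators": ["lint", "diff"],
        "risk_tolerance": "low",
    }
}

def instantiate(archetype_name, **overrides):
    """Instantiate a worker from an archetype: defaults plus task-specific
    customizations (illustrative sketch)."""
    config = dict(ARCHETYPES[archetype_name])   # copy the archetype defaults
    config.update(overrides)                     # apply per-task parameters
    return config

worker = instantiate("code_reviewer", language="python", style_guide="pep8")
```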

Benefits

Rapid worker provisioning. Consistent quality per role. Captured organizational knowledge about effective configurations.

Tradeoffs

Archetype maintenance. Risk of forcing problems into existing archetypes when a novel approach would be better.

Failure Modes

“One archetype fits all” thinking. Archetypes that diverge from actual effective practice.

Patternized Skills, Subagent as Process, Scoped Worker Contract


Operator Adapters

Intent

Create a uniform interface layer over heterogeneous external tools and services so that operators can be swapped, upgraded, or replaced without changing the agents that use them.

Context

External tools change their APIs. New tools emerge that are better than current ones. Multiple tools provide similar functionality with different interfaces. Agents should not be coupled to specific tool implementations.

Forces

  • Direct tool coupling is simple but creates fragile dependencies
  • Abstraction layers add indirection but enable evolution
  • Adapter quality determines whether abstraction helps or hurts

Structure

An operator adapter implements a standard interface (the operator contract) and translates between that interface and the specific external tool. The adapter handles authentication, error mapping, rate limiting, and response normalization. Agents interact only with the standard interface.

Dynamics

Agent invokes operator through standard interface → Adapter translates to tool-specific API → Tool executes → Adapter normalizes response → Agent receives standard response format. Tool upgrades or replacements only require adapter changes.
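The layer can be sketched as a standard contract plus a tool-specific translation; the legacy tool, its tuple format, and its timeout error below are hypothetical:

```python
class OperatorAdapter:
    """Standard operator contract: invoke(payload) -> {"ok", "data", "error"}."""
    def invoke(self, payload):
        raise NotImplementedError

class LegacyTimeout(Exception):
    """Hypothetical error type raised by the legacy tool."""

def legacy_search(query):
    """Stand-in for an external tool (assumption: returns (title, score) tuples)."""
    if query == "down":
        raise LegacyTimeout()
    return [("result-1", 0.9), ("result-2", 0.7)]

class LegacySearchAdapter(OperatorAdapter):
    """Translates between the standard contract and the legacy API."""
    def invoke(self, payload):
        try:
            hits = legacy_search(payload["query"])   # tool-specific call
            return {"ok": True, "data": [title for title, _ in hits], "error": None}
        except LegacyTimeout:
            return {"ok": False, "data": None, "error": "timeout"}  # normalized error

adapter = LegacySearchAdapter()
good = adapter.invoke({"query": "caching"})
bad = adapter.invoke({"query": "down"})
```

Swapping in a newer search tool means writing a new adapter against the same contract; no agent prompt or plan needs to change.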

Benefits

Tool independence. Smooth migrations. Consistent error handling across diverse tools.

Tradeoffs

Adapter development and maintenance. Potential loss of tool-specific features behind a generic interface.

Failure Modes

Leaky abstractions where tool-specific errors propagate through. Adapters that eliminate capabilities unique to specific tools.

Tool as Operator, Operator Registry, Operator Isolation


Domain-Specific Agentic OS

Intent

Create specialized operational systems optimized for a particular domain — code engineering, research, customer support, compliance — rather than a single general-purpose system.

Context

A general-purpose Agentic OS can handle any domain but excels at none. Different domains have fundamentally different requirements: code engineering needs tight tool integration and test feedback loops; research needs broad source access and uncertainty tracking; customer support needs knowledge bases and escalation policies.

Forces

  • Generality provides flexibility but diffuses capability
  • Specialization provides depth but limits scope
  • Organizations handle multiple domains

Structure

A domain-specific Agentic OS inherits the core architecture (kernel, process fabric, memory plane, operator fabric, governance plane) but customizes: the skill library, the operator registry, the memory schemas, the governance policies, and the worker archetypes for its specific domain.

Dynamics

Organization identifies a domain with sufficient volume and pattern regularity → Forks the base OS configuration → Adds domain skills, operators, and policies → The domain OS develops expertise through accumulated memory → Domain OS becomes increasingly effective.

Benefits

Deep domain performance. Appropriate governance per domain. Cleaner evolution paths.

Tradeoffs

Duplication across domain OSs. Integration complexity for cross-domain workflows.

Failure Modes

Over-specialization that prevents cross-domain learning. Under-specialization that’s just the general system with a label.

Meta-Orchestrator, Capability Marketplace, Patternized Skills


Meta-Orchestrator

Intent

Coordinate work across multiple specialized Agentic OS instances when a task spans several domains.

Context

A complex business process might require code changes (handled by the Coding OS), documentation updates (handled by the Writing OS), compliance checks (handled by the Compliance OS), and customer notification (handled by the Support OS). No single OS handles all of this.

Forces

  • Single-OS execution is simpler but limited to one domain
  • Multi-OS coordination enables cross-domain workflows but adds orchestration complexity
  • Each OS has its own governance which the meta-orchestrator must respect

Structure

The meta-orchestrator is itself an Agentic OS that understands domain boundaries, maintains a cross-domain plan, and delegates sub-intents to the appropriate domain OS. It receives results from each domain OS, reconciles them, and tracks cross-domain dependencies.

Dynamics

Meta-orchestrator receives cross-domain intent → Decomposes by domain → Delegates to domain OSs → Tracks progress across domains → Reconciles results → Handles cross-domain dependencies → Reports consolidated outcome.
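The delegation flow can be sketched as follows, assuming each domain OS exposes a single execute entry point (a deliberate simplification of the architecture above):

```python
class DomainOS:
    """Stand-in for a specialized Agentic OS instance."""
    def __init__(self, domain):
        self.domain = domain

    def execute(self, sub_intent):
        return f"{self.domain}: {sub_intent} done"

class MetaOrchestrator:
    """Delegates sub-intents to domain OSs and consolidates results (sketch)."""
    def __init__(self, domain_oss):
        self.domain_oss = domain_oss

    def run(self, cross_domain_plan):
        results = {}
        for domain, sub_intent in cross_domain_plan:    # plan already decomposed by domain
            results[domain] = self.domain_oss[domain].execute(sub_intent)
        return results                                   # consolidated outcome

meta = MetaOrchestrator({"coding": DomainOS("coding"),
                         "compliance": DomainOS("compliance")})
outcome = meta.run([("coding", "apply fix"), ("compliance", "verify change")])
```

A production version would run independent sub-intents concurrently and track cross-domain dependencies; the sequential loop here keeps the delegation structure visible.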

Benefits

Cross-domain workflow support. Preserved domain specialization. Coordinated execution.

Tradeoffs

Orchestration overhead. Cross-domain governance complexity. Error correlation across domains.

Failure Modes

Meta-orchestrator that micromanages domain OSs. Domain OSs that cannot operate independently.

Domain-Specific Agentic OS, Multi-OS Coordination, Active Plan Board


Capability Marketplace

Intent

Enable discovery and composition of capabilities (skills, operators, worker archetypes) across organizational boundaries through a shared registry.

Context

As an organization develops multiple domain OSs and accumulates skills and operators, valuable capabilities become siloed. A skill developed for the customer support domain might be useful in the research domain. An operator adapter for a data service might be needed across all domains.

Forces

  • Siloed capabilities lead to duplication and inconsistency
  • Shared capabilities require quality standards and compatibility guarantees
  • Openness enables innovation but introduces quality risk

Structure

The capability marketplace is a curated registry where teams can publish and discover skills, operators, worker archetypes, and policy packs. Each published capability includes: description, interface contract, quality metrics, governance requirements, version history, and usage examples.

Dynamics

Team develops a useful capability → Publishes to marketplace with metadata → Other teams discover via search → Consumers evaluate compatibility → Adoption with optional customization → Usage metrics and feedback improve the capability.

Benefits

Knowledge sharing. Reduced duplication. Cross-pollination. Community-driven quality.

Tradeoffs

Marketplace governance overhead. Version management across consumers. Quality consistency.

Failure Modes

Marketplace pollution with low-quality capabilities. Version conflicts across consumers. Abandoned capabilities.

Operator Registry, Patternized Skills, Governed Extensibility


Governed Extensibility

Intent

Allow the system to be extended with new capabilities, operators, and skills while maintaining governance invariants.

Context

A system that cannot be extended becomes obsolete. A system that can be extended without controls becomes unstable. The tension between extensibility and governance is fundamental.

Forces

  • Openness enables adaptation and innovation
  • Openness without governance enables chaos
  • Extension points must be designed, not accidental

Structure

The system defines explicit extension points: operator adapters, skill packages, policy modules, worker archetypes. Each extension point has a contract that extensions must satisfy. Extensions are validated against the contract before activation. Governance policies define who can publish extensions, what testing is required, and what approval flow applies.

Dynamics

Developer creates extension → Validates against contract → Submits for review → Governance pipeline evaluates (automated tests, policy compliance, security scan) → Approved extensions are published → Activated extensions operate within the system’s governance framework.
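Contract validation before activation can be sketched as a field-by-field check; the contract fields here are assumptions, not a normative interface:

```python
# Hypothetical extension-point contract: required fields and their expected kinds.
OPERATOR_CONTRACT = {"name": str, "invoke": callable, "risk_tier": str}

def validate_extension(ext):
    """Validate a candidate extension against the contract before activation
    (illustrative sketch; real pipelines would also run tests and security scans)."""
    errors = []
    for field, expected in OPERATOR_CONTRACT.items():
        if field not in ext:
            errors.append(f"missing field: {field}")
        elif expected is callable:
            if not callable(ext[field]):
                errors.append(f"{field} must be callable")
        elif not isinstance(ext[field], expected):
            errors.append(f"{field} must be {expected.__name__}")
    return errors                 # empty list means the contract is satisfied

good_ext = {"name": "spellcheck", "invoke": lambda text: text, "risk_tier": "low"}
bad_ext = {"name": "spellcheck", "risk_tier": 3}
```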

Benefits

Safe evolution. Innovation within guardrails. Quality assurance for extensions.

Tradeoffs

Extension development overhead. Governance pipeline latency. Contract design complexity.

Failure Modes

Contracts that are too restrictive, preventing useful extensions. Contracts that are too permissive, admitting low-quality extensions.

Capability Marketplace, Operator Adapters, Patternized Skills


Applicability Guide

Evolution patterns govern how the system grows and adapts over time. They are relevant for systems that will live beyond a prototype — but premature investment in evolution infrastructure is a common trap.

Decision Matrix

| Pattern | Apply When | Do Not Apply When |
| --- | --- | --- |
| Patternized Skills | You have recurring task types that benefit from codified instructions, tools, and strategies | Every task is unique; or you are still discovering what skills the system needs |
| Reusable Worker Archetypes | Multiple projects need the same types of workers (coder, reviewer, researcher); standardization reduces duplication | You have a single project with bespoke worker types that will not be reused |
| Operator Adapters | External tool APIs change frequently; you need an abstraction layer to isolate the system from API drift | You integrate with a single stable API that has not changed in years |
| Domain-Specific Agentic OS | A vertical domain (legal, medical, financial) has unique requirements that justify a specialized system | A general-purpose system with skill packages is sufficient for your domain needs |
| Meta-Orchestrator | You need to coordinate multiple independent Agentic OSs; cross-OS workflows are a real requirement | A single OS with internal modularity handles all your domains |
| Capability Marketplace | Multiple teams develop and share skills, tools, and policies; a distribution mechanism is needed | A single team builds everything; sharing infrastructure adds overhead without benefit |
| Governed Extensibility | Third parties or untrusted teams contribute extensions; you need safety guarantees for extensions | All extensions are built by a trusted core team; governance overhead is not justified |

Evolution Timing

Before launch: Invest in Operator Adapters (insulate from external API changes) and Patternized Skills (codify what you already know works).

After 3 months of operation: Introduce Reusable Worker Archetypes (standardize what you have learned) and evaluate whether Governed Extensibility is needed.

After 6+ months, multiple teams: Consider Capability Marketplace and Meta-Orchestrator only if you have genuine multi-team or multi-OS coordination needs.

Domain-Specific Agentic OS is a strategic decision, not an incremental one. Build it when you have enough domain expertise and operational evidence to justify the investment.

From Requests to Intent

Most software begins with a well-defined input and ends with a well-defined output. A function takes arguments. An API receives a payload. A CLI parses flags. The contract is crisp: the caller knows what to ask for, and the system knows what to deliver.

Agentic systems break this contract — on purpose.

The Gap Between What Is Said and What Is Meant

When a human says “fix the login bug,” they rarely mean only that. They mean: find the bug, understand the root cause, fix it without breaking other things, verify the fix works, and ideally make the surrounding code a little more robust so the same class of bug does not recur. They also mean: do not touch unrelated code, do not refactor the entire auth module, and keep the pull request reviewable.

None of that is in the words “fix the login bug.”

In traditional systems, we bridge this gap with specifications, tickets, acceptance criteria — human-written documents that attempt to make implicit intent explicit. In agentic systems, the system itself must bridge the gap. The cognitive kernel must interpret the request, infer the full intent, and construct a plan that serves what was meant, not just what was said.

Intent Is Not a String

This is the fundamental shift. The input to an agentic system is not a structured command but a goal — often underspecified, context-dependent, and laden with implicit expectations.

Intent has layers:

  • Surface intent: The literal ask. “Write a function that sorts users by last login.”
  • Operational intent: The practical constraints. It should be efficient, handle edge cases, fit the project’s style.
  • Strategic intent: The deeper goal. We are building an admin dashboard and need to identify inactive accounts.
  • Boundary intent: What should not happen. Do not change the database schema. Do not introduce new dependencies.

A capable system does not just parse the surface. It reconstructs as many layers as it can from context — the codebase, recent changes, the user’s history, the project’s conventions — and asks for clarification only when the ambiguity would lead to meaningfully different outcomes.

The Intent Interpretation Pipeline

The cognitive kernel processes incoming requests through an interpretation pipeline:

flowchart TD
  R[Incoming Request] --> P[1. Parse the Request\nExtract literal content]
  P --> LC[2. Load Context\nCodebase, history, conventions]
  LC --> CC[3. Classify Complexity\nSimple → direct execution\nComplex → full planning]
  CC --> IC[4. Infer Implicit Constraints\nUnstated rules and expectations]
  IC --> RA[5. Resolve Ambiguity\nGuess if low-risk, ask if high-risk]
  RA --> SI[6. Produce Structured Intent\nGoal + constraints + success criteria]

1. Parse the Request

Extract the literal content: what action is requested, what artifacts are mentioned, what constraints are stated. This is the easy part.

2. Load Context

Gather the relevant surrounding information: the current state of the codebase, recent conversations, active tasks, user preferences, project conventions. Context is what transforms a vague request into a grounded one.

3. Classify Complexity

Is this a single-step task or a multi-step workflow? Does it require specialized knowledge? Does it touch sensitive areas? The classification determines how much planning infrastructure the system will deploy. Small tasks get direct execution. Large tasks get full planning.

4. Infer Implicit Constraints

What goes without saying? If the user asks to fix a bug in a production service, the implicit constraints include: do not break the build, do not introduce regressions, follow existing patterns, keep changes minimal. These are not stated because they are always true. The system must know them anyway.

5. Resolve Ambiguity

When the request is ambiguous and the ambiguity matters, the system must decide: resolve it autonomously using available context, or escalate to the user. The decision depends on the cost of guessing wrong. Low-risk ambiguity can be resolved by best guess. High-risk ambiguity demands clarification.

6. Produce a Structured Intent

The output of the pipeline is not the original string — it is a structured representation of the goal: what to achieve, what to avoid, what constraints apply, what success looks like, and what level of autonomy is appropriate.
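The structured representation might be modeled as a small record; the field names below are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field

@dataclass
class StructuredIntent:
    """Output of the interpretation pipeline (illustrative field names)."""
    goal: str                                            # what to achieve
    constraints: list = field(default_factory=list)      # what must hold
    boundaries: list = field(default_factory=list)       # what must not happen
    success_criteria: list = field(default_factory=list) # what done looks like
    autonomy: str = "guided"                             # appropriate autonomy level

intent = StructuredIntent(
    goal="Fix the login bug in the auth service",
    constraints=["keep the change minimal", "follow existing patterns"],
    boundaries=["do not change the database schema"],
    success_criteria=["existing tests pass", "bug reproduction no longer fails"],
)
```

Everything downstream — planning, delegation, governance checks — operates on this record, not on the original request string.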

The Art of Asking

A system that asks for clarification on every ambiguity is not intelligent — it is annoying. A system that never asks and guesses wrong is not autonomous — it is reckless.

The right balance is contextual. Early in a relationship (when the system has little history with a user), it should ask more. Over time, as it accumulates context about preferences, patterns, and boundaries, it should ask less. This is not just politeness — it is efficiency. Every question interrupts the user’s flow. Every wrong guess wastes compute and trust.

The best agentic systems learn to model their operators. They notice that this user always wants verbose logging. That this team prefers small PRs. That this project has strict linting rules. These observations compress future intent interpretation: the system can infer more from less.

Intent vs. Instruction

There is a useful distinction between intent and instruction:

  • Intent is what you want to achieve. “Make the homepage faster.”
  • Instruction is how you want it done. “Add lazy loading to all below-the-fold images.”

Agentic systems should accept both, but prefer intent. When given intent, they can choose the best approach given the current context. When given instruction, they execute faithfully but have less room to add value.

The highest-leverage interactions are those where the human provides intent and constraints, and the system provides the plan and execution. This is the division of labor that makes agentic systems worthwhile.

From Request to Plan

Once intent is structured, the kernel can plan. But the transition from intent to plan is itself non-trivial. The next chapter examines how complex intents are decomposed into executable work.

The key insight here is that intent interpretation is not preprocessing — it is the first act of intelligence in the system. Get it wrong, and everything downstream is wasted effort pointed in the wrong direction. Get it right, and even a modest execution engine produces valuable results.

Intent is the operating system’s boot sequence. Everything starts here.

Problem Decomposition

A single well-stated goal can hide enormous complexity. “Build me a REST API for user management” sounds like one task. It is dozens: design the data model, define endpoints, implement authentication, write validation logic, handle error cases, set up database migrations, add tests, document the API. Each of those contains subtasks of its own.

Decomposition — breaking a problem into smaller, manageable pieces — is perhaps the most important cognitive operation in an agentic system. It is the moment where a vague mountain of work becomes a structured plan.

Why Decomposition Matters

Language models have finite context windows and finite reasoning depth. A model asked to “build the entire API” in one pass will produce something superficial — correct in shape but wrong in details. The same model asked to “implement the email validation function for user registration” will produce something precise, tested, and robust.

This is not a limitation to work around. It is a fundamental principle: focused context produces better results.

Decomposition is how the Agentic OS turns broad intent into focused work. Each subtask gets a worker with a scoped context, a clear objective, and explicit success criteria. The worker does not need to know about the entire API. It needs to know about email validation.

The Decomposition Spectrum

Not all decomposition is equal. Tasks fall along a spectrum of decomposability:

flowchart LR
  subgraph Trivial["Trivially\nDecomposable"]
    direction TB
    T1[Task A] 
    T2[Task B]
    T3[Task C]
  end
  subgraph Sequential["Sequentially\nDecomposable"]
    direction TB
    S1[Step 1] --> S2[Step 2] --> S3[Step 3]
  end
  subgraph Graph["Graph\nDecomposable"]
    direction TB
    G1[A] --> G3[C]
    G2[B] --> G3
    G3 --> G4[D]
  end
  subgraph Iterative["Iteratively\nDecomposable"]
    direction TB
    I1[Step] --> I2[Learn] --> I3[Replan]
    I3 -.-> I1
  end
  subgraph NonDecomp["Non-\nDecomposable"]
    direction TB
    N1[Holistic\nReasoning]
  end

Trivially Decomposable

Tasks that split naturally into independent parts. “Rename all occurrences of userId to user_id in these five files.” Five independent find-and-replace operations. No dependencies, no coordination needed.

Sequentially Decomposable

Tasks with a natural order. “Parse the CSV, validate each row, transform to JSON, upload to the API.” Each step depends on the previous one’s output. The decomposition is a pipeline.

Graph-Decomposable

Tasks with partial ordering. Some subtasks depend on others; many can run in parallel. “Build a dashboard: fetch user data (API), fetch analytics data (API), design layout (UI), render charts (UI, depends on data), compose page (depends on layout and charts).” This is a dependency graph.

Iteratively Decomposable

Tasks where the decomposition itself evolves. “Research the best caching strategy for our application.” You cannot decompose this upfront because you do not know what you will find. Each step — analyzing the workload, reviewing options, prototyping a solution — may change the plan.

Non-Decomposable

Tasks that require holistic reasoning. “Review this code for architectural coherence.” This cannot be split because the insight comes from seeing the whole. These tasks must be given to a single worker with sufficient context.

Decomposition Strategies

The cognitive kernel employs different strategies depending on the task:

Functional Decomposition

Split by what needs to be done. Each subtask is a distinct function: “design the schema,” “implement the endpoint,” “write the tests.” This is the most common strategy and works well when the functions are relatively independent.

Data Decomposition

Split by what data is being processed. “Analyze logs from the last 30 days” becomes 30 parallel tasks, one per day. This is powerful when the operation is uniform and the data partitions cleanly.

Temporal Decomposition

Split by when things happen. “Set up the deployment pipeline” decomposes into: build stage, test stage, staging deploy, production deploy. Each phase is a natural boundary.

Risk-Based Decomposition

Split by risk level. Separate the safe changes from the risky ones. Apply the safe changes first, validate, then proceed to the risky changes with human approval. This strategy interleaves with the governance plane.

The Decomposition Contract

Each subtask produced by decomposition should carry a contract:

  • Objective: What this subtask must achieve, stated precisely.
  • Inputs: What information or artifacts the subtask receives.
  • Outputs: What the subtask must produce.
  • Constraints: What the subtask must not do.
  • Success criteria: How to verify the subtask was completed correctly.
  • Dependencies: What other subtasks must complete before this one can start.

This contract is not bureaucracy — it is the mechanism that allows independent workers to produce results that fit together. Without it, you get a puzzle where the pieces were cut by different people using different templates.
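The contract lends itself to a plain data structure. A minimal sketch in Python — the field names and example values are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a subtask contract as a data structure.
@dataclass
class TaskContract:
    objective: str                                             # what this subtask must achieve
    inputs: list[str] = field(default_factory=list)            # information or artifacts received
    outputs: list[str] = field(default_factory=list)           # what it must produce
    constraints: list[str] = field(default_factory=list)       # what it must not do
    success_criteria: list[str] = field(default_factory=list)  # how to verify completion
    dependencies: list[str] = field(default_factory=list)      # subtask ids that must finish first

contract = TaskContract(
    objective="Implement email validation for user registration",
    inputs=["registration form schema"],
    outputs=["validate_email() function", "unit tests"],
    constraints=["do not modify the registration endpoint"],
    success_criteria=["all unit tests pass"],
    dependencies=["schema-design"],
)
```

Because the contract is structured data rather than prose, the kernel can verify it mechanically: check that every dependency names a real subtask, that every input is someone else's output, and that success criteria are present before dispatching a worker.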

Depth of Decomposition

How far should you decompose? Too shallow, and subtasks are still too complex for focused execution. Too deep, and the coordination overhead drowns out the work.

The right depth depends on two factors:

  1. Worker capability: How complex a task can a single worker handle reliably? This varies with the model, the domain, and the available tools: larger models, well-tooled domains, and routine tasks allow shallower decomposition.
  2. Coordination cost: Each additional level of decomposition adds overhead — more context assembly, more result consolidation, more potential for miscommunication.

The sweet spot is where each subtask is small enough to be executed with high reliability but large enough to carry meaningful context. In practice, this is often 2-3 levels of decomposition for complex tasks.

Decomposition Failures

Decomposition can go wrong in characteristic ways:

  • Over-decomposition: Splitting a task so finely that each piece lacks the context needed to make good decisions. A function split into “write line 1,” “write line 2” is absurd. But less obvious versions of this happen when semantic units are broken across workers.

  • Under-decomposition: Leaving a task too large for reliable execution. The worker produces something that looks complete but collapses under scrutiny.

  • Wrong boundaries: Splitting at the wrong seam. Two subtasks that should share context are separated; two unrelated subtasks are grouped. This leads to duplicate work or contradictory outputs.

  • Missing dependencies: Failing to identify that subtask B needs the output of subtask A. The result is a worker blocked on information it does not have, forced to guess or fail.

  • Circular dependencies: A decomposition where A depends on B and B depends on A. This happens when the decomposition does not respect the natural information flow.
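The last two failure modes, at least, can be caught mechanically before execution starts. A sketch using Kahn's topological sort over a hypothetical `task -> prerequisites` mapping — if the sort cannot visit every task, the unvisited tasks sit on a cycle:

```python
from collections import defaultdict, deque

def find_cycle(tasks: dict[str, list[str]]) -> bool:
    """Return True if the mapping task -> prerequisites contains a cycle."""
    indegree = {t: len(deps) for t, deps in tasks.items()}
    dependents = defaultdict(list)
    for t, deps in tasks.items():
        for d in deps:
            dependents[d].append(t)
    # Start from tasks with no prerequisites; peel the graph layer by layer.
    ready = deque(t for t, n in indegree.items() if n == 0)
    visited = 0
    while ready:
        t = ready.popleft()
        visited += 1
        for nxt in dependents[t]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return visited != len(tasks)  # anything left unvisited is on a cycle

# A depends on B and B depends on A: the decomposition is invalid.
assert find_cycle({"A": ["B"], "B": ["A"]}) is True
assert find_cycle({"A": [], "B": ["A"], "C": ["A", "B"]}) is False
```

Missing dependencies are harder — no algorithm can know that B secretly needs A's output — which is why the decomposition contract asks for inputs and outputs explicitly: a subtask whose declared inputs are produced by no other subtask is a red flag the kernel can surface.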

Decomposition as a First-Class Operation

In the Agentic OS, decomposition is not an informal step buried inside a prompt. It is a first-class operation with explicit inputs (the structured intent) and explicit outputs (a task graph). The kernel can inspect it, modify it, visualize it. When a plan fails, the kernel can ask: Was the decomposition wrong? Should we re-decompose?

This explicitness is what separates an Agentic OS from a chatbot chain. The chatbot chain decomposes implicitly — each link in the chain vaguely hands off to the next. The Agentic OS decomposes explicitly — the task graph is a data structure that can be reasoned about, optimized, and adapted.

The next chapter explores what happens after decomposition: how the system executes, monitors, and adapts the plan.

Planning, Acting, Checking, Adapting

A plan is a hypothesis about how to achieve a goal. Like all hypotheses, it is wrong — or at least incomplete. The question is not whether the plan will survive contact with reality, but how quickly the system can detect the mismatch and adapt.

This chapter describes the execution loop at the heart of the Agentic OS: the cycle of planning, acting, checking, and adapting that turns a decomposed task graph into a delivered result.

The Execution Loop

flowchart LR
  Plan --> Act --> Check --> Adapt
  Adapt -.->|repeat| Plan

This is not a waterfall. It is a tight loop that runs at every level of the system: at the macro level (the overall task), at the meso level (across the steps of the plan), and at the micro level (each individual action within a worker).

Plan

The planner — a function of the cognitive kernel — takes the current state, the goal, and the available resources, and produces a plan: an ordered set of steps with dependencies, resource requirements, and success criteria.

Planning is not a one-time event. The initial plan is the best guess given what is known at the start. It will be revised.

The plan is a data structure, not prose. It has nodes (tasks), edges (dependencies), annotations (risk levels, estimated cost), and metadata (which governance policies apply). This structure makes it inspectable, diffable, and shareable.
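One minimal way to represent such a plan — the field names here are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a plan as an inspectable graph, not prose.
@dataclass
class PlanNode:
    task_id: str
    description: str
    risk: str = "low"            # annotation: risk level
    estimated_cost: float = 0.0  # annotation: budget estimate

@dataclass
class Plan:
    nodes: dict[str, PlanNode] = field(default_factory=dict)
    edges: list[tuple[str, str]] = field(default_factory=list)  # (prerequisite, dependent)
    policies: list[str] = field(default_factory=list)           # governance metadata

    def add(self, node: PlanNode, after: tuple[str, ...] = ()):
        self.nodes[node.task_id] = node
        for dep in after:
            self.edges.append((dep, node.task_id))

    def ready(self, done: set[str]) -> list[str]:
        """Task ids whose prerequisites are all complete."""
        blocked = {b for a, b in self.edges if a not in done}
        return [t for t in self.nodes if t not in done and t not in blocked]

plan = Plan()
plan.add(PlanNode("schema", "design the schema"))
plan.add(PlanNode("endpoint", "implement the endpoint", risk="medium"), after=("schema",))
plan.add(PlanNode("tests", "write the tests"), after=("endpoint",))
```

Because the plan is a data structure, `ready()` is a query the scheduler can run after every step, and two plans can be diffed node by node to see exactly what a replan changed.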

Act

Execution happens through the process fabric. Each step in the plan is assigned to a worker — a scoped process with defined inputs, tools, and boundaries. The worker executes its step and produces output.

Acting is where the system interacts with the real world: reading files, calling APIs, querying databases, generating code, writing documents. Each action has consequences, some of which are irreversible. The governance plane applies here — risky actions may be gated by approval.

Check

After each act, the system verifies the result. Did the action succeed? Does the output meet the success criteria? Are there side effects?

Checking takes multiple forms:

  • Output validation: Does the output match the expected format and content? Does generated code compile? Do tests pass?
  • State validation: Is the system in the expected state after the action? Did the database update correctly? Is the file where it should be?
  • Goal alignment: Does this result move us closer to the overall goal, or have we drifted?
  • Constraint compliance: Did the action stay within policy boundaries? Were any governance rules triggered?

Checking is not optional. Systems that skip checking rely on hope. Hope is not an engineering strategy.
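One way to make checking uniform across these forms is to treat each check as a named predicate over the step's result. A sketch — the result fields and criteria are invented for illustration:

```python
# Hypothetical sketch: checking as a list of named predicates applied to a result.
def check(result: dict, criteria: dict) -> list[str]:
    """Return the names of failed checks; an empty list means the step passed."""
    failures = []
    for name, predicate in criteria.items():
        try:
            ok = predicate(result)
        except Exception:
            ok = False  # a check that crashes is a failed check, not a skipped one
        if not ok:
            failures.append(name)
    return failures

result = {"format": "json", "tests_passed": 12, "tests_failed": 1}
criteria = {
    "output_validation": lambda r: r["format"] == "json",
    "state_validation": lambda r: r["tests_failed"] == 0,
}
failed = check(result, criteria)  # -> ["state_validation"]
```

Returning the names of the failed checks, rather than a bare boolean, matters for the next phase: the adaptation step can respond differently to a format failure than to a failing test.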

Adapt

When checking reveals a mismatch — the output is wrong, a dependency failed, new information changes the picture — the system adapts.

Adaptation ranges from minor to radical:

  • Retry: The simplest adaptation. The action failed due to a transient issue. Try again, possibly with minor adjustments. But retrying the same thing the same way is not adaptation — it is denial.
  • Revise: The step’s approach was wrong. Try a different technique. If the regex-based parser failed, try an AST-based parser. This is local adaptation within a single step.
  • Replan: The plan itself is flawed. A dependency produced unexpected output. A new constraint was discovered. The kernel re-decomposes from the current state and produces a new plan.
  • Escalate: The system cannot resolve the issue autonomously. It escalates to a human operator with a clear explanation of what happened, what was tried, and what options remain.
  • Abort: The goal is unachievable, or the cost of continuing exceeds the value of the result. The system stops, reports why, and reclaims resources.
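The escalation from retry to abort can be made explicit rather than ad hoc. A hypothetical sketch in which repeated failures of the same step climb the ladder instead of repeating the same response — the policy is illustrative:

```python
from enum import Enum

class Adaptation(Enum):
    RETRY = 1
    REVISE = 2
    REPLAN = 3
    ESCALATE = 4
    ABORT = 5

# Hypothetical escalation policy: each repeated failure of the same step
# moves one rung up the ladder, so the system never retries the same
# thing the same way forever.
def next_adaptation(failure_count: int, budget_exhausted: bool) -> Adaptation:
    if budget_exhausted:
        return Adaptation.ABORT
    ladder = [Adaptation.RETRY, Adaptation.REVISE,
              Adaptation.REPLAN, Adaptation.ESCALATE]
    return ladder[min(failure_count, len(ladder) - 1)]

assert next_adaptation(0, False) is Adaptation.RETRY
assert next_adaptation(2, False) is Adaptation.REPLAN
assert next_adaptation(5, True) is Adaptation.ABORT
```

A real kernel would condition the choice on more than a counter — the kind of failure, the risk level of the step, the remaining budget — but the structural point stands: adaptation is a policy the system executes, not a mood.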

Depth of the Loop

The loop runs at multiple depths simultaneously:

flowchart TD
  subgraph Macro["Macro Loop — Overall Task (minutes–hours)"]
    direction LR
    MP[Plan task] --> MA[Execute steps] --> MC[Assess whole result] --> MAd[Re-decompose if needed]
    MAd -.-> MP
  end
  subgraph Meso["Meso Loop — Across Steps (seconds–minutes)"]
    direction LR
    MeP[Check plan validity] --> MeA[Run next step] --> MeC[Verify step output] --> MeAd[Revise plan]
    MeAd -.-> MeP
  end
  subgraph Micro["Micro Loop — Within a Worker (seconds)"]
    direction LR
    MiP[Generate] --> MiA[Check] --> MiC[Fix] --> MiAd[Retry]
    MiAd -.-> MiP
  end

  Macro --- Meso --- Micro

Micro Loop: Within a Worker

A single worker performing a single step runs its own plan-act-check-adapt cycle. It generates a function, checks if it compiles, adapts the implementation if it does not. This loop is fast and tight — measured in seconds.

Meso Loop: Across Steps

The kernel monitors progress across the task graph. As each step completes, the kernel checks whether the plan is still valid, whether dependencies are satisfied, and whether the next step should proceed as planned or be revised. This loop runs on the scale of minutes.

Macro Loop: The Overall Task

At the highest level, the kernel evaluates whether the overall goal is being achieved. After all the code is written, does the feature actually work? After all the sections are drafted, does the document make sense? This requires stepping back from the individual steps and assessing the whole. This loop may trigger complete re-decomposition.

The Feedback Problem

The quality of adaptation depends on the quality of feedback. If the system cannot tell whether an action succeeded, it cannot adapt.

This creates a design imperative: build checkability into every step.

  • Code tasks should include tests that validate the output.
  • Data tasks should include assertions on the data.
  • Writing tasks should include criteria that can be checked (word count, coverage of topics, absence of contradictions).
  • Integration tasks should include smoke tests.

When feedback is unavailable or unreliable, the system must proceed with lower confidence and narrower autonomy — checking with humans more frequently.

Planning Strategies

Not all tasks need the same planning approach:

Scripted Plans

For well-understood tasks with known steps. “Deploy this service: build, test, push image, update config, roll out.” The plan is essentially a script. The adaptation space is small — if a step fails, retry or abort.

Exploratory Plans

For tasks where the path is unknown. “Find out why latency increased last week.” The plan is a series of investigations. Each step’s result determines the next step. The plan evolves as knowledge accumulates.

Constraint-Satisfaction Plans

For tasks defined by constraints rather than steps. “Generate a test suite that covers all public methods and achieves 80% branch coverage.” The plan is iterative: generate tests, check coverage, generate more for uncovered paths, repeat until constraints are met.

Adversarial Plans

For tasks where the system must consider failure modes. “Migrate the database schema without downtime.” The plan includes rollback steps, canary checks, and fallback paths. Each step has a “what if this fails” contingency.

When Plans Fail

Plan failure is not system failure — it is information. A failed plan tells the system something it did not know before. The question is whether the system can learn from the failure and produce a better plan.

Common failure patterns:

  • Cascading failure: One step fails and the failure propagates through the dependency graph. The system must identify the root failure and re-plan from there, not retry each downstream step.
  • Silent failure: A step appears to succeed but produces subtly wrong output that causes failures later. This is the hardest to detect and requires robust checking at every stage.
  • Plan obsolescence: The world changed while the plan was executing. The file was modified by someone else. The API endpoint was deprecated. The requirements shifted. The system must detect this and replan.

The Cost of Iteration

Every loop iteration costs resources: model calls, tool invocations, time. A system that iterates too freely burns through budgets. A system that iterates too conservatively delivers poor results.

The kernel must manage this tradeoff explicitly:

  • Set iteration budgets per task and per step.
  • Track the trajectory: are iterations improving the result, or oscillating?
  • Recognize diminishing returns and accept “good enough.”
  • Report cost alongside results so operators can calibrate.
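These four rules can be condensed into a stopping rule. A sketch — the budget and `min_gain` threshold are illustrative knobs, not recommended values:

```python
# Hypothetical sketch: deciding whether another loop iteration is worth its cost.
def should_continue(scores: list[float], budget: int, min_gain: float = 0.01) -> bool:
    """scores: quality estimate per iteration so far.
    Stop on budget exhaustion or on diminishing returns."""
    if len(scores) >= budget:
        return False                      # iteration budget exhausted
    if len(scores) >= 2 and scores[-1] - scores[-2] < min_gain:
        return False                      # last iteration barely improved: accept "good enough"
    return True

assert should_continue([0.5, 0.7], budget=10) is True      # still improving
assert should_continue([0.80, 0.805], budget=10) is False  # diminishing returns
assert should_continue([0.5] * 10, budget=10) is False     # budget spent
```

The hard part in practice is the `scores` list itself: it presupposes a quality signal per iteration, which is exactly the checkability imperative from the feedback section.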

Execution as Learning

The most important insight about the execution loop is that each cycle produces knowledge, not just output. The system learns what works in this context, what constraints truly apply, and what shortcuts are viable. In a mature Agentic OS, this knowledge is captured in the memory plane and improves future performance.

The plan-act-check-adapt loop is the system’s method of thinking. It is trial and error made rigorous, feedback made structural, and learning made explicit. It is the difference between an agent that follows a script and an agent that solves problems.

One Agent, Many Agents, or Many OSs

The simplest agentic system is a single agent: one cognitive kernel, one process, one conversation. It receives a request, reasons about it, acts, and responds. Most chatbot interactions work this way. It is straightforward, easy to understand, and fundamentally limited.

The moment you need parallelism, specialization, or scale, you face an architectural decision that has no universal answer: should you use one agent that does many things, many agents coordinated as a team, or many independent operating systems that federate?

The Single-Agent Model

A single agent handles everything. It reads the request, plans, executes each step sequentially, and delivers the result. This is adequate when:

  • Tasks are simple and self-contained.
  • Sequential execution is acceptable.
  • The domain is narrow enough for one context window.

The limitation is cognitive: a single agent juggling ten concerns produces worse results than ten agents each focused on one. Context pollution — the mixing of unrelated information in a single context — degrades reasoning quality. A model simultaneously tracking database schema decisions, CSS styling choices, and deployment configuration is a model doing all three poorly.

The Multi-Agent Model

Multiple agents, each specialized, coordinated by the cognitive kernel. This is the Agentic OS’s primary mode for complex work.

Why Multiple Agents

The argument for multi-agent systems is the same argument for microservices, for modular code, for teams with specialized roles: separation of concerns produces better results.

A code-writing agent has a context loaded with the relevant source files, language idioms, and test patterns. A code-reviewing agent has a context loaded with quality standards, common bug patterns, and architectural principles. A documentation agent has a context loaded with writing guidelines and API references. Each agent is excellent at its specialty because it is only doing its specialty.

Coordination Patterns

Multi-agent systems require coordination. The Agentic OS supports several patterns:

flowchart LR
  subgraph pipeline["Pipeline"]
    direction LR
    PA[Agent A] --> PB[Agent B] --> PC[Agent C]
  end
flowchart TD
  subgraph fanout["Fan-Out / Fan-In"]
    direction TB
    K[Kernel] --> FA1[Agent 1]
    K --> FA2[Agent 2]
    K --> FA3[Agent 3]
    FA1 --> C[Consolidate]
    FA2 --> C
    FA3 --> C
  end
flowchart TD
  subgraph hierarchy["Hierarchy"]
    direction TB
    Sup[Supervisor] --> W1[Worker 1]
    Sup --> W2[Worker 2]
    W2 --> W2a[Sub-worker A]
    W2 --> W2b[Sub-worker B]
  end
flowchart TD
  subgraph consensus["Consensus"]
    direction TB
    P["Problem"] --> A1[Agent A]
    P --> A2[Agent B]
    P --> A3[Agent C]
    A1 --> R[Referee]
    A2 --> R
    A3 --> R
    R --> Best[Best Result]
  end
flowchart LR
  subgraph adversarial["Adversarial"]
    direction LR
    Prod[Producer] -->|output| Crit[Critic]
    Crit -->|feedback| Prod
    Prod -->|refined| Final[Final Output]
  end

Pipeline: Agents work in sequence. Agent A’s output becomes Agent B’s input. The code agent writes the function, the test agent writes the tests, the review agent checks both. Clean, predictable, but sequential.

Fan-Out / Fan-In: The kernel dispatches work to multiple agents in parallel, then consolidates the results. Five agents each analyze a different module simultaneously. Fast, but requires careful consolidation.

Hierarchy: A supervisor agent manages worker agents. The supervisor plans, delegates, and monitors. Workers execute. This mirrors the kernel-process relationship and scales well to deep task hierarchies.

Consensus: Multiple agents tackle the same problem independently, and a referee agent selects or synthesizes the best result. Expensive but robust for high-stakes decisions.

Adversarial: One agent produces, another critiques. The coder writes, the reviewer challenges. The planner proposes, the red team attacks. This produces more robust outputs at the cost of more compute.
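The fan-out / fan-in pattern maps naturally onto structured concurrency. A minimal sketch using `asyncio`, with a stub `analyze` worker standing in for real agent calls — the module names and consolidation step are invented for illustration:

```python
import asyncio

# Hypothetical fan-out / fan-in sketch: the kernel dispatches one worker per
# module in parallel, then consolidates the results.
async def analyze(module: str) -> str:
    await asyncio.sleep(0)          # placeholder for model and tool calls
    return f"{module}: ok"

async def fan_out_fan_in(modules: list[str]) -> list[str]:
    # Fan-out: all workers run concurrently; gather preserves submission order.
    results = await asyncio.gather(*(analyze(m) for m in modules))
    # Fan-in: the consolidation step (here, just sorting the findings).
    return sorted(results)

report = asyncio.run(fan_out_fan_in(["auth", "billing", "search"]))
```

The consolidation step is where the coordination tax described below is actually paid: real fan-in must reconcile overlapping or contradictory findings, not merely collect them.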

The Coordination Tax

Every agent added to a system increases coordination overhead:

  • More context must be assembled for each agent.
  • More results must be consolidated.
  • More failure modes exist (what if agent 3 contradicts agent 2?).
  • More communication overhead (encoding and decoding information between agents).

This tax is real and non-trivial. A two-agent system is not twice as good as a single agent — it is potentially better at the task but definitely more expensive to run. The decision to use multiple agents must be justified by the improvement in output quality, not by architectural elegance.

The Multi-OS Model

Sometimes the right answer is not more agents within one OS, but multiple independent operating systems that collaborate.

When You Need Multiple OSs

Consider a large organization where different teams use different agentic systems:

  • The engineering team has a Coding OS tailored to their stack, conventions, and workflows.
  • The research team has a Research OS with access to academic databases, experiment tracking, and publication tools.
  • The support team has a Support OS connected to the ticket system, knowledge base, and customer data.

These are not agents within a shared system — they are independent operating systems with their own kernels, memory planes, governance rules, and operators. They have different security boundaries, different data access policies, and different optimization objectives.

Federation

Multi-OS coordination requires federation: a mechanism for independent systems to discover each other, negotiate capabilities, exchange work, and share results while respecting each system’s boundaries.

Federation is harder than multi-agent coordination. Within a single OS, the kernel has authority over all agents. Across OSs, there is no central authority. Coordination must be negotiated.

Key federation challenges:

  • Discovery: How does OS A know that OS B exists and can help with a particular task?
  • Trust: How does OS A verify that OS B will handle data responsibly? What governance policies apply at the boundary?
  • Protocol: What format do inter-OS messages take? How are tasks described, delegated, and results returned?
  • Conflict resolution: When OS A and OS B have conflicting information or recommendations, who wins?
  • Accountability: When a federated task fails, which OS is responsible?

The Organization as OS

At the highest level, you can think of an organization’s collection of agentic systems as an OS itself — a meta-OS whose processes are individual OSs, whose memory is the shared knowledge base, and whose governance is the organizational policies.

This is not just an analogy. It is a design pattern. The same principles that govern processes within an OS — isolation, communication channels, resource management, policy enforcement — govern the coordination of multiple OSs within an organization.

Choosing the Right Model

The choice between one agent, many agents, and many OSs depends on a small set of factors:

| Factor | Single Agent | Multi-Agent | Multi-OS |
|---|---|---|---|
| Task complexity | Low | Medium-High | Very High |
| Domain breadth | Narrow | Wide | Cross-organizational |
| Parallelism need | None | Significant | Independent |
| Security boundaries | One | Shared | Separate |
| Team structure | One person | One team | Multiple teams |
| Governance model | Uniform | Uniform | Heterogeneous |

In practice, most real systems are hybrid. A single-agent interaction for quick questions. Multi-agent orchestration for complex tasks. Multi-OS federation for cross-team workflows.

The Scaling Challenge

As you move from single agent to multi-agent to multi-OS, every dimension scales:

  • Context management scales from one window to many coordinated windows to distributed knowledge.
  • Planning scales from linear plans to parallel task graphs to federated work orders.
  • Governance scales from local policies to shared policies to negotiated policies.
  • Failure handling scales from retry to re-plan to cross-system recovery.
  • Cost scales from one model call to many to distributed billing.

The Agentic OS model handles this scaling because its abstractions — kernel, processes, memory, governance — are fractal. They apply at every level. A process within an OS and an OS within a federation follow the same structural patterns. This is not coincidence; it is the design principle that makes the OS analogy powerful.

The Right Amount of Agency

More agents is not always better. More coordination is not always smarter. The right architecture is the simplest one that achieves the goal reliably.

A single agent that reliably fixes a bug is better than a five-agent pipeline that does the same thing with more overhead. But a single agent that unreliably designs a distributed system is worse than a multi-agent team where each specialist contributes its expertise.

The Agentic OS does not prescribe a single topology. It provides the infrastructure — the process fabric, the memory plane, the governance plane — that supports any topology. The cognitive kernel chooses the topology at runtime, based on the task at hand. That adaptability is the system’s core strength.

Safety Without Killing Throughput

The easiest way to make an agentic system safe is to make it useless. Require human approval for every action. Restrict it to read-only operations. Limit it to pre-approved templates. The system will never do anything dangerous — or anything valuable.

The hardest and most important design challenge in agentic systems is achieving safety and throughput simultaneously. This chapter examines how the Agentic OS navigates this tension.

The Safety-Throughput Tradeoff

Safety and throughput are not fundamentally opposed, but they are in tension. Every safety check takes time. Every approval gate blocks execution. Every policy evaluation consumes resources. A system with no safety runs fast and breaks things. A system with maximum safety runs slowly and breaks nothing — including the user’s patience.

The goal is not to eliminate this tension but to manage it intelligently: apply heavy safety measures where the risk is high, and lightweight measures where the risk is low. This requires the system to accurately assess risk in real time.

Risk Assessment

Not all actions carry equal risk. The Agentic OS classifies actions along several dimensions:

Reversibility

Can the action be undone? Writing a file to a working directory is reversible (delete it). Sending an email is not. Pushing to a feature branch is reversible (force push). Pushing to main is recoverable but costly. Deleting a database table is catastrophic.

Reversible actions can proceed with low overhead. Irreversible actions demand proportionally higher scrutiny.

Blast Radius

How much damage can a failure cause? A wrong value in a local variable affects one function. A wrong value in a configuration file affects the entire service. A wrong value in a production database affects all users.

Small blast radius permits fast, autonomous action. Large blast radius requires stronger gates.

Confidence

How certain is the system about the correctness of the action? A rename refactoring supported by static analysis is high-confidence. A complex architectural change based on ambiguous requirements is low-confidence.

High-confidence actions can proceed freely. Low-confidence actions need verification.

Precedent

Has the system successfully performed this type of action before? A task similar to one completed successfully yesterday carries lower risk than a novel task type. History informs trust.
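These four dimensions can be folded into a single score. The weights and scales below are purely illustrative — a real system would calibrate them against its own incident history:

```python
# Hypothetical risk score combining the four dimensions above.
# Weights and scales are illustrative, not calibrated.
def risk_score(reversible: bool, blast_radius: int, confidence: float,
               prior_successes: int) -> float:
    """blast_radius: 1 (one function) .. 5 (all users);
    confidence and the result are in [0, 1]."""
    score = 0.0
    score += 0.0 if reversible else 0.4        # irreversibility dominates
    score += 0.1 * (blast_radius - 1)          # up to +0.4 for wide impact
    score += 0.2 * (1.0 - confidence)          # low confidence adds risk
    score -= min(prior_successes, 10) * 0.01   # precedent reduces risk, capped
    return max(0.0, min(1.0, score))

# Reversible, local, high-confidence, well-precedented: effectively zero.
low = risk_score(reversible=True, blast_radius=1, confidence=0.95, prior_successes=20)
# Irreversible, service-wide, uncertain, novel: high.
high = risk_score(reversible=False, blast_radius=4, confidence=0.4, prior_successes=0)
```

The point of a scalar score is not precision — it is that the score gives the governance plane a single, auditable input for choosing an autonomy level.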

The Staged Autonomy Model

The Agentic OS implements safety through staged autonomy: different actions get different levels of oversight based on their risk profile.

flowchart LR
  L0["Level 0\nFull Autonomy\n(read files, run tests)"] --> L1["Level 1\nNotify\n(create branch, add dep)"]
  L1 --> L2["Level 2\nConfirm\n(modify prod config)"]
  L2 --> L3["Level 3\nSupervised\n(auth logic, finances)"]
  L3 --> L4["Level 4\nHuman-Initiated\n(delete prod data)"]

  style L0 fill:#27ae60,stroke:#2ecc71,color:#fff
  style L1 fill:#2ecc71,stroke:#27ae60,color:#fff
  style L2 fill:#f39c12,stroke:#e67e22,color:#fff
  style L3 fill:#e67e22,stroke:#d35400,color:#fff
  style L4 fill:#c0392b,stroke:#e74c3c,color:#fff

Level 0: Full Autonomy

The system acts without any human involvement. Reserved for low-risk, reversible, high-confidence actions. Reading files, running tests, generating code in a sandbox, formatting documents.

Level 1: Notify

The system acts and notifies the human afterward. For medium-low risk actions where the human should be aware but does not need to approve. Creating a branch, adding a dependency, modifying test files.

Level 2: Confirm

The system proposes an action and waits for human confirmation before proceeding. For medium-high risk actions. Modifying production configuration, running database migrations, sending external communications.

Level 3: Supervised

The system generates a detailed plan and the human reviews and approves each significant step. For high-risk actions in sensitive domains. Changes to authentication logic, financial calculations, data deletion.

Level 4: Human-Initiated

The system will not take the action on its own, even if asked. The human must perform it and tell the system it was done. For actions with catastrophic and irreversible consequences. Deleting production data, deploying to critical infrastructure, modifying security policies.

The mapping from action to level is not static. It evolves based on the system’s track record, the operator’s trust profile, and the current environment. A new system starts with many Level 2 actions that gradually move to Level 1 or 0 as trust is established.
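Given a risk score, the mapping to a level can be a simple threshold table, with earned trust shifting the thresholds over time. All numbers here are illustrative:

```python
# Hypothetical mapping from a risk score in [0, 1] to a staged-autonomy level.
# Thresholds are illustrative and would evolve with the system's track record.
LEVELS = ["full_autonomy", "notify", "confirm", "supervised", "human_initiated"]

def autonomy_level(risk: float, trust_bonus: float = 0.0) -> str:
    """trust_bonus lowers the effective risk as the system earns trust."""
    effective = max(0.0, risk - trust_bonus)
    thresholds = [0.2, 0.4, 0.6, 0.8]
    for level, cutoff in zip(LEVELS, thresholds):
        if effective < cutoff:
            return level
    return LEVELS[-1]  # anything at or above 0.8 stays human-initiated

assert autonomy_level(0.05) == "full_autonomy"
assert autonomy_level(0.5) == "confirm"
assert autonomy_level(0.5, trust_bonus=0.2) == "notify"   # trust escalation
assert autonomy_level(0.95) == "human_initiated"
```

Note that `trust_bonus` should itself be capped: even a perfect track record should not move a catastrophic, irreversible action out of Level 4.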

Policy Enforcement Architecture

Safety policies are enforced through the governance plane, using a layered architecture:

flowchart TD
  A[Proposed Action] --> Pre["Pre-Action Policies\n• Capability check\n• Scope check\n• Risk check"]
  Pre -->|pass| Exec[Execute Action]
  Pre -->|blocked| Deny[Deny & Report]
  Exec --> During["During-Action Monitoring\n• Resource consumption\n• Trajectory monitoring\n• Anomaly detection"]
  During -->|anomaly| Pause[Pause & Escalate]
  During -->|normal| Post["Post-Action Validation\n• Output validation\n• Regression detection\n• Audit logging"]
  Post --> Done[Action Complete]

Pre-Action Policies

Before any action is taken, the kernel evaluates it against pre-action policies:

  • Capability check: Does this agent have the capability to perform this action? A code agent should not send emails. A research agent should not modify production databases.
  • Scope check: Is this action within the scope of the current task? An agent asked to fix a bug should not refactor the entire module.
  • Risk check: What is the risk level of this action? Does it exceed the autonomy level for the current context?

During-Action Monitoring

Some actions are monitored in real time:

  • Resource consumption: Is the agent consuming more tokens, time, or API calls than expected? Budget overruns may indicate a runaway process.
  • Trajectory monitoring: Is the agent making progress toward the goal, or has it entered a loop? Repeated similar actions without progress trigger intervention.
  • Anomaly detection: Is the agent doing something it has never done before in this context? Novel behaviors in sensitive domains may warrant a pause.

Post-Action Validation

After an action completes, the system validates the result:

  • Output validation: Does the output meet the expected format and constraints?
  • Regression detection: Did the action break something that was working? Run tests, check invariants.
  • Audit logging: Record what was done, why, and what the result was. This log is essential for accountability and debugging.
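The three stages can be sketched as a pipeline of hooks around the action. The check functions and action shape below are invented for illustration:

```python
# Hypothetical sketch of the pre / during / post enforcement pipeline.
class PolicyViolation(Exception):
    pass

def enforce(action: dict, pre_checks, post_checks, execute):
    # Pre-action: capability, scope, and risk checks run before anything happens.
    for check in pre_checks:
        if not check(action):
            raise PolicyViolation(f"blocked by {check.__name__}")
    result = execute(action)  # during-action monitoring would wrap this call
    # Post-action: validate the output and record the audit trail.
    audit = {"action": action["name"], "result": result}
    for check in post_checks:
        if not check(result):
            audit["flagged"] = check.__name__
    return result, audit

def has_capability(a):
    return a["name"] in a.get("capabilities", [])

def output_nonempty(r):
    return bool(r)

result, audit = enforce(
    {"name": "run_tests", "capabilities": ["run_tests"]},
    pre_checks=[has_capability],
    post_checks=[output_nonempty],
    execute=lambda a: "12 passed",
)
```

The structural choice here is that pre-action checks block, while post-action checks flag: once an action has executed, the system can no longer prevent it, only record, report, and trigger adaptation.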

Designing for Speed

Safety does not have to mean slow. Several design strategies maintain throughput while preserving safety:

Parallel Approval Paths

While a risky action waits for human approval, safe actions continue executing. The system does not block on a single gate — it routes around it and returns when the gate opens.

Speculative Execution

The system begins executing a risky action in a sandbox while waiting for approval. If approved, the sandboxed result is promoted. If rejected, it is discarded. This eliminates the latency of waiting, at the cost of potentially wasted compute.

Pre-Approved Patterns

For recurring tasks, the system learns which action patterns are always approved and caches the approval. “This operator always approves test file modifications” becomes a permanent Level 0 policy for that action type.

Batch Approval

Instead of asking for approval on each of ten similar actions, the system presents them as a batch: “I need to modify these 10 configuration files. Here are the changes. Approve all?” One approval, ten actions.

Trust Escalation

As the system demonstrates reliability, its autonomy increases. An agent that has successfully deployed 50 times without incident may be granted higher autonomy for deployment tasks. Trust is earned, tracked, and revocable.

The Circuit Breaker

Every agentic system needs a circuit breaker: a mechanism that stops all activity when something goes seriously wrong. This is not a graceful degradation — it is an emergency stop.

Circuit breakers trigger on:

  • Repeated failures: More than N failures in a time window.
  • Budget exhaustion: Cost exceeds the allocated maximum.
  • Policy violation: An action that violates a hard governance rule.
  • Anomaly spike: A sudden increase in unusual behaviors.
  • External signal: A human hits the stop button.

When the circuit breaker trips, the system halts all active processes, preserves their state for debugging, and reports to the operator. No further actions are taken until the operator reviews and resets the system.
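A minimal circuit breaker needs only a failure window, a trip flag, and an operator-gated reset. A sketch, with illustrative thresholds:

```python
import time

# Hypothetical circuit breaker: trips after too many failures in a time
# window, or on an explicit stop signal, and refuses further actions
# until an operator resets it.
class CircuitBreaker:
    def __init__(self, max_failures: int = 3, window_seconds: float = 60.0):
        self.max_failures = max_failures
        self.window = window_seconds
        self.failures: list[float] = []
        self.tripped = False

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        # Keep only failures inside the sliding window.
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.max_failures:
            self.tripped = True

    def stop(self):   # external signal: a human hits the stop button
        self.tripped = True

    def allow(self) -> bool:
        return not self.tripped

    def reset(self):  # only after operator review
        self.failures.clear()
        self.tripped = False

cb = CircuitBreaker(max_failures=3)
for t in (1.0, 2.0, 3.0):
    cb.record_failure(now=t)
```

After the third failure within the window, `cb.allow()` returns `False` for every subsequent action until `reset()` is called — which, per the text above, happens only after the operator has reviewed the preserved state.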

Safety as a Feature, Not a Constraint

The deepest insight about safety in agentic systems is that it is not a constraint on the system — it is a feature of the system. Users trust systems that are safe. Trust enables higher autonomy. Higher autonomy enables greater throughput.

A system that occasionally breaks things will have its autonomy reduced by its operators — manually, by adding more approval gates, by restricting capabilities. A system that reliably operates safely will have its autonomy expanded. Over time, the safe system is faster because it is trusted to do more without supervision.

Safety is not the enemy of throughput. Recklessness is the enemy of throughput, because recklessness destroys trust, and without trust, every action requires a human in the loop.

The Agentic OS builds safety into its architecture — in the governance plane, in the process fabric’s capability model, in the kernel’s risk assessment — so that it is not an afterthought bolted on but a structural property that enables everything else.

Reference Architecture

The previous parts of this book described the Agentic OS conceptually — its philosophy, its layers, its patterns, its approach to problem-solving. This part shifts to construction. How do you actually build one?

This chapter presents a reference architecture: a concrete, implementable blueprint for an Agentic OS. It is not the only possible architecture, but it is a coherent one that embodies the principles we have established.

Overview

The reference architecture has five major subsystems, mirroring the layers described in Part II:

block-beta
  columns 1
  OI["Operator Interface\nCLI, API, Chat, IDE Plugin"]
  GP["Governance Plane\nPolicies, Audit, Approval Gates, Budgets"]
  columns 3
  CK["Cognitive Kernel\nRouter, Planner, Scheduler"]
  PF["Process Fabric\nWorkers, Queues, Supervisors, Sandboxes"]
  MeP["Memory Plane\nWorking, Episodic, Semantic, Long-term Store"]
  columns 1
  TS["Tool & Skill Layer\nFile I/O, APIs, Databases, Search, Code Execution"]
  MP["Model Provider Layer\nLLM APIs, Embedding Models, Classifiers"]

  style OI fill:#1a5740,stroke:#3aaf7a,color:#e0f5ec
  style GP fill:#2b1f4e,stroke:#a78bfa,color:#e0f5ec
  style CK fill:#134a36,stroke:#3aaf7a,color:#e0f5ec
  style PF fill:#134a36,stroke:#3aaf7a,color:#e0f5ec
  style MeP fill:#0f3a2c,stroke:#2dd4bf,color:#e0f5ec
  style TS fill:#0f3a2c,stroke:#2dd4bf,color:#e0f5ec
  style MP fill:#0a2a1e,stroke:#3aaf7a,color:#7abfa8

Each subsystem is independent and communicates through well-defined interfaces. You can replace the model provider without touching the kernel. You can swap the memory store without affecting the process fabric. This is not accidental — it is the core design principle.

The Cognitive Kernel

The kernel is the entry point and coordinator. It receives requests from the operator interface and orchestrates everything else.

Components

  • Intent Router: Classifies incoming requests by complexity, domain, and risk. Routes simple requests directly to execution, complex requests to the planner.
  • Planner: Produces a task graph — a directed acyclic graph of subtasks with dependencies, resource requirements, and success criteria.
  • Scheduler: Prioritizes and sequences tasks from the plan, considering resource availability, policy constraints, and dependencies.
  • Result Consolidator: Collects outputs from completed tasks, resolves conflicts, and synthesizes the final result.

Implementation Notes

The kernel is stateless between requests. All state lives in the memory plane. This means the kernel can be restarted, scaled, or replaced without losing work in progress — the task graph and its state are persisted.

The kernel invokes the language model for reasoning (planning, classification, consolidation) but is not the language model. It is the control logic that decides when and how to invoke the model.
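To make the router's role concrete, here is a toy routing policy. In a real kernel the complexity and risk scores would come from a cheap classifier model; here they are assumed to arrive pre-computed on the request, and the route names are illustrative.

```python
def route_intent(request: dict) -> str:
    """Toy routing policy: simple, low-risk requests skip planning.

    `complexity` and `risk` are assumed to be produced upstream by a
    classification model; the threshold of 2 is illustrative.
    """
    if request["risk"] == "high":
        return "planner"           # risky work always gets an explicit plan
    if request["complexity"] <= 2:
        return "direct-execution"  # one worker, no task graph
    return "planner"
```

The point is not the specific thresholds but the shape: routing is deterministic control logic wrapped around model-produced signals.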

The Process Fabric

The process fabric manages the lifecycle of workers — the processes that actually execute tasks.

Components

  • Worker Pool: A set of available workers, each capable of executing tasks with specific tools and within specific sandboxes.
  • Process Manager: Spawns, monitors, and terminates workers. Tracks resource consumption (tokens, time, tool calls) per worker.
  • Sandbox Manager: Creates isolated execution environments for workers. Each sandbox defines what the worker can access: files, APIs, databases, network endpoints.
  • Communication Bus: The message-passing infrastructure that connects the kernel to workers and workers to each other when needed.

Worker Lifecycle

flowchart LR
  S[Spawn] --> I[Initialize\nload context, tools, policies] --> E[Execute] --> R[Report] --> T[Terminate]

Workers are ephemeral by default. They are created for a specific task, execute it, report results, and are terminated. Long-running workers are possible but are the exception, not the norm.

Isolation Model

Each worker runs in a scoped sandbox:

  • File access: Read/write permissions scoped to specific directories.
  • Tool access: Only the tools needed for the task. A code-writing worker has code tools. A research worker has search tools.
  • Model access: Budget-limited access to the language model. A worker cannot consume unbounded tokens.
  • Network access: Whitelisted endpoints only. No unrestricted internet access from workers.
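A sandbox of this kind can be expressed as plain data. The following sketch is one possible shape; every field and the example worker configuration are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sandbox:
    """Scoped permissions for a single worker. Field names are illustrative."""
    readable_dirs: frozenset = frozenset()
    writable_dirs: frozenset = frozenset()
    tools: frozenset = frozenset()
    allowed_hosts: frozenset = frozenset()  # network whitelist
    token_budget: int = 0                   # model access is budget-limited

    def allows_tool(self, name: str) -> bool:
        return name in self.tools

    def allows_host(self, host: str) -> bool:
        return host in self.allowed_hosts

# A hypothetical code-writing worker: code tools, no network.
code_worker = Sandbox(
    readable_dirs=frozenset({"/project"}),
    writable_dirs=frozenset({"/project/src", "/project/tests"}),
    tools=frozenset({"file_read", "file_write", "python_exec"}),
    allowed_hosts=frozenset(),
    token_budget=50_000,
)
```

Because the sandbox is immutable data, the kernel can log it, diff it against policy, and hand it to the sandbox manager without the worker ever seeing how enforcement works.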

This isolation is not about distrust — it is about focus. A worker with access to everything is a worker distracted by everything.

The Memory Plane

The memory plane provides persistence and retrieval across all time horizons.

Memory Tiers

  • Working Memory: The current context for an active task. Assembled by the kernel when spawning a worker, containing the task description, relevant code, conversation history, and constraints. Limited by context window size.
  • Episodic Memory: Records of past interactions, decisions, and outcomes. “Last time we deployed this service, we hit a rate limit on the external API.” Stored as structured events with timestamps and metadata.
  • Semantic Memory: Long-term knowledge indexed for retrieval. Project documentation, coding conventions, architectural decisions, API references. Stored as embeddings in a vector database.
  • Procedural Memory: Learned procedures and patterns. “When fixing a flaky test, first check for race conditions, then timing dependencies, then external service mocks.” Stored as retrievable strategies.

Memory Operations

  • Store: Write new information to the appropriate tier with metadata (source, confidence, timestamp, scope).
  • Retrieve: Query memory by relevance, recency, or explicit key. The retrieval system ranks results by a combination of semantic similarity, temporal relevance, and contextual fit.
  • Consolidate: Periodically compress and merge memories. Five separate episodic memories about the same deployment issue become one consolidated insight.
  • Forget: Remove memories that are no longer relevant, incorrect, or superseded. Forgetting is as important as remembering — stale knowledge is worse than no knowledge.

The Governance Plane

The governance plane enforces policies across the entire system.

Components

  • Policy Engine: Evaluates actions against a set of rules before, during, and after execution. Policies are declarative: “No worker may delete files outside the project directory.” “All database mutations require Level 2 approval.”
  • Approval Manager: Manages the lifecycle of approval requests. When an action requires human approval, the approval manager creates a request, routes it to the appropriate operator, tracks its status, and unblocks the action when approved.
  • Audit Logger: Records every significant action, decision, and outcome. The audit log is append-only and tamper-evident. It answers: What happened? When? Why? Who authorized it?
  • Budget Controller: Tracks and enforces resource budgets. Token consumption, API call counts, execution time, monetary cost. When a budget is exhausted, the controller halts the relevant work and reports.

Policy Evaluation Flow

flowchart LR
  AP[Action Proposed] --> PE[Policy Engine Evaluates]
  PE -->|Allowed| Proceed
  PE -->|Conditional| AM[Approval Manager] --> Wait --> PA[Proceed or Abort]
  PE -->|Denied| BR[Block & Report]

Policy evaluation happens at multiple points: when the planner creates a task (pre-plan), when a worker is about to execute an action (pre-action), and when a result is produced (post-action).
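The three-way outcome above can be sketched as a first-match rule evaluator. The rule contents mirror the example policies quoted earlier; the class shape, first-match semantics, and default-allow behavior are illustrative assumptions.

```python
from enum import Enum

class Verdict(Enum):
    ALLOWED = "allowed"
    CONDITIONAL = "conditional"  # routed to the approval manager
    DENIED = "denied"            # blocked and reported

class PolicyEngine:
    """Declarative rules evaluated against a proposed action.

    Each rule is a (predicate, verdict) pair; the first matching rule
    wins, and the default is ALLOWED. A real engine would also record
    which rule fired, for the audit log.
    """
    def __init__(self, rules):
        self.rules = rules

    def evaluate(self, action: dict) -> Verdict:
        for predicate, verdict in self.rules:
            if predicate(action):
                return verdict
        return Verdict.ALLOWED

engine = PolicyEngine(rules=[
    # "No worker may delete files outside the project directory."
    (lambda a: a["type"] == "file_delete"
               and not a["path"].startswith("/project/"), Verdict.DENIED),
    # "All database mutations require approval."
    (lambda a: a["type"] == "db_mutation", Verdict.CONDITIONAL),
])
```

The same `evaluate` call runs at each checkpoint: pre-plan on proposed tasks, pre-action on tool invocations, post-action on produced results.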

The Operator Interface

The operator interface is how humans interact with the system.

Interface Types

  • Chat Interface: Natural language conversation. The most accessible but least structured interface.
  • CLI: Command-line interface for power users and automation. Supports batch operations, scripting, and integration with existing workflows.
  • API: Programmatic interface for integration with other systems. RESTful or gRPC, with authentication, rate limiting, and versioning.
  • IDE Plugin: Embedded in the developer’s editor. Context-aware — the plugin knows what file is open, what code is selected, what errors exist.

Interface Responsibilities

All interfaces share the same responsibilities:

  • Authenticate the operator and establish their permission level.
  • Accept input (request, command, API call) and pass it to the kernel.
  • Stream progress back to the operator (plan status, worker output, approval requests).
  • Present results in the appropriate format for the interface type.
  • Accept feedback and corrections.

The interface does not make decisions. It is a conduit between the human and the kernel.

The Tool & Skill Layer

Tools are the system’s hands — the mechanisms through which it affects the world.

Tool Categories

  • File operations: Read, write, create, delete, search files.
  • Code execution: Run code in sandboxed environments with controlled inputs and outputs.
  • Search: Full-text search, semantic search, web search.
  • API integration: HTTP clients for external services, with authentication and rate limiting.
  • Database access: Query and modify databases within scoped permissions.
  • Communication: Send messages, create tickets, post comments.

Tool Registry

Tools are registered in a central registry with metadata:

  • Name and description: What the tool does, in terms a language model can understand.
  • Schema: Input and output types, required and optional parameters.
  • Risk level: What governance policies apply when this tool is used.
  • Cost: Estimated resource consumption per invocation.
  • Dependencies: What other tools or services this tool requires.

The tool registry is the mechanism by which the system discovers and selects tools at runtime. When a worker needs to perform an action, it queries the registry for tools matching its need, filtered by its sandbox permissions.
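A registry query of this kind might look as follows. The keyword-matching lookup is a stand-in for whatever discovery mechanism a real system uses (semantic search over descriptions, for instance); the entry fields mirror the metadata list above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolEntry:
    """Registry metadata for one tool. Fields mirror the list above."""
    name: str
    description: str       # phrased so a language model can understand it
    risk_level: str        # governance policies key off this
    cost_estimate: float   # rough resource consumption per invocation

class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, entry: ToolEntry) -> None:
        self._tools[entry.name] = entry

    def find(self, keyword: str, permitted: set) -> list:
        """Match tools to a need, filtered by sandbox permissions.

        Substring matching stands in for semantic discovery here.
        """
        return [t for t in self._tools.values()
                if keyword in t.description and t.name in permitted]

registry = ToolRegistry()
registry.register(ToolEntry("file_write", "write text to a file", "medium", 0.0))
registry.register(ToolEntry("db_query", "query a database table", "high", 0.01))
```

The key property is the `permitted` filter: a worker only ever sees tools its sandbox already allows, so discovery cannot widen its capabilities.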

The Model Provider Layer

The model provider layer abstracts the language models the system uses.

Abstraction

The system never calls a specific model directly. It calls the model provider with a request specifying:

  • Task type: Reasoning, generation, classification, embedding.
  • Quality requirements: High accuracy vs. fast response.
  • Budget constraints: Maximum tokens, maximum cost.

The model provider selects the appropriate model based on these requirements. A classification task might use a fast, cheap model. A complex planning task might use a large, expensive model. This selection is transparent to the rest of the system.
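Model selection can be sketched as a lookup over a capability table. The model names, quality tiers, and costs below are entirely hypothetical; the point is that the caller specifies requirements and the provider resolves them to a concrete model.

```python
# Hypothetical capability table; names and costs are invented for illustration.
MODELS = [
    {"name": "small-fast", "tasks": {"classification", "embedding"},
     "quality": "fast", "cost_per_1k_tokens": 0.0002},
    {"name": "large-deep", "tasks": {"reasoning", "generation"},
     "quality": "high", "cost_per_1k_tokens": 0.01},
]

def select_model(task_type: str, quality: str, max_cost_per_1k: float) -> str:
    """Pick the cheapest model that handles the task at the requested
    quality and fits the budget; fail loudly if nothing qualifies."""
    candidates = [m for m in MODELS
                  if task_type in m["tasks"]
                  and m["quality"] == quality
                  and m["cost_per_1k_tokens"] <= max_cost_per_1k]
    if not candidates:
        raise LookupError(f"no model satisfies {task_type}/{quality} within budget")
    return min(candidates, key=lambda m: m["cost_per_1k_tokens"])["name"]
```

Because callers never name a model, swapping providers or adding a new model is a table change, invisible to the kernel and workers.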

Provider Interface

sequenceDiagram
  participant K as Kernel
  participant MP as Model Provider
  participant M as Model
  K->>MP: Request(task_type, prompt, constraints)
  MP->>MP: Select model by capability & budget
  MP->>M: Invoke selected model
  M-->>MP: Raw response
  MP-->>K: Response(content, metadata, cost)
  Note over K: metadata: tokens, model, latency, confidence

Metadata includes token counts, model used, latency, and confidence signals. This information feeds back to the budget controller and the kernel’s decision-making.

Putting It Together

A request flows through the architecture as follows:

  1. Operator submits a request through any interface.
  2. Kernel receives the request, runs the intent interpretation pipeline.
  3. Kernel classifies complexity and creates a task graph via the planner.
  4. Governance evaluates the plan against policies. Flags risky steps.
  5. Scheduler orders tasks by priority and dependency.
  6. Process fabric spawns workers for ready tasks, each in a scoped sandbox.
  7. Workers execute tasks using tools, with model provider calls as needed.
  8. Memory plane supplies context to workers and stores results.
  9. Governance monitors execution, gates risky actions.
  10. Kernel consolidates results from completed workers.
  11. Operator receives the final result through the interface.

Each step is logged, auditable, and policy-governed. The system is not a black box — it is a transparent pipeline with inspection points at every stage.

What This Architecture Is Not

This architecture is not a product specification. It does not prescribe technology choices (which database, which queue, which language). It does not mandate a deployment topology (monolith, microservices, serverless). These are implementation decisions that depend on context.

What it provides is a structural guarantee: if your system has these components with these interfaces, it will support the patterns, governance, and operational modes described in this book. The next chapter examines where to draw the boundaries between these components.

Component Boundaries

The reference architecture identifies subsystems. This chapter examines where to draw the lines between them — and why those lines matter more in agentic systems than in traditional software.

Why Boundaries Matter

In a traditional application, a poorly drawn boundary causes maintenance pain: tangled dependencies, hard-to-test modules, deployment bottlenecks. In an agentic system, a poorly drawn boundary causes behavioral failures: a worker that cannot do its job because it lacks access to a tool it needs, a governance policy bypassed because it was enforced in the wrong layer, a memory leak that degrades reasoning quality over time.

Boundaries in an agentic system are not just architectural niceties. They are the mechanism that enables isolation, security, composability, and independent evolution of components.

The Kernel Boundary

The cognitive kernel sits at the center of the system, but it must not become the center of everything.

What belongs in the kernel

  • Intent interpretation and classification.
  • Plan creation and adaptation.
  • Task scheduling and prioritization.
  • Result consolidation.
  • Worker lifecycle decisions (spawn, terminate, restart).

What does not belong in the kernel

  • Domain logic. The kernel should not know how to write code, analyze data, or draft documents. That knowledge belongs in workers.
  • Tool invocation. The kernel delegates to workers, which invoke tools. The kernel never directly calls a file system API or a database.
  • Long-running state. The kernel coordinates but does not hold state. State belongs in the memory plane.

The test

If the kernel needs to be modified when you add a new domain (e.g., supporting legal document review in addition to code generation), the boundary is wrong. The kernel should be domain-agnostic. New domains are added by registering new tools, skills, and workers — not by changing the kernel.

The Process Boundary

Each worker process has a boundary defined by its sandbox.

The Sandbox as Contract

The sandbox is not just a security mechanism — it is a contract between the kernel and the worker. The contract states:

  • What you can access: These files, these tools, these APIs.
  • What you must produce: An output conforming to this schema.
  • What you must not do: Access anything outside the sandbox.
  • What resources you have: This many tokens, this much time.

This contract enables the kernel to reason about workers without understanding their internals. A code worker and a research worker have different sandboxes but the same contractual interface: take input, produce output, respect constraints.

Inter-Process Communication

Workers should communicate through the kernel, not directly with each other. Direct worker-to-worker communication creates hidden dependencies, bypasses governance, and makes the system’s behavior opaque.

When worker A needs information from worker B, the flow is:

sequenceDiagram
  participant A as Worker A
  participant K as Kernel
  participant B as Worker B
  A->>K: Reports need
  K->>B: Retrieves output
  B-->>K: Returns result
  K-->>A: Delivers data

This is slightly less efficient than direct communication but dramatically more observable and controllable. The kernel can log the exchange, apply policies, and maintain a complete picture of information flow.

Exceptions

Pipeline patterns, where the output of one worker feeds directly into the next, can use direct handoff for efficiency — but the kernel must be notified and the handoff must be logged.

The Memory Boundary

Memory is shared infrastructure, but access must be scoped.

Memory Scoping Rules

  • Working memory is private to the worker that holds it. No other worker reads or writes to it directly.
  • Episodic memory is scoped by project, team, or task. A worker on project A does not see episodic memories from project B unless explicitly granted.
  • Semantic memory is broadly accessible, but retrieval patterns vary. A code worker retrieves code-relevant knowledge. A documentation worker retrieves writing guidelines. The retrieval query, not access control, provides the natural scoping.
  • Procedural memory is scoped by domain. Code strategies are available to code workers. Research strategies are available to research workers.

The Memory API

All memory access goes through a unified API:

flowchart LR
  W[Worker] --> API[Memory API]
  API -->|"store(content, tier, scope, metadata)"| DB[(Storage)]
  API -->|"retrieve(query, tier, scope, filters)"| DB
  API -->|"update(memory_id, content)"| DB
  API -->|"forget(memory_id)"| DB
  DB -->|memory_id / results| API
  API --> W

Workers never access the underlying storage directly. This abstraction allows the memory system to change its implementation — switching from a local vector database to a distributed one, for example — without affecting any worker.
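A minimal facade illustrating this boundary: workers call `store`, `retrieve`, and `forget`, while the backing store stays private. The in-memory dict and substring matching stand in for a real vector database and semantic ranking; all names are illustrative.

```python
import itertools

class MemoryAPI:
    """Unified front for all memory access. Swapping the backing store
    for a vector database would not change the interface workers see."""

    def __init__(self):
        self._store = {}               # private: workers never touch this
        self._ids = itertools.count(1)

    def store(self, content, tier, scope, metadata=None):
        memory_id = next(self._ids)
        self._store[memory_id] = {"content": content, "tier": tier,
                                  "scope": scope, "metadata": metadata or {}}
        return memory_id

    def retrieve(self, query, tier, scope):
        # Substring match stands in for semantic ranking in a real system.
        return [m["content"] for m in self._store.values()
                if m["tier"] == tier and m["scope"] == scope
                and query in m["content"]]

    def forget(self, memory_id):
        self._store.pop(memory_id, None)

mem = MemoryAPI()
mid = mem.store("rate limit hit during deploy", tier="episodic", scope="project-a")
```

Note how the scoping rules fall out of the interface: a query scoped to project-a simply cannot see project-b's episodic memories.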

The Governance Boundary

Governance sits above everything. Its boundary is defined by the principle: governance cannot be bypassed.

Enforcement Points

The governance plane has hooks at every boundary crossing:

  • Kernel → Worker: When the kernel spawns a worker, governance verifies the sandbox configuration, tool permissions, and budget allocation.
  • Worker → Tool: When a worker invokes a tool, governance evaluates the action against policies before the tool executes.
  • Worker → Memory: When a worker writes to memory, governance checks whether the data classification allows it.
  • Worker → External: When a worker communicates with an external service, governance verifies the endpoint is allowed and the data leaving the system is permitted.

Governance as Middleware

Governance is best implemented as middleware that wraps every boundary crossing, not as a monolithic checkpoint at the entrance. Monolithic governance catches policy violations at the front door but misses them inside the house. Middleware governance catches them everywhere.

flowchart LR
  subgraph without["Without Middleware"]
    direction LR
    R1[Request] --> PC1[Policy Check] --> K1[Kernel] --> W1[Workers] --> T1[Tools] --> D1[Done]
  end
  subgraph with["With Middleware"]
    direction LR
    R2[Request] --> PC2[Policy] --> K2[Kernel] --> PC3[Policy] --> W2[Worker] --> PC4[Policy] --> T2[Tool] --> D2[Done]
  end
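In code, middleware governance amounts to wrapping every boundary-crossing function with a policy check and an audit entry. This decorator is a minimal sketch; the `policy_check` signature, the log format, and the example read-only policy are all assumptions.

```python
def governed(boundary, policy_check, audit_log):
    """Wrap a boundary-crossing function so every call is policy-checked
    and logged. `policy_check` is supplied by the governance plane;
    its signature here is illustrative."""
    def wrap(fn):
        def inner(*args, **kwargs):
            if not policy_check(boundary, fn.__name__, args, kwargs):
                audit_log.append((boundary, fn.__name__, "blocked"))
                raise PermissionError(f"{boundary}: {fn.__name__} denied by policy")
            result = fn(*args, **kwargs)
            audit_log.append((boundary, fn.__name__, "allowed"))
            return result
        return inner
    return wrap

log = []
# Toy policy: at this boundary, only read operations are allowed.
allow_reads_only = lambda boundary, name, args, kwargs: name.startswith("read")

@governed("worker->tool", allow_reads_only, log)
def read_file(path):
    return f"contents of {path}"

@governed("worker->tool", allow_reads_only, log)
def delete_file(path):
    return "deleted"
```

The same wrapper applies at every crossing listed above (kernel to worker, worker to memory, worker to external), which is what makes the policy impossible to bypass from inside the house.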

The Tool Boundary

Tools are the system’s interface with the external world. Their boundaries must be crisp.

Tool Interface Requirements

Every tool must:

  • Declare its inputs and outputs: Types, formats, constraints. The system must know what to send and what to expect back.
  • Declare its side effects: Does this tool read only, or does it modify state? Does it communicate externally? The governance plane needs this information.
  • Handle its own errors: A tool failure should not crash the worker. The tool returns a structured error that the worker can interpret and handle.
  • Respect timeouts: Tools that hang block workers. Every tool invocation must have a timeout, with clean failure on expiry.
  • Be idempotent when possible: Running the same tool twice with the same input should produce the same result. This simplifies retry logic.
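The timeout and structured-error requirements can be sketched with a thread-based invoker that never lets a tool exception escape into the worker. The result-dict shape is an assumption; note also that a thread cannot be forcibly killed in Python, so a production system would run tools in a separate sandboxed process it can terminate.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def invoke_tool(fn, args, timeout_s=5.0):
    """Run a tool with a hard timeout, returning a structured result
    instead of raising into the worker. Dict shape is illustrative."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args)
        try:
            return {"ok": True, "value": future.result(timeout=timeout_s)}
        except FutureTimeout:
            return {"ok": False, "error": "timeout", "detail": f">{timeout_s}s"}
        except Exception as exc:
            # The tool handles its own errors: failure becomes data,
            # not a crashed worker.
            return {"ok": False, "error": type(exc).__name__, "detail": str(exc)}
```

The worker can then branch on `result["ok"]` and apply its retry logic, which is where idempotency pays off.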

Tool Composition

Complex capabilities are built by composing simple tools. “Deploy the service” is not a single tool — it is a sequence: build, test, push image, update config, roll out. Each step is a tool. The composition logic lives in the worker, not in the tools themselves.

This means tools should be small, focused, and composable. A tool that does too much is hard to compose, hard to scope, and hard to govern.

The Operator Boundary

The boundary between the operator (human) and the system is the most important one to get right.

What Crosses the Boundary

  • Inward: Requests, approvals, corrections, feedback, configuration.
  • Outward: Results, progress updates, approval requests, error reports, audit summaries.

What Does Not Cross

  • Implementation details. The operator does not need to see which model was used for step 3 of subtask 7 (unless they ask).
  • Internal state. The operator sees the plan and the results, not the raw context windows.
  • Transient failures. If a retry succeeds, the operator does not need to know about the initial failure (though it should be logged for audit).

The Transparency Dial

Different operators want different levels of visibility. A developer debugging the system wants to see everything. An end user wants to see the result. The operator boundary should support a transparency dial — from minimal (just the answer) to maximal (every decision, every action, every policy evaluation).

Anti-Patterns in Boundary Design

The God Kernel

A kernel that does everything: parses requests, executes domain logic, calls tools, manages memory, enforces policies. This kernel is impossible to test, impossible to extend, and impossible to debug.

The Leaky Sandbox

A worker sandbox with undeclared access. The worker “happens to” have access to the production database because the sandbox was configured too broadly. This is a security incident waiting to happen.

The Bypass Channel

A direct connection between a worker and an external service that skips governance. “It was faster to call the API directly.” Faster, yes — until the API returns sensitive data that the worker should not have seen.

The Shared Memory Free-for-All

All workers read and write to the same memory space without scoping. Worker A writes a conclusion. Worker B reads it, misinterprets it, and acts on it. Worker A never intended the conclusion to be shared. The result is chaos.

Drawing Good Boundaries

Good boundaries follow three principles:

  1. Each component has one reason to change. The kernel changes when orchestration logic changes. The memory plane changes when storage requirements change. Tools change when external APIs change. They do not change for each other’s reasons.

  2. Every boundary crossing is explicit and observable. No hidden channels, no implicit dependencies. If component A uses component B, there is an interface, a contract, and a log entry.

  3. Boundaries enforce the minimum necessary coupling. Components know about each other’s interfaces, never about each other’s internals. A tool does not know what worker is calling it. A worker does not know what model the kernel selected for planning.

These principles are not new. They are the same principles that make operating systems, microservices, and well-designed libraries work. The Agentic OS applies them to a new domain — but the engineering discipline is timeless.

Skills, Operators, and Reusable Building Blocks

An operating system is only as useful as the programs that run on it. Linux without packages is a curiosity. Windows without applications is an expensive boot screen.

The same is true for the Agentic OS. The kernel, the process fabric, the memory plane, the governance plane — these are infrastructure. What makes the system useful are the skills, operators, and building blocks that run on top of this infrastructure.

Skills

A skill is a packaged capability: a bundle of instructions, tools, and strategies that enables the system to perform a specific type of work.

Anatomy of a Skill

A skill consists of:

  • Instructions: Guidance for the language model when performing this type of work. What to prioritize, what to avoid, what patterns to follow, what quality standards to apply.
  • Tools: The specific tools required. A coding skill needs file access, code execution, and test runners. A research skill needs search, web access, and citation tools.
  • Strategies: Procedural knowledge about how to approach the work. “When writing unit tests, start with the happy path, then edge cases, then error cases.” These are the accumulated best practices for the domain.
  • Validation criteria: How to verify the skill’s output is correct. Code should compile and pass tests. Research should cite sources. Writing should meet style guidelines.

Skill Registration

Skills are registered in the system’s skill registry, which the kernel queries to match capabilities to tasks:

flowchart TD
  subgraph Skill["Skill: python-backend"]
    direction TB
    Desc["Develop Python backend services"]
    Domains["Domains: code, python, backend"]
    Tools["Tools: file_read, file_write,\npython_exec, pytest, pip"]
    Inst["Instructions: PEP 8, type hints,\ntests for public functions"]
    subgraph Strategies
      NE["new-endpoint:\ndefine route \u2192 implement \u2192\nvalidate \u2192 test \u2192 docs"]
      FB["fix-bug:\nreproduce \u2192 root cause \u2192\nfailing test \u2192 fix \u2192 verify"]
    end
  end
  K[Kernel] -->|queries| Registry[(Skill Registry)]
  Registry --> Skill
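Expressed as data, the python-backend skill from the diagram might look like the dictionary below, with a matching function standing in for the kernel's registry query. The dict shape and the subset-matching rule are illustrative assumptions.

```python
# A skill as declarative data, mirroring the python-backend example above.
PYTHON_BACKEND_SKILL = {
    "name": "python-backend",
    "description": "Develop Python backend services",
    "domains": {"code", "python", "backend"},
    "tools": ["file_read", "file_write", "python_exec", "pytest", "pip"],
    "instructions": "PEP 8, type hints, tests for public functions",
    "strategies": {
        "new-endpoint": ["define route", "implement", "validate", "test", "docs"],
        "fix-bug": ["reproduce", "root cause", "failing test", "fix", "verify"],
    },
}

def match_skills(registry, required_domains):
    """Return skills whose domains cover the task's requirements.

    Subset matching is a simplification of real capability matching.
    """
    return [s["name"] for s in registry
            if required_domains <= s["domains"]]
```

Because the skill is data rather than code, publishing, versioning, and composing skills reduces to managing structured documents.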

Skill Composition

Skills can compose. A “full-stack feature” task might invoke the python-backend skill for the API, a react-frontend skill for the UI, and a postgres-database skill for the schema. The kernel selects and combines skills based on the task requirements.

Composition is horizontal (multiple skills for different aspects of one task) and vertical (a high-level skill delegates to lower-level skills). A deploy skill might internally use docker-build, kubernetes-apply, and health-check skills.

Skill Quality

Not all skills are equal. A skill’s quality depends on:

  • Instruction clarity: Vague instructions produce vague results. “Write good code” is not a skill. “Write Python code following PEP 8 with type hints and 80% test coverage” is.
  • Strategy completeness: Skills with well-defined strategies for common scenarios outperform skills that rely on the model to figure out the approach.
  • Tool fitness: Skills that provide the right tools — not too many, not too few — enable focused execution.
  • Validation robustness: Skills with strong validation criteria catch errors early.

Operators

In the Agentic OS, an operator is a human role — the person or team that interacts with the system. But “operator” also describes a reusable pattern for how humans and systems collaborate.

Operator Profiles

Different operators have different needs, permissions, and interaction styles. An operator profile captures these:

  • Permissions: What actions this operator can authorize. What data they can access. What budgets they control.
  • Preferences: How much detail they want in responses. What format they prefer. How often they want progress updates.
  • Trust level: How much autonomy the system has when acting on this operator’s behalf. New operators start with lower trust. Established operators earn higher trust.
  • Context: What projects, repositories, and systems this operator works with. This context accelerates intent interpretation.

Operator Adaptation

The system adapts to its operators over time. It learns:

  • That this developer always wants verbose error messages.
  • That this manager prefers summaries over details.
  • That this team reviews PRs within an hour, so approval gates have a short expected wait time.
  • That this operator never approves deletions without seeing a backup confirmation.

These adaptations are stored in the memory plane and applied automatically, reducing friction with every interaction.

Reusable Building Blocks

Below skills and operators, the system is built from composable building blocks — small, well-defined units of functionality that can be combined to create new capabilities.

Prompt Templates

Reusable prompt structures for common operations:

  • Analysis template: “Given {context}, analyze {target} for {criteria}. Report findings as {format}.”
  • Generation template: “Given {requirements} and {constraints}, generate {artifact}. Validate against {criteria}.”
  • Review template: “Review {artifact} against {standards}. List issues by severity. Suggest fixes.”

Templates are not rigid scripts — they are starting points that the system customizes based on context. But they encode best practices: the order of information, the type of output, the validation step.
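Instantiating such a template is a simple slot fill; the sketch below uses the analysis template from the list above with invented slot values, and fails loudly on a missing slot rather than emitting a half-filled prompt.

```python
ANALYSIS_TEMPLATE = (
    "Given {context}, analyze {target} for {criteria}. "
    "Report findings as {format}."
)

def fill(template: str, **slots) -> str:
    """Instantiate a template; a missing slot raises KeyError rather
    than producing a half-filled prompt."""
    return template.format(**slots)

prompt = fill(
    ANALYSIS_TEMPLATE,
    context="the payment service codebase",     # invented example values
    target="checkout.py",
    criteria="unhandled error paths",
    format="a severity-ordered list",
)
```

Customization then means choosing or generating better slot values for the context at hand, while the template keeps the best-practice structure fixed.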

Workflow Patterns

Reusable orchestration patterns that combine multiple steps:

  • Generate-Test-Fix: Generate an artifact, test it, fix issues, repeat until tests pass. Used for code, configurations, and data transformations.
  • Research-Synthesize-Present: Gather information from multiple sources, synthesize findings, present a coherent summary. Used for analysis, due diligence, and decision support.
  • Draft-Review-Revise: Produce a draft, review against criteria, revise based on feedback. Used for documents, designs, and proposals.
  • Monitor-Alert-Respond: Continuously observe a system, detect anomalies, trigger appropriate responses. Used for operations, security, and compliance.

Context Assemblers

Reusable logic for building the right context for a task:

  • Code context assembler: Given a target file, gather the file itself, its imports, its tests, its recent changes, and the project’s coding conventions.
  • Project context assembler: Gather the project’s README, architecture docs, dependency list, and active issues.
  • User context assembler: Gather the user’s preferences, recent interactions, and active tasks.

Context assembly is critical — the quality of the system’s output is directly proportional to the quality of the context it receives. Good assemblers produce focused, relevant context. Poor assemblers dump everything into the context window and hope for the best.

Validators

Reusable validation functions that verify output quality:

  • Code validators: Compile, lint, type-check, run tests.
  • Document validators: Check word count, readability score, required sections, citation completeness.
  • Data validators: Schema conformance, range checks, referential integrity.
  • Security validators: Check for common vulnerabilities, credential exposure, injection risks.

Validators are the system’s quality assurance layer. Every building block that produces output should have a corresponding validator.

The Package Ecosystem

Skills, templates, workflows, assemblers, and validators are all packages — distributable, versioned units of functionality.

Package Structure

A package contains:

  • Metadata: Name, version, description, author, dependencies.
  • Assets: Instructions, prompt templates, tool configurations, validation rules.
  • Tests: Automated tests that verify the package works correctly.
  • Documentation: What the package does, how to use it, what it requires.

Package Lifecycle

Packages are developed, tested, published, installed, and updated independently. A team can develop a custom skill for their specific domain, test it against their codebase, and publish it for others to use. Updates are versioned, so the system can pin to a specific version and upgrade deliberately.

Composability Principles

For building blocks to compose well, they must follow a few shared principles:

  • Single responsibility: Each block does one thing.
  • Declared dependencies: A block explicitly states what it needs.
  • Standard interfaces: Blocks communicate through shared formats and protocols.
  • Self-describing: A block carries enough metadata for the system to discover and use it without human explanation.

Building vs. Configuring

The most powerful aspect of this building-block architecture is that most customization is configuration, not code.

To add support for a new programming language, you do not modify the kernel. You register a new skill with language-specific instructions, tools, and validators.

To support a new team’s workflow, you do not rebuild the process fabric. You define an operator profile with the right permissions, preferences, and trust levels.

To create a new type of analysis, you do not write new orchestration logic. You compose existing templates, workflows, and validators into a new skill.

The infrastructure is general. The building blocks are specific. This separation is what makes the system extensible without being fragile — and reusable without being rigid.

Extensibility and Evolution

A system that cannot change is a system that is dying. Software, organizations, and ecosystems all follow the same rule: adapt or become irrelevant. The Agentic OS is designed to evolve — not through heroic rewrites but through incremental extension.

This chapter examines the mechanisms that make an Agentic OS extensible and the strategies that allow it to evolve gracefully over time.

The Extensibility Imperative

Agentic systems face a uniquely aggressive extensibility challenge. Traditional software extends in response to new requirements — quarterly releases, annual roadmaps. Agentic systems must extend in response to new capabilities that emerge continuously:

  • New models appear monthly, each with different strengths and context sizes.
  • New tools and APIs proliferate daily.
  • New domains become automatable as capabilities improve.
  • New governance requirements emerge as regulations evolve.
  • New operators join with different needs and workflows.

A system that requires a rewrite to accommodate a new model or a new tool is not an Agentic OS — it is a demo.

Extension Points

The reference architecture provides extension points at every layer:

Model Provider Extensions

Adding a new language model should require nothing more than implementing the provider interface:

classDiagram
  class ModelProvider {
    +name: string
    +request(prompt, constraints) response
  }
  class Model {
    +name: string
    +capabilities: string[]
    +context_window: int
    +cost_per_token: float
  }
  ModelProvider --> "1..*" Model : provides
  class Kernel {
    +selectProvider(requirements)
  }
  Kernel --> ModelProvider : uses

The kernel does not know or care which model is running. It knows what capabilities it needs and what budget it has. The model provider layer handles the rest.

Tool Extensions

New tools plug into the tool registry:

classDiagram
  class Tool {
    +name: string
    +description: string
    +risk_level: string
    +requires: string[]
  }
  class JiraIntegration {
    +create_issue(project, type, summary, desc) issue_key
    +update_issue(issue_key, fields) void
    +search_issues(query) issues[]
    +risk_level: medium
    +requires: network, jira_credentials
  }
  Tool <|-- JiraIntegration
  class ToolRegistry {
    +register(tool) void
    +discover(capability) tool[]
  }
  ToolRegistry --> "*" Tool

Once registered, the tool is immediately available to workers whose sandboxes grant access to it. No kernel changes, no process fabric changes, no redeployment of the core system.
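A registry with `register` and `discover`, as in the diagram, can be sketched as follows. The keyword-matching discovery is a deliberate simplification; a real registry would likely match on declared capabilities or embeddings.

```python
# Sketch of a tool registry: registration plus capability discovery.
from dataclasses import dataclass, field

@dataclass
class ToolSpec:
    name: str
    description: str
    risk_level: str                               # "low" | "medium" | "high"
    requires: list[str] = field(default_factory=list)

class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, ToolSpec] = {}

    def register(self, tool: ToolSpec) -> None:
        self._tools[tool.name] = tool

    def discover(self, keyword: str) -> list[ToolSpec]:
        """Find tools whose description mentions the requested capability."""
        return [t for t in self._tools.values()
                if keyword.lower() in t.description.lower()]
```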

Skill Extensions

New skills are packages installed into the skill registry. They bring their own instructions, tool requirements, strategies, and validators. The kernel discovers them through the registry and uses them when task classification matches their domain.

Policy Extensions

New governance policies are added to the policy engine:

flowchart LR
  Act[Worker Action] --> PE[Policy Engine]
  PE --> P1["Existing Policies"]
  PE --> P2["New Policy:\nno-external-api-unencrypted"]
  P2 --> Cond{"tool.type == http_client\nAND NOT tls?"}
  Cond -->|Yes| Deny["Deny + message:\nAll external API calls must use TLS"]
  Cond -->|No| Allow[Allow]
  P1 --> Allow

Policies are additive. Adding a new policy does not require modifying existing ones. The policy engine evaluates all applicable policies for each action.

Memory Extensions

New memory backends can be added without disrupting the memory API. The system might start with a local SQLite-based episodic memory and later switch to a distributed PostgreSQL cluster. The memory API abstraction ensures no worker is affected by the change.

Evolution Strategies

Extensibility is the mechanism. Evolution strategy governs how the mechanism is used over time.

Evolutionary Architecture

The Agentic OS follows the principle of evolutionary architecture: the system’s structure supports guided, incremental change. Key practices:

  • Fitness functions: Automated checks that verify the system still meets its design goals after a change. “Latency for simple requests is under 2 seconds.” “All production actions are audit-logged.” “No worker has access to tools outside its skill definition.”
  • Sacrificial architecture: Some components are designed to be replaced. The first memory implementation is not the last. Build it to be replaced, and replacing it will be painless.
  • Evolutionary pressure: Track which components change most frequently. Frequent changes indicate either a volatile domain (expected) or poor boundaries (fix it).
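Fitness functions are most useful when they are executable. A sketch of the three examples above as checks over a metrics snapshot; the metric names are invented for illustration.

```python
# Sketch: fitness functions as named predicates over system metrics.
from typing import Callable

FitnessFn = Callable[[dict], bool]

FITNESS_FUNCTIONS: dict[str, FitnessFn] = {
    "simple-request latency < 2s": lambda m: m["p95_simple_latency_s"] < 2.0,
    "all prod actions audited":    lambda m: m["unaudited_prod_actions"] == 0,
    "no out-of-skill tool access": lambda m: m["out_of_scope_tool_calls"] == 0,
}

def run_fitness_checks(metrics: dict) -> list[str]:
    """Return the names of fitness functions that fail for this build."""
    return [name for name, fn in FITNESS_FUNCTIONS.items() if not fn(metrics)]
```

Run in CI, a non-empty result blocks the change that violated a design goal.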

Versioning Strategy

Every extension point supports versioning:

  • Skills are versioned. A task can pin to a specific version: “Use python-backend v2.3 for this project.” New versions are opt-in until proven stable.
  • Policies are versioned and timestamped. The audit log records which policy version was applied at each decision point.
  • Tools are versioned. Tool API changes are handled through version negotiation between the tool registry and workers.
  • Model providers are versioned. Model upgrades are rolled out gradually, with A/B testing against quality benchmarks.

Migration Patterns

When a component must be replaced rather than extended, the system supports migration:

  • Parallel run: The old and new components run simultaneously. Results are compared. When the new component matches or exceeds the old, traffic is shifted.
  • Shadow mode: The new component processes all requests but its results are discarded. Only the old component’s results are used. This validates the new component without risk.
  • Gradual rollout: The new component handles an increasing percentage of requests. Metrics are monitored at each step.
  • Feature flags: New capabilities are gated behind flags. They can be enabled per operator, per project, or per task type.
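Gradual rollout and feature flags often share one mechanism: a deterministic hash of the operator into a percentage bucket, so each operator gets a stable answer as the rollout widens. A minimal sketch:

```python
# Sketch of deterministic gradual rollout via hashing.
import hashlib

def in_rollout(operator_id: str, feature: str, percentage: int) -> bool:
    """Hash operator+feature into a stable 0-99 bucket. An operator enabled
    at 30% stays enabled at 60%; no one flaps between old and new."""
    digest = hashlib.sha256(f"{feature}:{operator_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percentage
```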

The Plugin Architecture

For maximum extensibility, the Agentic OS supports a plugin model where third parties can extend the system without modifying its core.

Plugin Types

  • Skill plugins: Add new domains of expertise.
  • Tool plugins: Add new integrations and capabilities.
  • Policy plugins: Add new governance rules for specific compliance regimes.
  • Interface plugins: Add new ways to interact with the system (Slack bot, email, custom UI).
  • Memory plugins: Add specialized memory backends (graph databases, time-series stores).

Plugin Safety

Plugins introduce code from outside the system’s trust boundary. Safety measures include:

  • Sandboxing: Plugins run in isolated environments. A malicious plugin cannot access the kernel’s memory or another plugin’s data.
  • Capability declaration: Plugins must declare what resources they need. Undeclared access is denied.
  • Review and audit: Plugins are reviewed before installation. Their behavior is audited during operation.
  • Revocation: Plugins can be disabled or removed at any time without affecting the core system.

Backward Compatibility

Evolution must not break existing functionality. The system maintains backward compatibility through:

  • Interface stability: Published interfaces do not change within a major version. New capabilities are added as new interfaces, not modifications of existing ones.
  • Deprecation process: Old interfaces are marked deprecated, supported for a defined period, then removed. Workers using deprecated interfaces receive warnings.
  • Compatibility layers: When an interface must change, a compatibility layer translates between old and new formats, allowing gradual migration.

The Cost of Extensibility

Extensibility is not free. Every extension point adds:

  • Indirection: The code path from request to execution passes through more abstraction layers, making debugging harder.
  • Testing surface: Each extension point multiplies the test matrix. N tools × M skills × P policies = N×M×P combinations to validate.
  • Documentation burden: Every extension point must be documented well enough for third parties to use correctly.
  • Performance overhead: Abstraction layers add latency. Registry lookups take time. Policy evaluations consume compute.

The goal is not maximum extensibility but appropriate extensibility. Extension points where change is expected (tools, skills, models) should be deeply extensible. Internal implementation details that are unlikely to change (the kernel loop, the scheduler algorithm) can be simpler and more direct.

Evolution as a First-Class Concern

In traditional systems, extensibility is an afterthought — the team adds plugin support in version 3 after years of monolithic development. In the Agentic OS, extensibility is a founding principle because the domain demands it.

The landscape of AI models, tools, and applications changes faster than any single system can be rewritten. The only viable strategy is to build a system that expects change and makes change cheap. The Agentic OS does this through standard interfaces, pluggable components, versioned extensions, and evolutionary migration strategies.

The system that survives is not the most capable at launch. It is the one that can absorb new capabilities the fastest.

Performance and Efficiency

Intelligence is expensive. Every model call costs money, every token consumes compute, every tool invocation takes time. An Agentic OS that produces excellent results but takes an hour and fifty dollars to fix a typo is not a useful system. Performance and efficiency are not optimizations to be added later — they are design constraints that shape the architecture from the start.

The Cost Model

Understanding performance in an agentic system requires understanding where resources are consumed:

Token Budget

The dominant cost in most agentic systems is language model invocations. Each call consumes input tokens (context) and output tokens (response). The cost equation:

$$ \text{Total cost} = \sum_{\text{calls}} \left( \text{input\_tokens} \times \text{input\_price} + \text{output\_tokens} \times \text{output\_price} \right) $$

For a task that involves planning (one call), decomposition (one call), five workers (two calls each, one for execution and one for checking), and consolidation (one call), you might be looking at 13+ model calls. If each call uses 10,000 input tokens and 2,000 output tokens, that is 130,000 input tokens and 26,000 output tokens. At current prices, this is dollars, not cents.
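The arithmetic can be made concrete. The per-million-token prices below are placeholders for illustration, not real pricing:

```python
# Sketch of the cost equation; prices are per million tokens and hypothetical.
def total_cost(calls: list[tuple[int, int]],
               input_price_per_mtok: float,
               output_price_per_mtok: float) -> float:
    """Sum per-call token costs over (input_tokens, output_tokens) pairs."""
    return sum(
        inp * input_price_per_mtok / 1e6 + out * output_price_per_mtok / 1e6
        for inp, out in calls
    )

calls = [(10_000, 2_000)] * 13   # the 13-call worked example above
cost = total_cost(calls, input_price_per_mtok=3.0, output_price_per_mtok=15.0)
```

At these assumed prices each call costs $0.06, so one moderately complex task lands at $0.78 before any tool or coordination overhead.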

Latency

Sequential model calls dominate latency. A model call might take 2-10 seconds. Thirteen sequential calls take 26-130 seconds. For interactive use, this is painful. For batch processing, it is acceptable.

Tool Invocation

Some tools are fast (file read: milliseconds). Some are slow (web search: seconds). Some are very slow (code execution with compilation: tens of seconds). Tool latency compounds with the number of steps.

Coordination Overhead

Every message between the kernel and a worker, every memory retrieval, every policy evaluation adds overhead. In a multi-agent system with dozens of workers, coordination overhead can exceed the cost of actual work.

Optimization Strategies

Context Window Efficiency

The most impactful optimization is using context windows efficiently. Every unnecessary token is wasted money and diluted attention.

Context pruning: Include only what the worker needs. A worker fixing a bug in one function does not need the entire codebase. It needs the function, its callers, its tests, and the error report. Aggressive pruning reduces cost and improves quality — less noise means better signal.

Summarization: For large artifacts that must be referenced, use summaries rather than full text. The codebase summary, not the full codebase. The meeting notes, not the full transcript.

Tiered context: Start with minimal context. If the worker requests more (or fails due to insufficient context), expand incrementally. This is the “lazy loading” strategy for context.

Context caching: When multiple workers need the same context (e.g., the project’s coding conventions), assemble it once and share the result. Many model providers support prompt caching that reduces cost for repeated prefixes.

Model Selection

Not every task needs the most powerful model.

  • Classification (routing requests, assessing risk) can use fast, cheap models.
  • Short generation (variable names, commit messages) can use small models.
  • Complex reasoning (planning, architecture decisions) needs capable models.
  • Embeddings (memory retrieval) use embedding-specific models at a fraction of the cost.

The model provider layer should support automatic model selection based on task requirements. The kernel specifies what it needs (reasoning depth, output length, speed priority), and the provider selects appropriately.

Parallelism

Sequential execution is the enemy of latency. When the task graph allows it, execute in parallel:

  • Independent subtasks run simultaneously. If the plan has five independent steps, all five workers execute at once. Wall-clock time: the slowest worker, not the sum of all workers.
  • Pre-fetch predictable needs. While a worker is executing, pre-assemble context for the next likely step.
  • Speculative execution. Start work on the most likely next step before the current step confirms it is needed. Discard if wrong. This trades compute cost for latency.
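The first point is the easiest to realize in code. A minimal sketch using `asyncio`, with `asyncio.sleep` standing in for a model call:

```python
# Sketch: independent subtasks run concurrently; wall-clock time tracks
# the slowest worker, not the sum of all workers.
import asyncio

async def run_worker(step: str, seconds: float) -> str:
    await asyncio.sleep(seconds)          # placeholder for a model/tool call
    return f"{step} done"

async def run_plan(steps: dict[str, float]) -> list[str]:
    """Fan out all independent steps at once and gather their results
    in submission order."""
    return await asyncio.gather(*(run_worker(s, t) for s, t in steps.items()))
```

Five workers that each take two seconds finish in roughly two seconds of wall-clock time, not ten.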

Caching

Many operations in an agentic system are repetitive:

  • The same file is read by multiple workers.
  • The same tools are invoked with similar parameters.
  • The same governance policies are evaluated against similar actions.
  • The same memory queries return similar results.

A caching layer at each boundary can eliminate redundant work:

  • Tool result cache: Cache tool outputs keyed by (tool, parameters, timestamp). Invalidate when the underlying data changes.
  • Memory cache: Cache frequent memory retrievals. Invalidate on writes.
  • Policy cache: Cache policy evaluation results for identical (action, context) pairs within a time window.
  • Model cache: Cache model responses for identical prompts. Useful for deterministic tasks like classification.
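A tool result cache from the list above can be sketched as follows; the TTL stands in for real invalidation when the underlying data changes.

```python
# Sketch of a tool result cache keyed by (tool, parameters) with a TTL.
import time
from typing import Callable

class ToolResultCache:
    def __init__(self, ttl_seconds: float = 60.0) -> None:
        self.ttl = ttl_seconds
        self._entries: dict[tuple, tuple[float, object]] = {}

    def get_or_call(self, tool_name: str, params: tuple, call: Callable[[], object]):
        """Return a cached result if fresh; otherwise invoke the tool
        and cache its output."""
        key = (tool_name, params)
        entry = self._entries.get(key)
        now = time.monotonic()
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]                   # cache hit: skip the tool call
        result = call()
        self._entries[key] = (now, result)
        return result
```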

Batching

Instead of invoking a tool once per item, batch:

  • Read ten files at once, not ten separate reads.
  • Run all tests in one invocation, not one test per invocation.
  • Evaluate five policy rules against one action in one pass, not five passes.

Batching reduces round-trip overhead and often enables more efficient processing.

Early Termination

Not every plan needs to run to completion. The kernel should detect when:

  • The goal has been achieved before all steps are complete (remaining steps were contingency paths).
  • The result is good enough and further refinement has diminishing returns.
  • The task is impossible and continuing wastes resources.

Early termination saves the most expensive resource: time.

Efficiency Metrics

You cannot optimize what you do not measure. The Agentic OS should track:

  • Tokens per task: Total input and output tokens consumed for each task type. Trend this over time.
  • Cost per task: Monetary cost broken down by model calls, tool invocations, and coordination.
  • Latency per task: Wall-clock time from request to result. Break down into planning, execution, checking, and consolidation phases.
  • Retry rate: How often steps are retried. High retry rates indicate either poor initial execution or poor error handling.
  • Context utilization: What percentage of each context window is actually relevant to the task. Low utilization means poor context assembly; the system is paying for tokens the model ignores.
  • Cache hit rate: How often caches prevent redundant work. Low hit rates mean the cache strategy needs revision.

Efficiency Dashboards

Operators should have visibility into efficiency metrics:

  • Which task types are most expensive?
  • Which skills consume the most tokens?
  • Where does latency accumulate?
  • How does cost correlate with result quality?

This visibility enables informed decisions about where to invest in optimization.

The Quality-Cost Frontier

Performance optimization in agentic systems is not about minimizing cost — it is about maximizing the quality-to-cost ratio. A system that produces mediocre results cheaply is worse than one that produces excellent results at moderate cost.

The quality-cost frontier describes the tradeoff:

quadrantChart
  title Quality-Cost Frontier
  x-axis "Low Cost" --> "High Cost"
  y-axis "Low Quality" --> "High Quality"
  quadrant-1 "Over-engineered"
  quadrant-2 "Optimal"
  quadrant-3 "Good enough"
  quadrant-4 "False economy"
  "Cheap but poor": [0.15, 0.2]
  "Good enough": [0.35, 0.5]
  "Optimal": [0.55, 0.78]
  "Over-engineered": [0.85, 0.88]

Different tasks have different optimal points on this frontier:

  • Mission-critical code changes: High quality justifies high cost. Use the best model, thorough validation, multiple review passes.
  • Routine formatting: Low cost is essential. Use the cheapest model, minimal validation, no review.
  • Exploratory research: Moderate cost, with the budget allocated to breadth (many searches) rather than depth (expensive model calls).

The kernel’s task classifier should map each task to its appropriate quality-cost target.

Resource Governance

The budget controller in the governance plane enforces resource limits:

  • Per-task budgets: Maximum tokens and cost per individual task. Prevents runaway processes.
  • Per-session budgets: Maximum spend per interaction session. Gives operators predictable costs.
  • Per-period budgets: Maximum spend per day, week, or month. Prevents budget surprises.
  • Burst limits: Allow short bursts above the per-task limit for complex operations, with a refill rate.

When a budget is exhausted at any level, the system must make an explicit decision: stop the work, request a budget increase from the operator, or proceed with a cheaper approach.

Budget exhaustion should never be silent. The system reports what it could not complete and why, giving the operator the information to decide whether to allocate more resources.
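A per-task budget with loud exhaustion can be sketched in a few lines. The class and exception names are assumptions for illustration:

```python
# Sketch of per-task budget enforcement that fails loudly, never silently.
class BudgetExhausted(Exception):
    def __init__(self, spent: float, limit: float, remaining_work: str) -> None:
        super().__init__(
            f"budget exhausted: spent ${spent:.2f} of ${limit:.2f}; "
            f"could not complete: {remaining_work}"
        )
        self.spent, self.limit, self.remaining_work = spent, limit, remaining_work

class TaskBudget:
    def __init__(self, limit_usd: float) -> None:
        self.limit = limit_usd
        self.spent = 0.0

    def charge(self, cost_usd: float, step: str) -> None:
        """Charge a step against the budget. Raising carries the undone
        step name so the operator knows exactly what was not completed."""
        if self.spent + cost_usd > self.limit:
            raise BudgetExhausted(self.spent, self.limit, step)
        self.spent += cost_usd
```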

The Efficiency Flywheel

The most powerful efficiency strategy is the learning loop. As the system operates, it accumulates knowledge that makes future operations cheaper:

  • Learned context: The system knows which files are relevant to which tasks, reducing context assembly cost.
  • Cached strategies: The system knows which approach works for which task type, reducing planning cost.
  • Calibrated models: The system knows which model is needed for which task, avoiding over-provisioning.
  • Refined policies: The system knows which actions are always safe, reducing governance overhead.

Each task the system successfully completes makes the next similar task cheaper and faster. This is the efficiency flywheel: performance improves with use, not just with engineering effort.

An Agentic OS that lacks these performance mechanisms will work — but it will work slowly and expensively. In a world where intelligence is a commodity, the differentiator is not capability but the efficiency with which capability is applied.

Coding OS

The most natural application of the Agentic OS model is software development itself. Developers already think in processes, contexts, tools, and workflows. The mental model transfers directly. This chapter walks through a Coding OS — an Agentic OS specialized for building, maintaining, and evolving software.

The Domain

Software development is uniquely suited for agentic systems because it has:

  • Formal verification: Code either compiles or it does not. Tests either pass or they do not. There is ground truth.
  • Rich tooling: Compilers, linters, test runners, debuggers, version control — a deep ecosystem of tools that can be invoked programmatically.
  • Structured artifacts: Source code, configuration files, schemas, tests — all machine-readable.
  • Clear workflows: Feature development, bug fixing, code review, deployment — well-defined processes with known steps.
  • Measurable quality: Test coverage, type safety, lint compliance, performance benchmarks — quantifiable outcomes.

These properties make software development a domain where agentic systems can operate with high autonomy and measurable results.

Architecture

The Coding OS instantiates the reference architecture with domain-specific components:

Cognitive Kernel

The kernel understands software development intents:

  • “Fix this bug” → Reproduce, diagnose, patch, test, verify.
  • “Add this feature” → Understand requirements, design, implement, test, document.
  • “Review this PR” → Read changes, check quality, verify tests, assess risk, provide feedback.
  • “Refactor this module” → Analyze dependencies, plan changes, execute incrementally, verify behavior preservation.

Each intent maps to a known decomposition strategy with domain-specific success criteria.

Process Fabric

Workers are specialized by development activity:

  • Coder: Writes and modifies source code. Has access to file operations, language servers, and code execution.
  • Tester: Writes and runs tests. Has access to test frameworks, coverage tools, and assertion libraries.
  • Reviewer: Analyzes code quality. Has access to linting tools, static analysis, and project conventions.
  • Debugger: Diagnoses failures. Has access to logs, stack traces, breakpoint tools, and runtime inspection.
  • Documenter: Writes and updates documentation. Has access to doc generators, README templates, and API references.

Each worker type has a scoped sandbox. The coder can write files but not deploy. The reviewer can read but not modify. The tester can execute code in sandboxes but not in production.

Memory Plane

The Coding OS memory plane includes:

  • Codebase map: A structural understanding of the project — modules, dependencies, entry points, hot paths. Updated on every significant change.
  • Convention memory: The project’s coding standards, patterns, and anti-patterns. Learned from existing code and explicit configuration.
  • Bug history: Past bugs, their root causes, and their fixes. Used to inform diagnosis of new bugs (“this module had a race condition last month — check for similar issues”).
  • Review history: Past review feedback and recurring issues. Used to pre-check code before it reaches a human reviewer.
  • Deployment history: Past deployments, their outcomes, and any incidents. Used to assess risk of new changes.

Governance

Coding-specific policies:

  • No direct production access: Workers cannot modify production systems. Deployment requires explicit approval.
  • Test coverage gates: New code must meet minimum test coverage thresholds.
  • Review requirements: Changes above a complexity threshold require human review before merge.
  • Dependency policies: New dependencies must be from approved sources and pass security scanning.
  • Branch protection: Workers operate on feature branches. Main branch modifications require approval.

Workflow: Feature Development

A complete feature development workflow in the Coding OS:

1. Intent Interpretation

The operator says: “Add pagination to the user list API endpoint.”

The kernel interprets:

  • Surface intent: Add pagination.
  • Operational intent: Modify the existing endpoint, not create a new one. Use cursor-based or offset-based pagination consistent with other endpoints.
  • Boundary intent: Do not change the existing response format for non-paginated requests. Maintain backward compatibility.

2. Decomposition

The kernel produces a task graph:

flowchart TD
  T1["1. Analyze existing endpoint\n(Coder)"]
  T2["2. Check pagination patterns\n(Coder)"]
  T3["3. Design pagination approach\n(Coder)"]
  T4["4. Implement query changes\n(Coder)"]
  T5["5. Implement response formatting\n(Coder)"]
  T6["6. Update API documentation\n(Documenter)"]
  T7["7. Write unit tests\n(Tester)"]
  T8["8. Write integration tests\n(Tester)"]
  T9["9. Run full test suite\n(Tester)"]
  T10["10. Code review\n(Reviewer)"]

  T1 --> T4
  T2 --> T4
  T3 --> T4
  T3 --> T5
  T3 --> T6
  T4 --> T7
  T5 --> T7
  T4 --> T8
  T5 --> T8
  T7 --> T9
  T8 --> T9
  T4 --> T10
  T5 --> T10
  T6 --> T10

Steps 1, 2 run in parallel. Steps 4, 5, 6 run in parallel after 3 completes. Steps 7, 8 run in parallel.

3. Execution

Each worker executes its step with focused context:

  • The coder analyzing the existing endpoint gets the endpoint file, the router configuration, and the database query layer.
  • The coder checking pagination patterns gets examples of pagination from other endpoints in the project.
  • The tester writing unit tests gets the implementation, the test framework patterns used in the project, and the success criteria.

4. Verification

The check phase runs at multiple levels:

  • Does the code compile? (Automated)
  • Do the new tests pass? (Automated)
  • Do all existing tests still pass? (Automated — regression check)
  • Does the implementation match the pagination patterns used elsewhere? (Reviewer)
  • Is the documentation accurate? (Reviewer)

5. Result

The operator receives a ready-to-merge branch with:

  • Implementation across the necessary files.
  • Tests with passing results.
  • Updated documentation.
  • A summary of what was done and why specific decisions were made.

Workflow: Bug Fixing

Bug fixing follows a different decomposition:

1. Reproduction

The debugger worker attempts to reproduce the bug. It reads the bug report, identifies the relevant code path, writes a failing test that demonstrates the bug, and confirms the test fails.

If reproduction fails, the system escalates: “I could not reproduce this bug. Here is what I tried. Can you provide additional context?”

2. Diagnosis

With a reliable reproduction, the debugger analyzes the failing test:

  • What is the expected behavior?
  • What is the actual behavior?
  • Where does the code path diverge from expectation?

The debugger uses the bug history memory: “A similar symptom in this module was caused by a missing null check in v2.3.”

3. Fix

The coder writes the fix, constrained by:

  • Minimality: Change as little as possible.
  • Safety: Do not introduce new failure modes.
  • Consistency: Follow existing patterns.

4. Verification

The tester verifies:

  • The failing test now passes.
  • No existing tests broke.
  • Edge cases related to the fix are covered.

5. Prevention

The system updates its memory: “Bug in user lookup caused by case-sensitive comparison. Added to convention memory: always use case-insensitive comparison for email fields.”

The IDE Integration

The Coding OS is most powerful when integrated into the developer’s IDE:

  • Context awareness: The system knows what file is open, what line the cursor is on, what errors are highlighted, what branch is checked out.
  • Inline suggestions: Instead of a separate chat, the system provides suggestions inline — fix proposals next to errors, test suggestions next to new functions.
  • Background operations: The system runs continuous checks in the background — linting, security scanning, convention compliance — and surfaces issues proactively.
  • Progressive disclosure: Simple fixes are applied with one click. Complex changes are previewed as diffs. Major refactors are presented as plans for review.

Metrics

A Coding OS should track its own effectiveness:

  • Fix success rate: What percentage of bug fixes pass review on the first attempt?
  • Feature completion rate: What percentage of features are delivered without re-work?
  • Time to resolution: How long from request to merged PR?
  • Regression rate: How often do changes introduce new bugs?
  • Cost per task: How much does it cost (in tokens, model calls, time) to complete each task type?

These metrics feed back into the system’s learning loop, improving decomposition strategies, context assembly, and model selection over time.

What Makes This an OS, Not a Tool

A code generation tool writes code. A Coding OS develops software. The difference is the full lifecycle: understanding intent, planning work, coordinating specialists, verifying quality, learning from outcomes, and adapting over time.

The tool answers: “What code should I generate?” The OS answers: “How should this software be built?”

That is the shift from tool to operating system, applied to the domain where it is most natural.

Reference Implementation

This section provides a concrete implementation of the Coding OS using the Microsoft Agent Framework (Semantic Kernel) in Python, with MCP servers for tool isolation.

Project Structure

flowchart TD
  subgraph CodingOS["coding-os/"]
    subgraph Agents["agents/"]
      KernelPy["kernel.py\nCognitive kernel"]
      CoderPy["coder.py\nCode generation"]
      TesterPy["tester.py\nTest writing"]
      ReviewerPy["reviewer.py\nCode review"]
    end
    subgraph Plugins["plugins/"]
      FS["filesystem.py\nFile operations"]
      Git["git_plugin.py\nGit operations"]
      Exec["code_exec.py\nCode execution"]
    end
    subgraph Skills["skills/"]
      PB["python_backend.py\nSkill definition"]
    end
    subgraph Governance["governance/"]
      Filters["filters.py\nPolicy enforcement"]
    end
    subgraph MCP["mcp_servers/"]
      MCPFS["filesystem/\nFile read/write"]
      MCPGit["git/\nGit operations"]
    end
    MainPy["main.py\nEntry point"]
  end

Plugins (SK Skills)

In Semantic Kernel, capabilities are exposed as plugins — classes with @kernel_function decorated methods. Each plugin is the equivalent of an operator in the Agentic OS model:

# plugins/filesystem.py
import os
from typing import Annotated
from semantic_kernel.functions import kernel_function

# Governance: restrict to project directory
ALLOWED_ROOT = os.environ.get("PROJECT_ROOT", "/workspace")

class FilesystemPlugin:
    """File operations scoped to the project directory."""

    @kernel_function(description="Read a file from the project directory.")
    def file_read(
        self, path: Annotated[str, "Relative path to the file"]
    ) -> Annotated[str, "File contents"]:
        full_path = os.path.join(ALLOWED_ROOT, path)
        # Security: prevent path traversal
        if not os.path.realpath(full_path).startswith(os.path.realpath(ALLOWED_ROOT) + os.sep):
            raise PermissionError(f"Access denied: {path} is outside project scope")
        with open(full_path, "r") as f:
            return f.read()

    @kernel_function(description="Write content to a file in the project directory.")
    def file_write(
        self,
        path: Annotated[str, "Relative path to the file"],
        content: Annotated[str, "Content to write"],
    ) -> Annotated[str, "Confirmation message"]:
        full_path = os.path.join(ALLOWED_ROOT, path)
        # Same traversal guard as file_read
        root = os.path.realpath(ALLOWED_ROOT)
        if os.path.commonpath([os.path.realpath(full_path), root]) != root:
            raise PermissionError(f"Access denied: {path} is outside project scope")
        os.makedirs(os.path.dirname(full_path), exist_ok=True)
        with open(full_path, "w") as f:
            f.write(content)
        return f"Wrote {len(content)} characters to {path}"

    @kernel_function(description="Search for files matching a glob pattern.")
    def file_search(
        self,
        pattern: Annotated[str, "Glob pattern to search for"],
    ) -> Annotated[list[str], "Matching file paths"]:
        import glob
        matches = glob.glob(
            os.path.join(ALLOWED_ROOT, pattern), recursive=True
        )
        return [os.path.relpath(m, ALLOWED_ROOT) for m in matches[:50]]
# plugins/git_plugin.py
import subprocess
from typing import Annotated
from semantic_kernel.functions import kernel_function

class GitPlugin:
    """Git operations for version control."""

    @kernel_function(description="Get the diff of current changes.")
    def git_diff(
        self, ref: Annotated[str, "Git reference to diff against"] = "HEAD"
    ) -> Annotated[str, "The diff output"]:
        result = subprocess.run(
            ["git", "diff", ref], capture_output=True, text=True, timeout=30
        )
        return result.stdout[:10000]

    @kernel_function(description="Get recent commit history.")
    def git_log(
        self, count: Annotated[int, "Number of commits"] = 10
    ) -> Annotated[str, "Commit log"]:
        result = subprocess.run(
            ["git", "log", f"-{count}", "--oneline"],
            capture_output=True, text=True, timeout=30,
        )
        return result.stdout

    @kernel_function(description="Create and checkout a new branch.")
    def git_create_branch(
        self, name: Annotated[str, "Branch name"]
    ) -> Annotated[str, "Confirmation"]:
        subprocess.run(["git", "checkout", "-b", name], check=True)
        return f"Created and checked out branch: {name}"

Agents (Subagents as ChatCompletionAgent)

Each worker in the Agentic OS maps to a ChatCompletionAgent with scoped plugins and instructions:

# agents/coder.py
from semantic_kernel.agents import ChatCompletionAgent
from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion

def create_coder_agent(service: AzureChatCompletion) -> ChatCompletionAgent:
    """Create a coder agent with file and code execution capabilities."""
    from plugins.filesystem import FilesystemPlugin
    from plugins.code_exec import CodeExecPlugin

    return ChatCompletionAgent(
        service=service,
        name="Coder",
        instructions="""You are a code implementation specialist.
You write clean, tested, production-ready code.
Follow the project's existing patterns and conventions.

Rules:
- Read existing code before modifying it
- Follow PEP 8 and use type hints
- Keep changes minimal and focused
- Never modify files outside your task scope""",
        plugins=[FilesystemPlugin(), CodeExecPlugin()],
    )
# agents/tester.py
from semantic_kernel.agents import ChatCompletionAgent

def create_tester_agent(service) -> ChatCompletionAgent:
    """Create a tester agent — can read files and run tests, not write prod code."""
    from plugins.filesystem import FilesystemPlugin
    from plugins.code_exec import CodeExecPlugin

    return ChatCompletionAgent(
        service=service,
        name="Tester",
        instructions="""You are a test specialist. You write and run tests.
Write tests using pytest. Cover happy path, edge cases, and error cases.
Never modify production code — only test files.""",
        # Capability scoping: the governance filter limits Tester to
        # file_read / file_search / code_exec (no file_write)
        plugins=[FilesystemPlugin(), CodeExecPlugin()],
    )
# agents/reviewer.py
from semantic_kernel.agents import ChatCompletionAgent

def create_reviewer_agent(service) -> ChatCompletionAgent:
    """Create a reviewer agent — read-only access, no file writes."""
    from plugins.filesystem import FilesystemPlugin
    from plugins.git_plugin import GitPlugin

    return ChatCompletionAgent(
        service=service,
        name="Reviewer",
        instructions="""You are a code review specialist.
Review code for quality, security, correctness, and style.
Identify issues by severity: critical, major, minor.
Never modify code — only review and report findings.""",
        # Capability scoping: the governance filter limits Reviewer to
        # read-only functions (file_read, file_search, git_diff, git_log)
        plugins=[FilesystemPlugin(), GitPlugin()],
    )

Cognitive Kernel: Sequential Orchestration

The kernel coordinates agents using Semantic Kernel’s orchestration patterns. For a feature workflow (code → test → review), a SequentialOrchestration maps directly to the Planner-Executor pattern:

# agents/kernel.py
from semantic_kernel.agents import SequentialOrchestration
from semantic_kernel.agents.runtime import InProcessRuntime
from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion

from agents.coder import create_coder_agent
from agents.tester import create_tester_agent
from agents.reviewer import create_reviewer_agent

async def run_feature_workflow(request: str) -> str:
    """
    Cognitive kernel: orchestrate coder → tester → reviewer
    using Semantic Kernel's SequentialOrchestration.
    """
    # Model provider layer
    service = AzureChatCompletion(
        deployment_name="gpt-4.1",
        endpoint="https://your-endpoint.openai.azure.com/",
    )

    # Create specialized agents (subagents with scoped capabilities)
    coder = create_coder_agent(service)
    tester = create_tester_agent(service)
    reviewer = create_reviewer_agent(service)

    # Sequential orchestration: Coder → Tester → Reviewer
    # Maps to the Pipeline coordination pattern (Ch. 23)
    orchestration = SequentialOrchestration(
        members=[coder, tester, reviewer],
    )

    # Start the in-process runtime
    runtime = InProcessRuntime()
    await runtime.start()

    # Invoke the orchestration with the user's request
    result = await orchestration.invoke(
        task=request,
        runtime=runtime,
    )
    output = await result.get()

    await runtime.stop_when_idle()
    return output


# For more complex tasks, use GroupChatOrchestration for adversarial patterns
async def run_review_cycle(request: str) -> str:
    """
    Adversarial pattern: Coder and Reviewer iterate until quality is met.
    Maps to GroupChatOrchestration (Ch. 23 Adversarial topology).
    """
    from semantic_kernel.agents import (
        GroupChatOrchestration,
        RoundRobinGroupChatManager,
    )

    service = AzureChatCompletion(
        deployment_name="gpt-4.1",
        endpoint="https://your-endpoint.openai.azure.com/",
    )
    coder = create_coder_agent(service)
    reviewer = create_reviewer_agent(service)

    # The manager controls turn-taking and termination; a round-robin
    # manager alternates coder and reviewer, capped at max_rounds
    orchestration = GroupChatOrchestration(
        members=[coder, reviewer],
        manager=RoundRobinGroupChatManager(max_rounds=6),
    )

    runtime = InProcessRuntime()
    await runtime.start()
    result = await orchestration.invoke(task=request, runtime=runtime)
    output = await result.get()
    await runtime.stop_when_idle()
    return output

Governance: SK Function Filters

Semantic Kernel provides filters — interceptors that wrap function calls. This is the natural implementation of the Governance Plane’s middleware pattern:

# governance/filters.py
import time
import logging
from semantic_kernel.filters import FunctionInvocationContext
from semantic_kernel import Kernel

logger = logging.getLogger("governance")

# Capability scoping per agent
AGENT_CAPABILITIES = {
    "Coder": {"file_read", "file_write", "file_search", "code_exec",
              "git_diff", "git_create_branch"},
    "Tester": {"file_read", "file_search", "code_exec"},
    "Reviewer": {"file_read", "file_search", "git_diff", "git_log"},
}

class GovernanceFilter:
    """
    SK Function Filter that enforces governance policies.
    Maps to the Permission Gate and Auditable Action patterns.
    """

    def __init__(self):
        self.audit_log = []
        self.budget_remaining = 50000  # tokens

    async def on_function_invocation(
        self, context: FunctionInvocationContext, next
    ):
        function_name = context.function.name
        agent_name = context.arguments.get("agent_name", "unknown")

        # Pre-action: Capability check (agents without a registered
        # capability set pass through unchecked)
        allowed = AGENT_CAPABILITIES.get(agent_name, set())
        if allowed and function_name not in allowed:
            logger.warning(f"DENIED: {agent_name} cannot use {function_name}")
            context.result = "Permission denied: insufficient capabilities"
            return

        # Pre-action: Budget check
        if self.budget_remaining <= 0:
            logger.warning("DENIED: Budget exhausted")
            context.result = "Budget exhausted"
            return

        # Execute the function
        start = time.time()
        await next(context)
        elapsed = time.time() - start

        # Post-action: Audit logging
        self.audit_log.append({
            "timestamp": time.time(),
            "agent": agent_name,
            "function": function_name,
            "elapsed_ms": int(elapsed * 1000),
        })
        logger.info(f"AUDIT: {agent_name} → {function_name} ({elapsed:.2f}s)")


def apply_governance(kernel: Kernel):
    """Register governance filters on a kernel instance."""
    gov = GovernanceFilter()
    kernel.add_filter("function_invocation", gov.on_function_invocation)
    return gov

Skill Definition

Skills package domain knowledge as reusable configurations:

# skills/python_backend.py
"""
A Skill in the Agentic OS is a bundle of instructions, plugin selections,
and strategies. In Semantic Kernel, this maps to a combination of:
- Agent instructions (system prompt)
- Plugin selection (which kernel_functions are available)
- Prompt templates (reusable patterns)
"""

PYTHON_BACKEND_SKILL = {
    "name": "python-backend",
    "description": "Develop Python backend services",
    "agent_instructions": """
Follow PEP 8. Use type hints on all public functions.
Write tests for all public functions using pytest.
Prefer composition over inheritance.
Handle errors explicitly — no bare except clauses.
Use Pydantic for data validation at API boundaries.
    """,
    "plugins": ["FilesystemPlugin", "CodeExecPlugin", "GitPlugin"],
    "strategies": {
        "new_endpoint": [
            "Read existing router to understand patterns",
            "Define Pydantic request/response models",
            "Implement the route handler",
            "Add input validation",
            "Write unit tests",
            "Update API documentation",
        ],
        "fix_bug": [
            "Read the bug report and identify the relevant code",
            "Write a failing test that reproduces the bug",
            "Fix the code to make the test pass",
            "Run the full test suite to check for regressions",
        ],
    },
}

Running It

# main.py
import asyncio
from agents.kernel import run_feature_workflow

async def main():
    result = await run_feature_workflow(
        "Add pagination to the user list API endpoint"
    )
    print(result)

if __name__ == "__main__":
    asyncio.run(main())

MCP Integration

MCP servers provide tool isolation. Connect agents to MCP tools via Semantic Kernel’s MCP integration:

{
  "mcpServers": {
    "filesystem": {
      "command": "python",
      "args": ["mcp_servers/filesystem/server.py"],
      "env": { "PROJECT_ROOT": "/workspace/my-project" }
    },
    "git": {
      "command": "python",
      "args": ["mcp_servers/git/server.py"]
    }
  }
}
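Loading this configuration at startup is mechanical. Below is a minimal sketch that parses the `mcpServers` block into per-server launch specs; the `MCPStdioPlugin` name mentioned in the trailing comment refers to Semantic Kernel's MCP connector and is an assumption about your SK version:

```python
import json

def load_mcp_server_specs(config_text: str) -> list[dict]:
    """Parse an mcpServers config block into per-server launch specs."""
    servers = json.loads(config_text)["mcpServers"]
    return [
        {
            "name": name,
            "command": spec["command"],
            "args": spec.get("args", []),
            "env": spec.get("env", {}),
        }
        for name, spec in servers.items()
    ]

# Each spec can then construct an MCP plugin (assumed SK API):
#   plugin = MCPStdioPlugin(name=spec["name"], command=spec["command"],
#                           args=spec["args"], env=spec["env"])
#   agent = ChatCompletionAgent(..., plugins=[plugin])
```

Keeping the server definitions in config rather than code means the same agents can be pointed at different MCP servers per project without redeployment.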

This implementation demonstrates the core patterns: the kernel (SK orchestration with Sequential and GroupChat patterns), workers (ChatCompletionAgent subagents with scoped plugins), plugins (@kernel_function decorated methods as operators), skills (packaged instructions and strategies), and governance (SK function filters for capability checks, budget enforcement, and audit logging).


Try it yourself: The complete Coding OS — agents, skills, instructions, MCP config, and a sample To-Do API project with a hands-on tutorial — is available at implementations/coding-os/. Copy the .github/ folder into your project and start using @coder, @tester, @reviewer, /fix-bug, and /new-feature immediately.

Research OS

Software development has compilers to verify output. Research has no compiler. The artifacts are documents, analyses, and recommendations — things that require judgment to evaluate. This makes a Research OS fundamentally different from a Coding OS, and the differences reveal important truths about the Agentic OS model.

The Domain

Research — whether academic, market, competitive, or technical — involves:

  • Open-ended exploration: The answer is not known in advance. The process of searching shapes the question.
  • Source evaluation: Not all information is equal. Sources have varying credibility, recency, and relevance.
  • Synthesis: The value is not in individual facts but in the connections between them.
  • Argumentation: Research produces claims supported by evidence. The structure of the argument matters as much as the content.
  • Uncertainty: Research deals in probabilities, not certainties. “The evidence suggests…” not “The answer is…”

These properties demand different architectural choices than the Coding OS, while still fitting within the same Agentic OS framework.

Architecture

Cognitive Kernel

The Research OS kernel handles intents like:

  • “What are the leading approaches to X?” → Survey, compare, synthesize.
  • “Is Y a good strategy for our situation?” → Analyze Y, assess fit, identify risks.
  • “Summarize the state of the art in Z.” → Comprehensive literature review with structured synthesis.
  • “Find evidence for or against claim W.” → Balanced investigation with source evaluation.

The kernel classifies research tasks by:

  • Breadth: How many areas need to be covered?
  • Depth: How deeply must each area be examined?
  • Stance: Neutral survey, argument construction, or critical analysis?
  • Time sensitivity: Is this about current state, historical trends, or future predictions?

Process Fabric

Research workers are specialized differently:

  • Scout: Performs broad search across sources. Identifies potentially relevant documents, papers, articles, datasets. Optimized for recall — find everything that might be relevant.
  • Analyst: Deep-reads individual sources. Extracts key claims, evidence, methodology, limitations. Optimized for precision — miss nothing important in a single document.
  • Synthesizer: Combines findings from multiple analysts. Identifies patterns, contradictions, gaps. Produces structured summaries.
  • Critic: Evaluates the quality of evidence. Checks for bias, methodological flaws, outdated information, missing perspectives.
  • Writer: Produces the final research artifact — report, brief, memo, or presentation — incorporating all findings with proper attribution.

Memory Plane

Research memory has domain-specific tiers:

  • Source registry: Every source encountered, with metadata: URL, author, date, credibility assessment, key claims extracted. Prevents re-reading the same source and enables proper citation.
  • Claim graph: A network of claims and their supporting evidence. Claim A is supported by sources 1, 2, 3. Claim B contradicts claim A based on source 4. This graph is the intellectual product of the research process.
  • Method memory: Which research strategies worked for which types of questions. “For competitive analysis, start with industry reports, then company filings, then expert opinions.”
  • Knowledge base: Accumulated domain knowledge from past research. Concepts, relationships, definitions that do not need to be re-discovered.
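The claim graph described above can be sketched as a small data structure. This is a hypothetical shape for illustration, not a canonical schema:

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    """A claim linked to the source IDs that support or contradict it."""
    text: str
    supported_by: set[str] = field(default_factory=set)
    contradicted_by: set[str] = field(default_factory=set)

class ClaimGraph:
    """Network of claims and evidence: the intellectual product of research."""
    def __init__(self):
        self.claims: dict[str, Claim] = {}

    def add_support(self, claim_id: str, text: str, source_id: str):
        self.claims.setdefault(claim_id, Claim(text)).supported_by.add(source_id)

    def add_contradiction(self, claim_id: str, text: str, source_id: str):
        self.claims.setdefault(claim_id, Claim(text)).contradicted_by.add(source_id)

    def contested(self) -> list[str]:
        """Claim IDs with both supporting and contradicting evidence."""
        return [
            cid for cid, c in self.claims.items()
            if c.supported_by and c.contradicted_by
        ]
```

The `contested()` query is what makes the graph more than a citation list: contradictions surface automatically instead of being lost in prose summaries.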

Governance

Research-specific policies:

  • Source credibility thresholds: Claims based on unverified sources must be flagged. Peer-reviewed sources have higher weight than blog posts.
  • Bias detection: If the research draws heavily from one perspective, the system flags the imbalance and seeks counterpoints.
  • Citation requirements: Every factual claim in the output must link to a source. Unsourced claims are flagged.
  • Recency policies: For time-sensitive topics, sources older than a threshold are flagged or excluded.
  • Hallucination guard: The system must distinguish between information retrieved from sources and information generated by the model. Generated inferences are marked as such.

Workflow: Competitive Analysis

1. Intent Interpretation

“Analyze our top three competitors’ pricing strategies and recommend how we should position.”

The kernel interprets:

  • Who are the top three competitors? (Check memory; if unknown, ask the operator.)
  • What aspects of pricing? (Tiers, discount strategies, freemium models, enterprise pricing.)
  • What is “positioning”? (Where to price relative to competitors, not the exact price point.)
  • Implicit: Use current data. Consider our strengths and weaknesses. Be actionable.

2. Decomposition

flowchart TD
  T1["1. Identify competitors' pricing pages\n(Scout)"]
  T2["2. Search pricing press releases\n(Scout)"]
  T3["3. Search analyst reports\n(Scout)"]
  T4["4. Deep-analyze Competitor A\n(Analyst)"]
  T5["5. Deep-analyze Competitor B\n(Analyst)"]
  T6["6. Deep-analyze Competitor C\n(Analyst)"]
  T7["7. Synthesize pricing patterns\n(Synthesizer)"]
  T8["8. Evaluate evidence quality\n(Critic)"]
  T9["9. Develop recommendations\n(Synthesizer)"]
  T10["10. Write final report\n(Writer)"]

  T1 --> T4
  T2 --> T4
  T3 --> T4
  T1 --> T5
  T2 --> T5
  T3 --> T5
  T1 --> T6
  T2 --> T6
  T3 --> T6
  T4 --> T7
  T5 --> T7
  T6 --> T7
  T7 --> T8
  T7 --> T9
  T8 --> T9
  T9 --> T10

3. The Scout Phase

Scouts cast a wide net. They search multiple sources — company websites, news articles, industry reports, social media discussions, review platforms. Each result is logged in the source registry with metadata.

The critical discipline: scouts do not evaluate. They collect. Evaluation is the analyst’s job. This separation prevents premature filtering — a source that looks irrelevant to a scout might contain a crucial data point that an analyst would catch.

4. The Analysis Phase

Analysts read each source carefully and extract structured information:

flowchart TD
  subgraph Extraction["Source Analysis"]
    direction TB
    Src["Source: Competitor A Pricing Page\n(accessed 2026-04-01)"]
    subgraph Claims
      C1["Three tiers: Free, Pro $29/mo, Enterprise custom"]
      C2["Free tier limited to 3 users"]
      C3["Pro includes API access"]
      C4["Enterprise requires annual commitment"]
    end
    Conf["Confidence: High\n(primary source)"]
    Lim["Limitations: Does not show\nnegotiated enterprise pricing"]
  end
  Analyst[Analyst Agent] -->|reads & extracts| Extraction
  Extraction -->|structured output| Synth[Synthesizer]

Each analyst works independently with focused context — they see only the sources relevant to their assigned competitor.

5. Synthesis

The synthesizer receives all analyst outputs and produces a comparative view:

  • Where do pricing models converge? (All three have freemium tiers.)
  • Where do they diverge? (Only Competitor B offers monthly enterprise billing.)
  • What patterns emerge? (The market is moving toward usage-based pricing.)
  • What gaps exist? (No competitor publicly prices their data API.)

6. The Critic’s Role

The critic checks the synthesis against the evidence:

  • Is the claim “the market is moving toward usage-based pricing” supported? (Supported by Competitor B’s recent change and two analyst reports. Contradicted by Competitor C’s fixed pricing.)
  • Are any conclusions based on weak sources? (The blog post about Competitor A’s discounting strategy is from an anonymous author — flag as low confidence.)
  • Are alternative interpretations considered? (The freemium convergence might be survivorship bias — failed competitors without free tiers are not in the data.)

7. Output

The writer produces a structured report:

  • Executive summary with key findings.
  • Detailed comparative analysis with evidence links.
  • Confidence assessments for each major claim.
  • Positioning recommendations with supporting rationale.
  • Gaps and limitations section — what the research could not determine.

The Hallucination Problem

Research is the domain where hallucination is most dangerous. A coding error produces a compile failure. A research hallucination produces a plausible-sounding falsehood that might inform real decisions.

The Research OS addresses this through:

  • Source grounding: Every claim must trace to a retrieved source. Claims that the model generates without source support are labeled as inferences, not findings.
  • Confidence scoring: Each claim carries a confidence score based on source quality and corroboration. A claim supported by three independent credible sources scores higher than one supported by a single blog post.
  • Explicit uncertainty: The system uses calibrated language. “The evidence strongly suggests…” vs. “One source indicates…” vs. “No evidence was found for…”
  • Verification loops: Key claims are verified through independent searches. If the system cannot find corroborating sources, the claim is downgraded or flagged.
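The confidence-scoring rule above can be sketched as a small function. The thresholds and credibility labels here are illustrative assumptions, not prescribed values:

```python
def confidence_score(claim: dict) -> str:
    """Score a claim by count and quality of its supporting sources.

    Illustrative thresholds: 3+ credible sources -> "strong";
    2+ sources -> "moderate"; one source -> "weak";
    no retrieved support -> "inference" (model-generated).
    """
    sources = claim.get("sources", [])
    credible = [
        s for s in sources
        if s.get("credibility") in ("high", "peer-reviewed")
    ]
    if not sources:
        return "inference"
    if len(credible) >= 3:
        return "strong"
    if len(sources) >= 2:
        return "moderate"
    return "weak"
```

The important design point is the `"inference"` label: an unsourced claim is not rejected, but it is never allowed to masquerade as a finding.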

What Makes This an OS, Not a Search Engine

A search engine returns links. A research assistant summarizes pages. A Research OS conducts research: it formulates search strategies, evaluates evidence, builds arguments, identifies gaps, and produces structured knowledge.

The OS abstraction matters because research is a process, not a query. It has phases (scout, analyze, synthesize, critique), requires memory (source registry, claim graph), and benefits from governance (citation requirements, bias detection). These are not features bolted onto a search box — they are structural properties of a system designed for research.

Reference Implementation

The Research OS uses the Microsoft Agent Framework (Semantic Kernel) with specialized agents for each research phase and MCP servers for web access.

Plugins: Web Research

# plugins/web_research.py
import json
import os
from typing import Annotated

import httpx
from semantic_kernel.functions import kernel_function

TAVILY_API_KEY = os.environ.get("TAVILY_API_KEY", "")

class WebResearchPlugin:
    """Web search and content extraction for research agents."""

    @kernel_function(description="Search the web for a query.")
    async def web_search(
        self,
        query: Annotated[str, "Search query"],
        max_results: Annotated[int, "Max results to return"] = 10,
    ) -> Annotated[str, "Search results as JSON"]:
        async with httpx.AsyncClient() as client:
            response = await client.get(
                "https://api.tavily.com/search",
                params={"query": query, "max_results": max_results},
                headers={"Authorization": f"Bearer {TAVILY_API_KEY}"},
            )
            results = response.json()["results"]
            return json.dumps([
                {"url": r["url"], "title": r["title"], "snippet": r["content"]}
                for r in results
            ])

    @kernel_function(description="Fetch and extract content from a web page.")
    async def fetch_page(
        self, url: Annotated[str, "URL to fetch"]
    ) -> Annotated[str, "Extracted page content"]:
        from readability import Document
        async with httpx.AsyncClient(follow_redirects=True, timeout=15) as client:
            response = await client.get(url)
            doc = Document(response.text)
            return json.dumps({
                "url": url, "title": doc.title(),
                "content": doc.summary()[:10000],
            })

Agents: Research Specialists

# agents/research_agents.py
from semantic_kernel.agents import ChatCompletionAgent
from plugins.web_research import WebResearchPlugin

def create_scout_agent(service) -> ChatCompletionAgent:
    return ChatCompletionAgent(
        service=service,
        name="Scout",
        instructions="""You are a research scout. Search broadly for sources.
Return structured results. Do NOT evaluate or analyze — just collect.
Cover multiple source types: academic, industry, news, official.""",
        plugins=[WebResearchPlugin()],
    )

def create_analyst_agent(service) -> ChatCompletionAgent:
    return ChatCompletionAgent(
        service=service,
        name="Analyst",
        instructions="""You are a research analyst. For each source provided:
1. Extract key claims with exact quotes
2. Assess evidence quality (methodology, data, credibility)
3. Note limitations and potential biases
4. Rate confidence: high, medium, low
Be precise. Cite specific passages.""",
    )

def create_synthesizer_agent(service) -> ChatCompletionAgent:
    return ChatCompletionAgent(
        service=service,
        name="Synthesizer",
        instructions="""You are a research synthesizer. Combine analyst findings:
1. Identify patterns across sources
2. Flag contradictions with evidence from both sides
3. Note gaps where evidence is missing
4. Score confidence: strong (3+ sources), moderate (1-2), weak (single)
Every claim must cite its source.""",
    )

def create_critic_agent(service) -> ChatCompletionAgent:
    return ChatCompletionAgent(
        service=service,
        name="Critic",
        instructions="""You are a research critic. Review the synthesis for:
1. Claims based on weak or single sources
2. Underrepresented perspectives
3. Logical gaps or unsupported inferences
4. Source selection bias
Flag each issue with severity (critical, major, minor).""",
    )

Kernel: Sequential Research Pipeline

# agents/kernel.py
from semantic_kernel.agents import SequentialOrchestration
from semantic_kernel.agents.runtime import InProcessRuntime
from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion

from agents.research_agents import (
    create_scout_agent,
    create_analyst_agent,
    create_synthesizer_agent,
    create_critic_agent,
)

async def run_research(query: str) -> str:
    """Scout → Analyst → Synthesizer → Critic pipeline."""
    service = AzureChatCompletion(
        deployment_name="gpt-4.1",
        endpoint="https://your-endpoint.openai.azure.com/",
    )

    scout = create_scout_agent(service)
    analyst = create_analyst_agent(service)
    synthesizer = create_synthesizer_agent(service)
    critic = create_critic_agent(service)

    orchestration = SequentialOrchestration(
        members=[scout, analyst, synthesizer, critic],
    )

    runtime = InProcessRuntime()
    await runtime.start()

    result = await orchestration.invoke(task=query, runtime=runtime)
    output = await result.get()

    await runtime.stop_when_idle()
    return output

Governance: Citation Verification

# governance/research_policies.py
import re
from semantic_kernel.filters import FunctionInvocationContext

def find_unsourced_claims(text: str) -> list[str]:
    """Placeholder heuristic: flag substantive sentences with no
    [citation] or URL marker. A production system would check
    structured claim/citation output instead of raw prose."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [
        s for s in sentences
        if len(s) > 40 and not re.search(r"\[[^\]]+\]|https?://", s)
    ]

async def citation_filter(context: FunctionInvocationContext, next):
    """Post-action filter: verify citations in synthesizer output."""
    await next(context)

    # Only check synthesizer and critic outputs
    agent_name = context.arguments.get("agent_name", "")
    if agent_name in ("Synthesizer", "Critic"):
        result = str(context.result)
        unsourced = find_unsourced_claims(result)
        if unsourced:
            context.result = (
                result + "\n\n⚠️ GOVERNANCE WARNING: "
                + f"{len(unsourced)} claims lack source citations."
            )

This implementation shows the research-specific patterns: sequential orchestration (scout → analyst → synthesizer → critic), web research plugin (search + page extraction as @kernel_function), citation governance (SK filter that checks outputs), and role separation (scout collects, analyst evaluates, synthesizer connects, critic challenges).


Try it yourself: The complete Research OS — agents (@scout, @analyst, @synthesizer, @critic), skills (/competitive-analysis, /literature-review), citation standards, and tutorial — is available at implementations/research-os/.

Support OS

Support is a domain defined by urgency, repetition, and empathy. Customers have problems they need solved now. Many of those problems are variations of the same underlying issues. And every interaction happens against the backdrop of a relationship — a customer’s history, their frustration level, their value to the business.

A Support OS turns these characteristics from challenges into advantages.

The Domain

Customer support has properties that map well to the Agentic OS model:

  • High volume, high repetition: A significant percentage of support requests are variations of known issues. An OS with memory can recognize and resolve them faster each time.
  • Clear resolution criteria: A support case is open or closed. The customer’s problem is solved or it is not. This gives the system unambiguous feedback.
  • Rich context: Customer data, product state, account history, past interactions, known issues, documentation — there is abundant context to inform resolution.
  • Escalation hierarchy: Simple issues are resolved by frontline support. Complex issues escalate to specialists. Critical issues escalate to engineering. This maps naturally to the staged autonomy model.
  • Time sensitivity: Support latency directly impacts customer satisfaction. Speed matters, but accuracy matters more — a fast wrong answer is worse than a slightly slower correct one.

Architecture

Cognitive Kernel

The Support OS kernel handles intents like:

  • “I can’t log in” → Authentication troubleshooting workflow.
  • “My data is missing” → Data integrity investigation.
  • “How do I configure X?” → Documentation retrieval and guided walkthrough.
  • “Your service is down” → Incident correlation and status communication.
  • “I want a refund” → Policy evaluation and fulfillment.

The kernel classifies support requests by:

  • Category: Account, billing, technical, feature request, complaint.
  • Urgency: Service down (critical), functionality broken (high), inconvenience (medium), question (low).
  • Complexity: Known issue with known fix (simple), known issue with variable fix (moderate), unknown issue (complex).
  • Sentiment: Frustrated, neutral, satisfied. This affects communication style, not resolution strategy.
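The four classification axes can be captured as a structured triage result that drives routing. A minimal sketch; the field values and routing rules are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class TriageResult:
    """Structured output of the triage step, one field per axis."""
    category: str    # account | billing | technical | feature_request | complaint
    urgency: str     # critical | high | medium | low
    complexity: str  # simple | moderate | complex
    sentiment: str   # frustrated | neutral | satisfied

def route(triage: TriageResult) -> str:
    """Map a triage result to a worker (illustrative routing rules)."""
    if triage.complexity == "complex":
        return "Investigator"   # unknown issues need diagnosis first
    if triage.urgency == "critical":
        return "Escalator"      # service-down cases bypass self-service
    return "Resolver"           # known issues get the known-fix path
```

Note that sentiment deliberately does not appear in `route()`: per the classification above, it shapes how the Communicator phrases the response, not which worker handles it.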

Process Fabric

Support workers:

  • Triage Agent: Classifies the request, checks for known issues, gathers initial context. Fast — completes in seconds.
  • Resolver: Applies known solutions to known problems. Has access to the knowledge base, troubleshooting scripts, and account tools.
  • Investigator: Diagnoses unknown problems. Has access to logs, system state, and diagnostic tools. Operates with more autonomy and time.
  • Communicator: Crafts customer-facing responses. Adapts tone and detail level to the customer’s context and sentiment.
  • Escalator: When a case exceeds the system’s capability, prepares the escalation package — summary, investigation results, customer context — for a human agent.

Memory Plane

Support-specific memory:

  • Customer memory: Each customer’s history — past issues, resolutions, preferences, sentiment trends. “This customer had a billing issue last month that took three interactions to resolve. Handle with extra care.”
  • Issue knowledge base: Known issues, their symptoms, root causes, and resolutions. Indexed by symptoms for fast matching. “Error code 4012 → API rate limit exceeded → suggest upgrading plan or implementing backoff.”
  • Resolution patterns: What worked and what did not. “For login issues on mobile, resetting the session token resolves 78% of cases. Password reset resolves another 15%.”
  • Product state: Current system status, known outages, recent deployments, feature flags. “The payment service was deployed 2 hours ago — check for related issues.”

Governance

Support-specific policies:

  • Data access scoping: Workers can view customer data relevant to the issue but cannot export it, share it across cases, or retain it beyond the interaction.
  • Action limits: The system can reset sessions and resend verification emails autonomously. It cannot modify billing, issue refunds above a threshold, or access personal data without specific authorization.
  • Escalation triggers: Cases exceeding N minutes without resolution auto-escalate. Cases involving data loss auto-escalate. Cases from enterprise accounts get priority routing.
  • Tone policies: Responses must be professional, empathetic, and solution-oriented. No blame, no jargon, and no language that overpromises outcomes (“this will definitely fix it”).
  • Privacy compliance: All interactions are GDPR/CCPA-compliant. Customer data is not used for training. Conversations are retained per retention policy.
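The action-limit policy above can be encoded as an allowlist plus a threshold check, evaluated before any tool call executes. A sketch — the action names and the refund threshold are illustrative:

```python
# Actions the system may take without human sign-off (illustrative set).
AUTONOMOUS_ACTIONS = {"reset_session", "resend_verification_email"}
REFUND_AUTONOMY_LIMIT = 50.0  # illustrative threshold, in account currency

def requires_human_approval(action: str, amount: float = 0.0) -> bool:
    """Return True when governance demands a human sign-off."""
    if action in AUTONOMOUS_ACTIONS:
        return False
    if action == "issue_refund":
        return amount > REFUND_AUTONOMY_LIMIT
    # Default-deny: any action not explicitly allowed escalates.
    return True
```

The default-deny branch is the important design choice: new tools are governed from the moment they appear, rather than being autonomous until someone remembers to restrict them.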

Workflow: Technical Support Case

flowchart TD
  I[1. Intake\nCustomer submits issue] --> T[2. Triage\nClassify, check context,\nknown issues, system state]
  T -->|match found| R[3. Resolution Attempt\nApply known fix]
  T -->|no match| INV[Investigate\nUnknown issue workflow]
  R --> V[4. Verification\nConfirm fix worked]
  V -->|success| C[5. Communication\nCraft customer response]
  V -->|failed| INV
  C --> L[6. Learning\nUpdate memory & patterns]

1. Intake

A customer submits: “My dashboard has been showing ‘loading’ for the past hour. I’ve tried refreshing and clearing cache.”

2. Triage

The triage agent processes in parallel:

  • Classify: Technical issue, high urgency, moderate complexity.
  • Customer context: Enterprise account, 2 years, 3 prior tickets (all resolved), no current sentiment flags.
  • Known issue check: Search the knowledge base for “dashboard loading” symptoms. Match found: Known issue #847 — dashboard timeout when data exceeds 10K rows, deployed fix in v3.2.1, but some accounts may need cache invalidation.
  • System state check: No current outages. Dashboard service healthy. But: a data migration ran 3 hours ago that may have affected row counts.
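The four triage lookups are independent, so they can genuinely run in parallel. A sketch using `asyncio.gather`, with stub coroutines standing in for the real classifier, CRM, knowledge-base, and status-page calls:

```python
import asyncio

async def classify(ticket: str) -> dict:
    return {"category": "technical", "urgency": "high"}   # stub classifier

async def customer_context(ticket: str) -> dict:
    return {"plan": "enterprise", "prior_tickets": 3}     # stub CRM lookup

async def known_issue_check(ticket: str) -> dict:
    return {"match": "issue-847", "confidence": 0.9}      # stub KB search

async def system_state(ticket: str) -> dict:
    return {"outages": [], "recent_migration": True}      # stub status check

async def triage(ticket: str) -> dict:
    # All four lookups run concurrently; total latency is the slowest one.
    cls, ctx, kb, state = await asyncio.gather(
        classify(ticket), customer_context(ticket),
        known_issue_check(ticket), system_state(ticket),
    )
    return {"classification": cls, "customer": ctx,
            "known_issue": kb, "system": state}

result = asyncio.run(triage("dashboard stuck on loading"))
```

This is why the triage agent "completes in seconds": its latency is bounded by the slowest lookup, not the sum of all four.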

3. Resolution Attempt

The resolver has a match with high confidence:

  • Known issue #847 applies.
  • The customer’s account has 12K data rows (confirmed via account tools).
  • Cache invalidation is the likely fix.

The resolver executes the cache invalidation (Level 0 autonomy — this is a safe, reversible operation) and prepares a verification step.
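The "Level 0 — safe, reversible" check can be made explicit by tagging each operation with an autonomy level and gating execution on it. A sketch; the level assignments are illustrative:

```python
# Autonomy levels (illustrative scale): 0 = safe and reversible,
# executed freely; 3 = requires human engineering approval.
ACTION_LEVELS = {
    "invalidate_cache": 0,
    "reset_session": 0,
    "run_migration": 3,
}

MAX_AUTONOMOUS_LEVEL = 0

def execute(action: str, run) -> str:
    """Run the action if its level permits autonomy; otherwise escalate."""
    level = ACTION_LEVELS.get(action, 3)  # unknown actions treated as sensitive
    if level > MAX_AUTONOMOUS_LEVEL:
        return f"escalate: {action} is level {level}"
    return run()

outcome = execute("invalidate_cache", lambda: "cache invalidated")
```

The same gate reappears later in the unknown-issue workflow: running the shard migration is a Level 3 action, so the same check routes it to a human instead of executing it.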

4. Verification

The resolver checks: Is the dashboard now loading? It queries the customer’s session state. Dashboard loaded successfully in 2.3 seconds.

5. Communication

The communicator crafts the response:

“Hi [Name], I’ve identified and resolved the issue with your dashboard. It was caused by a data processing update that affected the dashboard cache for accounts with large datasets like yours. I’ve cleared the affected cache, and your dashboard should now load normally.

If you experience this again, please let us know — we’re monitoring to ensure it doesn’t recur.

Is there anything else I can help with?”

Note what happens here: the communicator does not say “we ran cache invalidation on your account” (too technical). It does not say “it was a known bug” (undermines confidence). It gives enough explanation to be satisfying, frames the resolution positively, and invites continued engagement.

6. Learning

The system updates its memory:

  • Customer memory: “Dashboard loading issue resolved via cache invalidation. Related to data migration. Time to resolution: 47 seconds.”
  • Resolution patterns: Known issue #847 resolution success rate updated.
  • Product state: “Data migration affected N accounts with large datasets. Cache invalidation required.”

Workflow: Unknown Issue

When the triage agent finds no known issue match, the workflow shifts:

1. Investigation

The investigator gets the case with a broader toolkit:

  • Access to application logs for the customer’s account.
  • Access to system metrics around the reported time.
  • Access to recent deployment history.
  • Access to similar past cases (even if they are not exact matches).

The investigator forms hypotheses and tests them:

  1. Check logs for errors → Found: timeout on database query at 14:23.
  2. Check database performance → Found: slow query on the analytics table.
  3. Check recent schema changes → Found: missing index added in last migration but not applied to this shard.

2. Escalation Decision

The investigator has identified the root cause (missing index), but applying the fix (running the migration on the affected shard) exceeds the system’s autonomy level. This is a Level 3 action — it requires human engineering approval.

3. Escalation Package

The escalator prepares a handoff for the engineering team:

  • Summary: Customer dashboard timeout caused by missing database index on shard 7.
  • Evidence: Log timestamps, slow query plan, migration history showing shard 7 was skipped.
  • Proposed fix: Run pending migration on shard 7. Estimated impact: 2 minutes of read-only mode on the shard.
  • Customer context: Enterprise account, high priority. Customer informed that engineering is investigating.
  • Suggested customer response: Draft communication ready for review.

The human engineer gets everything needed to act — diagnosis, evidence, proposed fix, and customer context — in one package. Their job is to verify and approve, not to re-investigate.
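The escalation package can be represented as a single structured handoff object. A sketch — the fields mirror the list above, and the shard and migration details come from this chapter's example:

```python
from dataclasses import dataclass, asdict

@dataclass
class EscalationPackage:
    """Everything a human needs to verify and approve, in one object."""
    summary: str
    evidence: list[str]
    proposed_fix: str
    estimated_impact: str
    customer_context: dict
    draft_response: str = ""

pkg = EscalationPackage(
    summary="Dashboard timeout caused by missing index on shard 7",
    evidence=[
        "timeout on database query in logs at 14:23",
        "slow query plan on the analytics table",
        "migration history shows shard 7 was skipped",
    ],
    proposed_fix="Run pending migration on shard 7",
    estimated_impact="~2 minutes read-only mode on the shard",
    customer_context={"tier": "enterprise", "priority": "high"},
    draft_response="Engineering is applying a fix; we'll confirm shortly.",
)
handoff = asdict(pkg)  # serialize for the ticketing or paging system
```

A typed package enforces completeness: an escalation without evidence or a proposed fix fails at construction time, not at 2 a.m. in the on-call engineer's inbox.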

The Human-AI Handoff

The Support OS is designed around seamless human-AI collaboration:

  • AI handles: Triage, known issue resolution, data gathering, response drafting.
  • Humans handle: Novel problems, judgment calls, policy exceptions, relationship-sensitive situations.
  • The handoff includes: Full context, investigation results, customer history, and draft communications. The human agent never starts from zero.

The system tracks which cases humans handle and why. Over time, if 80% of human handoffs for a particular issue type result in the same resolution, that resolution becomes a known pattern and the system handles it autonomously.
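The 80% rule can be computed directly from the handoff log. A sketch of the promotion check; the threshold and the minimum sample size are illustrative knobs:

```python
from collections import Counter
from typing import Optional

PROMOTION_THRESHOLD = 0.8
MIN_SAMPLES = 10  # illustrative: avoid promoting a pattern on a tiny sample

def promotable_resolution(handoffs: list[str]) -> Optional[str]:
    """Return the resolution to automate, if one dominates human handoffs
    for an issue type; otherwise None."""
    if len(handoffs) < MIN_SAMPLES:
        return None
    resolution, count = Counter(handoffs).most_common(1)[0]
    if count / len(handoffs) >= PROMOTION_THRESHOLD:
        return resolution
    return None

log = ["cache_invalidation"] * 9 + ["password_reset"]
winner = promotable_resolution(log)
```

The minimum-sample guard matters as much as the threshold: three identical handoffs are a coincidence, thirty are a pattern.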

Metrics

  • First-contact resolution rate: Cases resolved without human involvement.
  • Mean time to resolution: From intake to confirmed resolution.
  • Escalation rate: Percentage of cases requiring human involvement, trending over time.
  • Customer satisfaction: Post-interaction ratings correlated with resolution type (AI vs. human vs. hybrid).
  • Knowledge base growth: New known issues added per week, resolution pattern accuracy.
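The first three metrics fall out of a simple case log. A sketch; the field names are illustrative:

```python
# Each resolved case records who closed it and the time to resolution.
cases = [
    {"resolved_by": "ai", "minutes": 1},
    {"resolved_by": "ai", "minutes": 3},
    {"resolved_by": "human", "minutes": 45},
    {"resolved_by": "hybrid", "minutes": 20},
]

# Cases resolved without any human involvement.
first_contact_resolution = sum(c["resolved_by"] == "ai" for c in cases) / len(cases)
# Cases that required a human at any point (human or hybrid).
escalation_rate = sum(c["resolved_by"] != "ai" for c in cases) / len(cases)
# Mean time from intake to confirmed resolution.
mean_ttr = sum(c["minutes"] for c in cases) / len(cases)
```

Tracking these over time, segmented by issue category, is what makes the handoff-promotion loop above measurable rather than anecdotal.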

What Makes This an OS, Not a Chatbot

A support chatbot matches keywords to canned responses. A Support OS resolves problems: it triages, investigates, applies fixes, verifies results, communicates appropriately, learns from outcomes, and knows when to involve a human.

The OS model provides what chatbots lack: memory across interactions (the customer’s history), process management (investigation workflows), governance (data access policies, escalation rules), and learning (resolution pattern improvement). These are not chatbot features — they are system properties that emerge from the OS architecture.

Reference Implementation

The Support OS uses Semantic Kernel with a HandoffOrchestration — the natural pattern for support workflows where control transfers between triage, resolution, and escalation agents based on context.

Plugins: Support Tools

# plugins/support_tools.py
import json
from datetime import datetime, timedelta
from typing import Annotated
from semantic_kernel.functions import kernel_function

# kb_vector_store, db, cache_service, and log_service are service clients
# assumed to be initialized elsewhere (e.g. in the application's startup module).

class SupportPlugin:
    """Support operations with data governance built in."""

    @kernel_function(description="Search knowledge base for matching known issues.")
    async def search_known_issues(
        self, symptoms: Annotated[str, "Symptom description"]
    ) -> Annotated[str, "Matching known issues as JSON"]:
        results = await kb_vector_store.similarity_search(symptoms, k=5)
        return json.dumps([{
            "id": r.metadata["issue_id"],
            "title": r.metadata["title"],
            "resolution": r.metadata["resolution"],
            "confidence": r.metadata["score"],
        } for r in results])

    @kernel_function(description="Get customer context for a ticket (no PII export).")
    async def get_customer_context(
        self, ticket_id: Annotated[str, "Ticket ID"]
    ) -> Annotated[str, "Customer context as JSON"]:
        customer = await db.get_customer_for_ticket(ticket_id)
        # Governance: PII fields are NOT included
        return json.dumps({
            "plan": customer.plan,
            "tenure_months": customer.tenure_months,
            "past_tickets_count": len(customer.tickets),
            "recent_tickets": [
                {"date": str(t.created), "category": t.category}
                for t in customer.tickets[-5:]
            ],
        })

    @kernel_function(description="Invalidate cache for a customer account.")
    async def invalidate_cache(
        self,
        account_id: Annotated[str, "Account ID"],
        cache_type: Annotated[str, "Cache type to invalidate"],
    ) -> Annotated[str, "Confirmation"]:
        await cache_service.invalidate(account_id, cache_type)
        return f"Cache '{cache_type}' invalidated for account {account_id}"

    @kernel_function(description="Search application logs for a customer.")
    async def search_logs(
        self,
        query: Annotated[str, "Log search query"],
        account_id: Annotated[str, "Account ID"],
        hours: Annotated[int, "Hours to search back"] = 24,
    ) -> Annotated[str, "Matching log entries as JSON"]:
        logs = await log_service.search(
            query=query, account_id=account_id,
            start_time=datetime.now() - timedelta(hours=hours),
        )
        return json.dumps([
            {"timestamp": str(l.ts), "level": l.level, "message": l.message[:200]}
            for l in logs[:50]
        ])

Agents: Support Specialists

# agents/support_agents.py
from semantic_kernel.agents import ChatCompletionAgent
from plugins.support_tools import SupportPlugin

def create_triage_agent(service) -> ChatCompletionAgent:
    return ChatCompletionAgent(
        service=service,
        name="Triage",
        instructions="""You are a support triage specialist.
For each ticket: classify category and urgency, load customer context,
check for matching known issues.
If a known issue matches with >80% confidence, hand off to Resolver.
If the issue is critical or unknown, hand off to Investigator.
Always include customer context in your handoff.""",
        plugins=[SupportPlugin()],
    )

def create_resolver_agent(service) -> ChatCompletionAgent:
    return ChatCompletionAgent(
        service=service,
        name="Resolver",
        instructions="""You are a support resolver. Apply known fixes.
Execute safe, reversible operations (cache invalidation, session reset).
After applying a fix, verify it worked.
If the fix fails, hand off to Investigator.""",
        plugins=[SupportPlugin()],
    )

def create_investigator_agent(service) -> ChatCompletionAgent:
    return ChatCompletionAgent(
        service=service,
        name="Investigator",
        instructions="""You are a support investigator. Diagnose unknown issues.
Search logs, check system metrics, form hypotheses and test them.
If you identify the root cause and a fix is within your capabilities, fix it.
If the fix requires engineering intervention, hand off to Escalator.""",
        plugins=[SupportPlugin()],
    )

def create_communicator_agent(service) -> ChatCompletionAgent:
    return ChatCompletionAgent(
        service=service,
        name="Communicator",
        instructions="""You are a customer communication specialist.
Write empathetic, clear, solution-focused responses.
No jargon. No blame. Adapt tone to customer sentiment.
Include what happened, what was done, and next steps.""",
    )

Kernel: Handoff Orchestration

The support workflow uses HandoffOrchestration — control passes dynamically between agents based on the situation:

# agents/kernel.py
from semantic_kernel.agents import HandoffOrchestration
from semantic_kernel.agents.runtime import InProcessRuntime
from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion

from agents.support_agents import (
    create_triage_agent,
    create_resolver_agent,
    create_investigator_agent,
    create_communicator_agent,
)

async def handle_support_ticket(ticket_message: str) -> str:
    """
    Handoff orchestration: Triage → Resolver or Investigator → Communicator.
    The HandoffOrchestration pattern maps to the Support OS's
    dynamic routing based on issue classification.
    """
    service = AzureChatCompletion(
        deployment_name="gpt-4.1",
        endpoint="https://your-endpoint.openai.azure.com/",
    )

    triage = create_triage_agent(service)
    resolver = create_resolver_agent(service)
    investigator = create_investigator_agent(service)
    communicator = create_communicator_agent(service)

    # Handoff orchestration: agents transfer control dynamically
    orchestration = HandoffOrchestration(
        members=[triage, resolver, investigator, communicator],
    )

    runtime = InProcessRuntime()
    await runtime.start()

    result = await orchestration.invoke(
        task=f"Support ticket: {ticket_message}",
        runtime=runtime,
    )
    output = await result.get()

    await runtime.stop_when_idle()
    return output

Key patterns demonstrated: handoff orchestration (dynamic routing between triage/resolver/investigator), plugins with data governance (customer PII scoped out of responses), capability separation (resolver can execute fixes, investigator can search logs, communicator only writes text), and semantic search over known issues (vector store as episodic memory).


Try it yourself: The complete Support OS — agents (@triage, @resolver, @investigator, @communicator), skills (/triage-and-resolve, /escalation), escalation templates, and tutorial — is available at implementations/support-os/.

Knowledge OS

Every organization has more knowledge than it can use. Documents are written and never read. Decisions are made and their rationale forgotten. Expertise lives in people’s heads and leaves when they do. Lessons are learned and re-learned.

A Knowledge OS does not just store information — it makes organizational knowledge operational: findable, connectable, maintainable, and applicable at the moment it is needed.

The Domain

Knowledge management has historically failed because it treats knowledge as a storage problem. Create a wiki, fill it with documents, hope people search before they ask. The result is always the same: the wiki becomes a graveyard of outdated pages.

The Agentic OS model reframes knowledge management as an active process:

  • Capture: Automatically extract knowledge from where it is created — conversations, documents, code, decisions — rather than requiring manual entry.
  • Connect: Link related knowledge across sources and domains. The architecture decision from six months ago is related to the bug report from last week.
  • Maintain: Continuously validate, update, and retire knowledge. Detect when information becomes stale.
  • Deliver: Surface knowledge at the moment it is needed, in the context where it is useful, without requiring the user to search.

Architecture

Cognitive Kernel

The Knowledge OS kernel handles intents like:

  • “What is our policy on X?” → Policy retrieval with applicability assessment.
  • “Why did we decide to use Y?” → Decision archaeology — tracing back through documents, discussions, and commits.
  • “What do we know about Z?” → Comprehensive knowledge assembly from multiple sources.
  • “Document this decision.” → Capture structured knowledge from context.
  • “Is our documentation on W still accurate?” → Validation against current state.

Process Fabric

Knowledge workers:

  • Harvester: Monitors information sources — chat channels, document repositories, code commits, meeting notes — and extracts knowledge artifacts. Runs continuously in the background.
  • Curator: Evaluates harvested knowledge for quality, relevance, and novelty. Deduplicates, categorizes, and links to related knowledge.
  • Validator: Periodically checks existing knowledge against current reality. Is this API documentation still accurate? Does this process still work? Are these guidelines still followed?
  • Retriever: Finds and assembles knowledge in response to queries. Goes beyond keyword search — understands the question’s intent and assembles a comprehensive answer from multiple sources.
  • Author: Produces structured knowledge artifacts — documentation, guides, FAQs, onboarding materials — from raw knowledge.

Memory Plane

The Knowledge OS memory plane is the product. Unlike other OS variants where memory supports the work, here memory is the work.

  • Document store: The canonical repository of structured documents — policies, procedures, architecture decisions, technical specifications.
  • Knowledge graph: A network of concepts, relationships, and facts extracted from all sources. “Service A depends on Service B” is a relationship. “We chose PostgreSQL because of JSONB support” is a decision node linked to the technology node.
  • Provenance layer: Every piece of knowledge tracks its origin: who created it, when, from what source, and how confident the system is in its accuracy. Provenance enables trust assessment.
  • Freshness index: A timestamp-and-signal system that tracks how likely each piece of knowledge is to be current. Documentation updated last week is probably fresh. Documentation last modified two years ago and referencing a deprecated API version is probably stale.
  • Usage analytics: What knowledge is accessed frequently? What knowledge is never accessed? What questions are asked that have no answer in the knowledge base? These signals guide curation priorities.
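The freshness index can combine a time decay with explicit staleness signals such as references to deprecated APIs. A sketch — the half-life and penalty weighting are illustrative parameters, not a prescribed formula:

```python
from datetime import datetime, timedelta
from typing import Optional

HALF_LIFE_DAYS = 180  # illustrative: score halves every six months untouched

def freshness_score(updated_at: datetime, stale_signals: int = 0,
                    now: Optional[datetime] = None) -> float:
    """1.0 = just updated; decays with age and is halved per staleness signal
    (e.g. a referenced API version is deprecated, a linked channel is gone)."""
    now = now or datetime.now()
    age_days = max((now - updated_at).days, 0)
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS)
    penalty = 0.5 ** stale_signals
    return decay * penalty

now = datetime(2026, 4, 1)
fresh = freshness_score(now - timedelta(days=7), now=now)
stale = freshness_score(now - timedelta(days=730), stale_signals=1, now=now)
```

A continuous score, rather than a binary fresh/stale flag, lets the validator prioritize its sweeps: the lowest-scoring operational documents get checked first.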

Governance

Knowledge-specific policies:

  • Classification: Knowledge is classified by sensitivity (public, internal, confidential, restricted) and access is scoped accordingly.
  • Retention: Knowledge follows retention policies. Temporary project notes expire. Architectural decisions are retained permanently.
  • Accuracy accountability: Knowledge artifacts have owners. When a validator finds stale content, the owner is notified.
  • Source authority: For conflicting information, the system applies a priority order — official documentation overrides chat conversations, which override individual notes.
  • Redaction: Sensitive information (credentials, personal data, financial details) is detected and redacted before knowledge is stored or shared.
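The source-authority policy reduces to a ranked lookup when conflicting claims surface. A sketch; the rank values are illustrative:

```python
# Higher rank wins in a conflict (illustrative ordering from the policy above).
SOURCE_AUTHORITY = {
    "official_documentation": 3,
    "chat_conversation": 2,
    "individual_note": 1,
}

def resolve_conflict(claims: list[dict]) -> dict:
    """Pick the claim from the most authoritative source;
    unknown sources rank below everything."""
    return max(claims, key=lambda c: SOURCE_AUTHORITY.get(c["source"], 0))

winning = resolve_conflict([
    {"source": "individual_note", "value": "retention is 30 days"},
    {"source": "official_documentation", "value": "retention is 90 days"},
    {"source": "chat_conversation", "value": "retention is 60 days"},
])
```

In practice the losing claims are not discarded — they are linked to the winner with a "superseded by" relation so provenance survives the resolution.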

Workflow: Knowledge Capture

flowchart LR
  Source[Meeting / Document /\nConversation] --> H[Harvest\nExtract decisions,\nrationale, actions]
  H --> Cu[Curate\nLink, deduplicate,\ncheck conflicts]
  Cu --> Au[Author\nProduce structured\nknowledge artifact]
  Au --> KG[Knowledge Graph\nIndexed & linked]

The Meeting That Produces Knowledge

A team holds an architecture review meeting. In a traditional organization, the knowledge from this meeting lives in the attendees’ memories and maybe a sparse set of meeting notes that no one reads.

In a Knowledge OS:

1. Harvest

The harvester processes the meeting transcript (or notes) and extracts:

  • Decisions: “We decided to use event sourcing for the order service.”
  • Rationale: “Because we need complete audit trails and the ability to replay events for debugging.”
  • Alternatives considered: “We considered CRUD with audit tables but rejected it because of the complexity of retroactive corrections.”
  • Action items: “Alex will prototype the event store by next sprint.”
  • Open questions: “We need to determine the event retention policy.”

2. Curate

The curator:

  • Links the decision to the order service node in the knowledge graph.
  • Links the rationale to the compliance requirements node (audit trails).
  • Checks for conflicts: Does this decision contradict any existing architecture guidelines? (None found.)
  • Identifies related knowledge: The team evaluated event sourcing for the payment service six months ago and decided against it — link both decisions so future readers see the full picture.

3. Author

The author produces an Architecture Decision Record (ADR):

# ADR-047: Event Sourcing for Order Service

## Status: Accepted
## Date: 2026-04-03

## Context
The order service requires complete audit trails for regulatory compliance
and the ability to replay events for debugging production issues.

## Decision
We will use event sourcing as the persistence strategy for the order service.

## Alternatives Considered
- CRUD with audit tables: Rejected due to complexity of retroactive corrections.

## Consequences
- Complete event history available for audit and debugging.
- Increased storage requirements for event streams.
- Team will need to learn event sourcing patterns.

## Related
- ADR-031: CRUD chosen for payment service (different requirements).
- Compliance requirement CR-12: Complete transaction audit trail.

This ADR is automatically filed, indexed, and linked in the knowledge graph. No one had to manually write or file it.

Workflow: Knowledge Retrieval

The Question That Finds Its Answer

A new engineer asks: “Why does the order service use event sourcing instead of a regular database?”

1. Intent Interpretation

The retriever understands this is a “why” question about an architectural decision, not a “how” question about implementation.

2. Knowledge Assembly

The retriever:

  • Finds ADR-047 (primary answer).
  • Finds the related ADR-031 about the payment service (contrast — helps explain why the choice differs between services).
  • Finds the compliance requirement CR-12 (underlying driver).
  • Finds the team’s event sourcing guidelines (practical context for the new engineer).

3. Response

The system produces a contextualized answer:

“The order service uses event sourcing (decided in ADR-047, April 2026) primarily because of regulatory requirements for complete audit trails (CR-12). The team considered CRUD with audit tables but found retroactive corrections too complex with that approach.

Note that the payment service uses traditional CRUD (ADR-031) because it had different requirements — simpler state transitions and no retroactive correction needs.

For implementation details, see the Event Sourcing Guidelines in the engineering handbook.”

This is not a search result. It is an answer — synthesized from multiple sources, contextualized for the question, with provenance.

Workflow: Knowledge Maintenance

The Document That Ages

The validator runs a periodic sweep and flags:

  • API documentation v2.1: Last updated 14 months ago. The API is now on v3.0. Multiple endpoints have changed. Status: Stale. Owner notified.
  • Onboarding guide: References a Slack channel that was archived 6 months ago. Status: Partially stale. Specific section flagged.
  • Deployment runbook: References a CI/CD pipeline that was replaced last quarter. Status: Stale. High priority — operational document.
  • Architecture overview: All referenced services still exist. Dependency graph matches current reality. Status: Current.

The validator does not just check dates — it cross-references knowledge against the current state of the systems, repositories, and configurations it can access.
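That cross-referencing can be expressed as a set of named checks, each comparing a stored reference against the current world. A sketch with stubbed live state; the check names and document fields are illustrative:

```python
def check_api_version(doc: dict, live: dict) -> bool:
    # Does the document reference the currently deployed API version?
    return doc.get("api_version") == live.get("api_version")

def check_links(doc: dict, live: dict) -> bool:
    # Are all channels the document references still active?
    return all(ch in live.get("active_channels", [])
               for ch in doc.get("channels", []))

CHECKS = {"api_version": check_api_version, "links": check_links}

def validate(doc: dict, live: dict) -> str:
    """Classify a document as current, stale, or partially_stale."""
    failed = [name for name, check in CHECKS.items() if not check(doc, live)]
    if not failed:
        return "current"
    return "stale" if len(failed) == len(CHECKS) else "partially_stale"

doc = {"api_version": "v2.1", "channels": ["#support"]}
live = {"api_version": "v3.0", "active_channels": ["#support"]}
status = validate(doc, live)
```

The check registry is the extension point: adding a new staleness signal (a retired pipeline, an archived repository) means adding one function, not rewriting the validator.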

The Knowledge Flywheel

The Knowledge OS creates a reinforcing cycle:

flowchart LR
  A[More knowledge\ncaptured] --> B[Better retrieval\nresults]
  B --> C[More people\nuse the system]
  C --> D[Better usage\nanalytics]
  D --> E[Smarter\ncuration]
  E --> F[Higher quality\nknowledge]
  F --> G[More\ntrust]
  G --> A

  1. More knowledge captured → better retrieval results.
  2. Better retrieval results → more people use the system.
  3. More usage → better usage analytics → smarter curation.
  4. Smarter curation → higher quality knowledge → more trust.
  5. More trust → more knowledge contributed → back to step 1.

This flywheel is why the OS model matters. A static knowledge base has no flywheel — it degrades over time. An active Knowledge OS improves over time because every interaction makes the system smarter about what knowledge matters, how it connects, and when it is needed.

What Makes This an OS, Not a Wiki

A wiki stores pages. A Knowledge OS manages knowledge: it captures it from where it is created, connects it across domains, maintains it against drift, delivers it where it is needed, and learns from usage.

The OS provides what wikis lack: active processes (harvesting, curation, validation), structured memory (knowledge graphs, provenance, freshness), governance (classification, retention, accuracy), and adaptation (usage-driven curation, automated maintenance). The wiki asks humans to do all of this manually. The Knowledge OS automates the lifecycle while keeping humans in control of what matters — the knowledge itself.

Reference Implementation

The Knowledge OS centers on the memory plane — here, memory is the product. This implementation uses Semantic Kernel agents for harvesting and validation, with a PostgreSQL + pgvector knowledge store.

Plugin: Knowledge Store

# plugins/knowledge_store.py
import json
from typing import Annotated
from semantic_kernel.functions import kernel_function
import asyncpg

# embed() is an async embedding helper, assumed to be provided elsewhere.

class KnowledgeStorePlugin:
    """Knowledge graph operations with classification-based access control."""

    def __init__(self, pool: asyncpg.Pool):
        self.pool = pool

    @kernel_function(description="Store a knowledge artifact with embedding.")
    async def store_artifact(
        self,
        title: Annotated[str, "Artifact title"],
        content: Annotated[str, "Artifact content"],
        source: Annotated[str, "Origin: meeting, commit, document"],
        tags: Annotated[str, "Comma-separated tags"],
        classification: Annotated[str, "public, internal, or confidential"] = "internal",
    ) -> Annotated[str, "Confirmation with artifact ID"]:
        import uuid
        artifact_id = str(uuid.uuid4())
        embedding = await embed(content)
        await self.pool.execute("""
            INSERT INTO knowledge_artifacts
                (id, title, content, source, classification, tags, embedding, created_at)
            VALUES ($1, $2, $3, $4, $5, $6, $7, NOW())
        """, artifact_id, title, content, source, classification,
             tags.split(","), embedding)
        return f"Stored artifact {artifact_id}: {title}"

    @kernel_function(description="Search knowledge with access control.")
    async def search(
        self,
        query: Annotated[str, "Search query"],
        max_classification: Annotated[str, "Max classification level"] = "internal",
        limit: Annotated[int, "Max results"] = 10,
    ) -> Annotated[str, "Search results as JSON"]:
        embedding = await embed(query)
        levels = {"public": 0, "internal": 1, "confidential": 2}
        # Filter on the stored classification column: everything at or
        # below the caller's maximum level is visible.
        allowed = [c for c, lvl in levels.items()
                   if lvl <= levels[max_classification]]
        rows = await self.pool.fetch("""
            SELECT id, title, content, source, confidence,
                   1 - (embedding <=> $1) AS similarity
            FROM knowledge_artifacts
            WHERE classification = ANY($2)
            ORDER BY embedding <=> $1 LIMIT $3
        """, embedding, allowed, limit)
        return json.dumps([dict(r) for r in rows])

    @kernel_function(description="Link two knowledge artifacts.")
    async def link_artifacts(
        self,
        from_id: Annotated[str, "Source artifact ID"],
        to_id: Annotated[str, "Target artifact ID"],
        relation: Annotated[str, "Relationship type"],
    ) -> Annotated[str, "Confirmation"]:
        await self.pool.execute("""
            INSERT INTO knowledge_links (from_id, to_id, relation)
            VALUES ($1, $2, $3) ON CONFLICT DO NOTHING
        """, from_id, to_id, relation)
        return f"Linked {from_id} → {to_id} ({relation})"

    @kernel_function(description="Find stale artifacts for review.")
    async def find_stale(
        self, max_age_days: Annotated[int, "Max age in days"] = 90,
    ) -> Annotated[str, "Stale artifacts as JSON"]:
        rows = await self.pool.fetch("""
            SELECT id, title, source, updated_at,
                   NOW() - updated_at AS age
            FROM knowledge_artifacts
            WHERE updated_at < NOW() - make_interval(days => $1)
            ORDER BY age DESC LIMIT 50
        """, max_age_days)
        return json.dumps([dict(r) for r in rows])

Agents: Knowledge Workers

# agents/knowledge_agents.py
from semantic_kernel.agents import ChatCompletionAgent
from plugins.knowledge_store import KnowledgeStorePlugin

def create_harvester_agent(service, store_plugin) -> ChatCompletionAgent:
    return ChatCompletionAgent(
        service=service,
        name="Harvester",
        instructions="""You extract structured knowledge from raw content.
For each input, identify:
- Decisions made and their rationale
- Facts and data points with sources
- Action items and owners
- Relationships to existing knowledge
Store each as a knowledge artifact using the store function.""",
        plugins=[store_plugin],
    )

def create_validator_agent(service, store_plugin) -> ChatCompletionAgent:
    return ChatCompletionAgent(
        service=service,
        name="Validator",
        instructions="""You validate knowledge freshness.
For each stale artifact: check if referenced entities still exist,
compare stored content with current reality.
Report status: current, stale, or partially_stale.
Suggest updates for stale artifacts.""",
        plugins=[store_plugin],
    )

def create_retriever_agent(service, store_plugin) -> ChatCompletionAgent:
    return ChatCompletionAgent(
        service=service,
        name="Retriever",
        instructions="""You answer questions using the knowledge base.
Search for relevant artifacts, synthesize a contextualized answer,
and cite sources. If no relevant knowledge exists, say so clearly.""",
        plugins=[store_plugin],
    )

Kernel: Knowledge Workflows

# agents/kernel.py
from semantic_kernel.agents import SequentialOrchestration, ChatCompletionAgent
from semantic_kernel.agents.runtime import InProcessRuntime
from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion

from agents.knowledge_agents import create_harvester_agent, create_validator_agent
from plugins.knowledge_store import KnowledgeStorePlugin

# pool is the shared asyncpg connection pool, assumed to be created at startup.

async def capture_knowledge(raw_content: str, source_type: str) -> str:
    """Harvest → Curate pipeline for knowledge capture."""
    service = AzureChatCompletion(deployment_name="gpt-4.1",
                                  endpoint="https://your-endpoint.openai.azure.com/")
    store_plugin = KnowledgeStorePlugin(pool)

    harvester = create_harvester_agent(service, store_plugin)
    curator = ChatCompletionAgent(
        service=service,
        name="Curator",
        instructions="""Review harvested artifacts for quality.
Deduplicate, adjust tags, check for conflicts with existing knowledge.
Link related artifacts using the link function.""",
        plugins=[store_plugin],
    )

    orchestration = SequentialOrchestration(members=[harvester, curator])
    runtime = InProcessRuntime()
    await runtime.start()

    result = await orchestration.invoke(
        task=f"Source type: {source_type}\n\nContent:\n{raw_content}",
        runtime=runtime,
    )
    output = await result.get()
    await runtime.stop_when_idle()
    return output

async def validate_knowledge() -> str:
    """Periodic freshness validation of the knowledge base."""
    service = AzureChatCompletion(deployment_name="gpt-4.1-mini",
                                  endpoint="https://your-endpoint.openai.azure.com/")
    store_plugin = KnowledgeStorePlugin(pool)
    validator = create_validator_agent(service, store_plugin)

    # Direct agent invocation: a single-agent task needs no orchestration runtime
    response = await validator.get_response(
        "Find and validate all stale artifacts older than 90 days."
    )
    return str(response)

Key patterns: knowledge store as SK plugin (@kernel_function for search, store, link, validate), classification-based access control (pgvector queries filtered by classification level), harvester-curator pipeline (SequentialOrchestration), direct agent invocation (single-agent validation without orchestration overhead).


Try it yourself: The complete Knowledge OS — agents (@harvester, @curator, @validator, @retriever), skills (/harvest-knowledge, /validate-freshness), and tutorial — is available at implementations/knowledge-os/.

Multi-OS Coordination

The previous chapters described individual operating systems — each optimized for a single domain. But real organizations do not operate in single domains. A customer support case reveals a bug that requires engineering. A research finding changes the product strategy that changes the codebase that changes the documentation. Work flows across boundaries.

This chapter examines what happens when multiple Agentic OSs must work together.

The Coordination Problem

Each Agentic OS is designed for independence: its own kernel, its own memory, its own governance, its own process fabric. This independence is a strength — it allows each OS to optimize for its domain without compromise. But it creates a problem when work crosses domains.

Consider this scenario: A customer reports that the export feature produces corrupted CSV files. This involves:

  • Support OS: Receives the report, triages, gathers customer context.
  • Coding OS: Investigates the bug, writes a fix, runs tests.
  • Knowledge OS: Updates the known issues documentation and the troubleshooting guide.

Three operating systems, one workflow. How do they coordinate?

Federation Architecture

Multi-OS coordination follows a federation model: independent systems that collaborate through negotiated protocols.

The Federation Bus

The federation bus is the communication layer between operating systems. It carries:

  • Work requests: “Support OS to Coding OS: investigate CSV export corruption. Here is the customer report, reproduction steps, and relevant account data.”
  • Status updates: “Coding OS to Support OS: bug identified. Fix in progress. Estimated resolution: 2 hours.”
  • Results: “Coding OS to Support OS: fix deployed. Here is the change summary and verification results.”
  • Knowledge events: “Coding OS to Knowledge OS: new known issue — CSV export corruption caused by encoding mismatch in v3.2. Resolution: patch v3.2.1.”

Message Format

Inter-OS messages use a standard format:

flowchart LR
  subgraph Message
    direction TB
    Header["from: support-os \u2192 to: coding-os\ntype: work_request \u00b7 priority: high"]
    subgraph Payload
      Intent["Intent: Investigate and fix\nCSV export corruption"]
      Context["Context: customer_report,\nreproduction_steps,\naffected_versions: 3.2.0"]
      Constraints["Constraints:\ndata: customer_data_redacted\nurgency: urgent"]
    end
    Callback["Callbacks:\non_status_change \u2192 support-os/cases/4521/status\non_completion \u2192 support-os/cases/4521/resolution"]
  end
  SOS[Support OS] -->|sends| Message
  Message -->|received by| COS[Coding OS]

The message carries enough context for the receiving OS to act without knowing the sender’s internal state. It specifies constraints (data classification, urgency) that map to the receiver’s governance policies. And it includes callbacks so the sender is notified of progress.
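Rendered as data, the message in the diagram might look like the following sketch. The field names and nesting are illustrative assumptions, not a fixed wire format:

```python
import json

# Illustrative inter-OS message mirroring the diagram above.
message = {
    "header": {
        "from": "support-os",
        "to": "coding-os",
        "type": "work_request",
        "priority": "high",
    },
    "payload": {
        "intent": "Investigate and fix CSV export corruption",
        "context": {
            "customer_report": "...",
            "reproduction_steps": "...",
            "affected_versions": ["3.2.0"],
        },
        "constraints": {"data": "customer_data_redacted", "urgency": "urgent"},
    },
    "callbacks": {
        "on_status_change": "support-os/cases/4521/status",
        "on_completion": "support-os/cases/4521/resolution",
    },
}

# Serialize for transport over the federation bus.
wire = json.dumps(message)
```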

Capability Discovery

Before OS A can send work to OS B, it must know what OS B can do. Capability discovery operates through a registry:

flowchart TD
  subgraph Registry["OS Capability Registry"]
    subgraph COS["coding-os"]
      CC["Capabilities: bug investigation,\nfeature development, code review, deployment"]
      CA["Accepts: work_request, information_request"]
      CS["SLA: bugs 4h, features 1-5d"]
      CG["Governance: up to confidential,\napproval for production"]
    end
    subgraph KOS["knowledge-os"]
      KC["Capabilities: doc update,\nknowledge retrieval, validation"]
      KA["Accepts: knowledge_event, query"]
      KS["SLA: docs 1h, retrieval seconds"]
      KG["Governance: up to internal"]
    end
  end
  OS_A[Requesting OS] -->|discover| Registry

Each OS publishes its capabilities, accepted message types, service level expectations, and governance constraints. Senders can discover what is available, what it costs, and what rules apply — without knowing the receiver’s internal architecture.
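A minimal sketch of such a registry, assuming the entries shown in the diagram (the SLA strings and classification names are illustrative):

```python
# Capability registry sketch: each OS publishes what it can do, what it
# accepts, and the governance ceiling it operates under.
registry = {
    "coding-os": {
        "capabilities": ["bug investigation", "feature development",
                         "code review", "deployment"],
        "accepts": ["work_request", "information_request"],
        "sla": {"bugs": "4h", "features": "1-5d"},
        "max_classification": "confidential",
    },
    "knowledge-os": {
        "capabilities": ["doc update", "knowledge retrieval", "validation"],
        "accepts": ["knowledge_event", "query"],
        "sla": {"docs": "1h", "retrieval": "seconds"},
        "max_classification": "internal",
    },
}

def discover(capability: str) -> list[str]:
    """Return the names of every OS that advertises a capability."""
    return [name for name, entry in registry.items()
            if capability in entry["capabilities"]]
```

A sender can then call `discover("deployment")` to find a target OS without knowing anything about its internal architecture.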

Coordination Patterns

Request-Response

The simplest pattern. OS A sends a work request, OS B processes it, OS B sends the result back. The Support OS requests a bug fix, the Coding OS delivers it.

This works for well-defined, self-contained work. It fails when the work requires ongoing collaboration.

Event-Driven

OSs publish events when significant things happen. Other OSs subscribe to relevant events and react.

  • Coding OS publishes: “Deployment completed: v3.2.1 with CSV fix.”
  • Support OS subscribes: Updates open cases related to the CSV bug.
  • Knowledge OS subscribes: Updates documentation and known issues.

Event-driven coordination is loosely coupled — publishers do not know who subscribes. This makes it easy to add new consumers without modifying producers.
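The decoupling can be sketched with a minimal in-process event bus. The event and handler names are illustrative:

```python
from collections import defaultdict
from typing import Callable

class EventBus:
    """Minimal pub/sub sketch: publishers emit named events, subscribers react."""

    def __init__(self):
        self._subscribers: dict = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict):
        # The publisher does not know (or care) who is listening.
        for handler in self._subscribers[event_type]:
            handler(payload)

bus = EventBus()
log: list[str] = []

# Two independent consumers subscribe; adding a third requires no producer change.
bus.subscribe("deployment_completed",
              lambda e: log.append(f"support-os: update cases for {e['version']}"))
bus.subscribe("deployment_completed",
              lambda e: log.append(f"knowledge-os: refresh docs for {e['version']}"))

bus.publish("deployment_completed", {"version": "v3.2.1", "note": "CSV fix"})
```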

Choreography

Multiple OSs collaborate through a series of events without a central coordinator. Each OS knows its role and reacts to events from others:

sequenceDiagram
  participant S as Support OS
  participant C as Coding OS
  participant K as Knowledge OS

  S->>S: Receives customer report
  S--)C: publishes "bug_reported"
  C->>C: Investigates bug
  C--)S: publishes "bug_fixed"
  C--)K: publishes "bug_fixed"
  K->>K: Updates documentation
  K--)S: publishes "docs_updated"
  S->>S: Notifies customer & closes case

Choreography works when the workflow is well-known and each participant’s role is clear. It becomes fragile when workflows are complex or when failures require coordinated recovery.

Orchestration

A coordinator OS (or a dedicated federation orchestrator) manages the workflow. It sends requests to each OS, monitors progress, handles failures, and ensures the workflow completes.

flowchart TD
  O1["1. Receive bug report\nfrom Support OS"] --> O2["2. Send investigation request\nto Coding OS"]
  O2 --> O3["3. Wait for fix confirmation"]
  O3 --> O4["4. Send documentation update\nto Knowledge OS"]
  O4 --> O5["5. Send resolution notification\nto Support OS"]
  O5 --> O6["6. Close workflow"]

Orchestration is more robust for complex workflows — the orchestrator maintains the overall state and can handle failures (retry, skip, escalate) with full visibility into the workflow’s progress.
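The steps in the diagram can be sketched as a coordinator coroutine that owns the workflow state. The `send` helper stands in for a real federation-bus call and is purely illustrative:

```python
import asyncio

async def send(target_os: str, request: str) -> str:
    """Simulated request to another OS; stands in for network I/O."""
    await asyncio.sleep(0)
    return f"{target_os}: done ({request})"

async def run_bug_workflow(report: str) -> list[str]:
    # The orchestrator holds the overall state, so it can retry, skip,
    # or escalate any individual step with full visibility.
    state: list[str] = []
    state.append(await send("coding-os", f"investigate: {report}"))
    state.append(await send("knowledge-os", "update known-issues docs"))
    state.append(await send("support-os", "notify customer & close case"))
    return state

steps = asyncio.run(run_bug_workflow("CSV export corruption"))
```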

Saga Pattern

For long-running, multi-OS workflows that may partially fail, the saga pattern provides compensating actions:

flowchart TD
  S1["1. Support OS: Reserve case"] -->|success| S2["2. Coding OS: Fix bug"]
  S2 -->|success| S3["3. Knowledge OS: Update docs"]
  S3 -->|failure| R["Retry: Knowledge OS\nretries documentation update"]
  R -->|success| Done[Workflow Complete]
  R -->|failure| Esc["Escalate to human for\nmanual documentation update"]
  S3 -->|success| Done

Each step has a compensating action defined. If a step fails, previous steps are compensated if necessary, or the workflow adapts. The key insight: not every failure requires rollback. A bug fix is valuable even if the documentation update fails. The saga pattern acknowledges partial success.
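A minimal saga sketch for the flow above: each step carries a fallback, and a failed documentation update retries and then escalates rather than rolling back the still-valuable bug fix. All names are illustrative:

```python
def run_saga(steps, max_retries=1):
    """Run (name, action, fallback) steps; escalate instead of rolling back."""
    completed, escalations = [], []
    for name, action, fallback in steps:
        attempts = 0
        while True:
            try:
                action()
                completed.append(name)
                break
            except RuntimeError:
                attempts += 1
                if attempts > max_retries:
                    # Partial success is acknowledged, not undone.
                    escalations.append(fallback)
                    break
    return completed, escalations

calls = {"docs": 0}
def flaky_docs_update():
    # Fails once (simulated outage), then succeeds on retry.
    calls["docs"] += 1
    if calls["docs"] == 1:
        raise RuntimeError("knowledge-os unavailable")

completed, escalations = run_saga([
    ("reserve_case", lambda: None, "reopen case"),
    ("fix_bug", lambda: None, "revert deployment"),
    ("update_docs", flaky_docs_update, "human updates docs"),
])
```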

Cross-OS Governance

When work crosses OS boundaries, governance becomes complex. Each OS has its own policies, but the inter-OS workflow needs additional governance.

Data Classification at Boundaries

Customer data from the Support OS may be classified as confidential. The Coding OS may have a policy that confidential data does not enter debug logs. The federation layer must enforce data classification as information crosses boundaries.

Rules:

  • Data classification travels with the data.
  • The receiving OS must honor the classification or reject the data.
  • Data can be reclassified at boundaries (e.g., redacted to lower classification) but never silently upgraded.
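These rules can be sketched as a boundary check. The level names match the reference implementation later in this chapter; the redaction step is an illustrative assumption:

```python
LEVELS = {"public": 0, "internal": 1, "confidential": 2}

def cross_boundary(payload: dict, classification: str, receiver_max: str) -> dict:
    """Enforce classification at a federation boundary."""
    if LEVELS[classification] <= LEVELS[receiver_max]:
        # Classification travels with the data.
        return {"classification": classification, "payload": payload}
    # Reclassify downward by redaction; never silently upgrade.
    redacted = {k: "[REDACTED]" for k in payload}
    return {"classification": receiver_max, "payload": redacted}

msg = cross_boundary({"customer_email": "a@example.com"},
                     classification="confidential", receiver_max="internal")
```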

Cross-OS Audit

The audit trail must span OS boundaries. When the Support OS triggers a bug fix in the Coding OS that triggers a documentation update in the Knowledge OS, the complete chain must be traceable.

Each OS maintains its internal audit log. The federation layer maintains a cross-OS correlation ID that links related entries across logs:

sequenceDiagram
  participant SOS as Support OS
  participant COS as Coding OS
  participant KOS as Knowledge OS
  Note over SOS,KOS: Correlation ID: fed-2026-04-03-4521
  SOS->>SOS: Case opened
  SOS->>COS: Escalate to engineering
  COS->>COS: Investigation started
  COS->>COS: Bug found
  COS->>COS: Fix deployed
  COS->>KOS: Document known issue
  KOS->>KOS: Documentation updated
  KOS->>KOS: Known issue added
  COS-->>SOS: Resolution delivered
  SOS->>SOS: Customer notified
  SOS->>SOS: Case closed

Cross-OS Authorization

When OS A asks OS B to perform an action, who authorizes it? Options:

  • Delegated authority: OS A’s operator authorized the work. OS B trusts OS A’s authorization within defined limits.
  • Independent authorization: OS B requires its own operator to approve, regardless of OS A’s request. Used for high-risk actions.
  • Policy-based: Pre-agreed policies determine which requests are auto-approved and which require explicit authorization. “Bug fixes with severity ≥ high are auto-approved. Feature requests require Coding OS operator approval.”
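The policy-based option can be sketched as a rule table evaluated at the boundary. The rule shapes and field names are illustrative:

```python
SEVERITY = {"low": 0, "medium": 1, "high": 2, "critical": 3}

def authorize(request: dict) -> str:
    """Pre-agreed policy: which cross-OS requests auto-approve."""
    # Delegated authority within agreed limits: severe bug fixes auto-approve.
    if (request["type"] == "bug_fix"
            and SEVERITY[request["severity"]] >= SEVERITY["high"]):
        return "auto_approved"
    # Everything else requires independent authorization by an operator.
    return "needs_operator_approval"

decision = authorize({"type": "bug_fix", "severity": "critical"})
```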

Failure Handling

Multi-OS failures are harder than single-OS failures because the failure may be in the communication, not in any individual OS.

Communication Failures

  • Timeout: OS A sent a request but OS B did not respond. Is OS B down, or just slow? The federation layer implements exponential backoff with a deadline.
  • Message loss: The request was lost in transit. The federation layer uses at-least-once delivery with idempotency checks.
  • Partial response: OS B sent results but the connection dropped midway. The federation layer uses chunked responses with resume capability.
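The first two mitigations can be sketched together: exponential backoff bounded by a deadline, and at-least-once delivery made safe by an idempotency check. The function names and timing constants are illustrative:

```python
import time

processed: set[str] = set()

def handle_once(message_id: str, apply) -> bool:
    """Idempotency check: a redelivered message is acknowledged, not re-applied."""
    if message_id in processed:
        return False
    apply()
    processed.add(message_id)
    return True

def send_with_backoff(attempt_fn, deadline_s=1.0, base_delay_s=0.01):
    """Retry with exponential backoff until a hard deadline."""
    start, delay = time.monotonic(), base_delay_s
    while True:
        try:
            return attempt_fn()
        except TimeoutError:
            if time.monotonic() - start + delay > deadline_s:
                raise  # deadline exceeded: surface the failure upward
            time.sleep(delay)
            delay *= 2
```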

Semantic Failures

  • Misunderstood request: OS A asked for X but OS B interpreted it as Y. The standard message format and capability registry reduce this risk, but it cannot be eliminated. Verification steps — where OS A checks OS B’s interpretation before execution — catch misunderstandings early.
  • Conflicting actions: Two OSs take actions that conflict. The Coding OS deploys a fix while the Support OS tells the customer the issue is still under investigation. Coordination timestamps and status synchronization prevent this.

The Organization as a System

Zoom out far enough, and the collection of coordinated Agentic OSs is an organization’s operational intelligence. The federation layer is the nervous system connecting specialized organs.

This perspective reveals design principles:

  • Specialize deeply, coordinate loosely. Each OS should be excellent at its domain and minimally dependent on others. Loose coupling through events and standard messages.
  • Fail independently, recover collectively. A failure in the Knowledge OS should not bring down the Coding OS. But recovery from a cross-OS workflow failure requires coordination.
  • Share knowledge, not state. OSs share knowledge events (“a bug was fixed”), not internal state (“this is my current task graph”). This preserves independence.
  • Govern at every boundary. Trust between OSs is not implicit. Every data exchange, every work request, every status update passes through governance checks.

When Not to Federate

Federation adds complexity. Before creating multiple OSs, ask:

  • Is the domain separation real? If the same team does support and coding, one OS with multiple skill sets may be simpler than two federated OSs.
  • Is the data separation necessary? If all OSs need the same data with the same access policies, the federation boundary creates friction without value.
  • Is independent evolution needed? If the systems change on different schedules with different teams, federation is justified. If one team builds everything, a monolithic OS with good internal boundaries may be better.

Federation is the right architecture when the organizational structure, security boundaries, and evolution timelines genuinely differ across domains. It is the wrong architecture when it is chosen for elegance rather than necessity.

Reference Implementation

Multi-OS coordination requires a federation layer. Each OS is an independent Semantic Kernel application. The federation bus connects them via a message protocol.

Federation Bus

# federation/bus.py
import asyncio
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class MessageType(Enum):
    WORK_REQUEST = "work_request"
    STATUS_UPDATE = "status_update"
    RESULT = "result"
    EVENT = "event"

@dataclass
class FederationMessage:
    id: str
    correlation_id: str
    from_os: str
    to_os: str
    type: MessageType
    priority: str
    payload: dict
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())
    data_classification: str = "internal"

class FederationBus:
    """Routes messages between independent Agentic OSs."""

    def __init__(self):
        self.queues: dict[str, asyncio.Queue] = {}
        self.audit_log: list[FederationMessage] = []
        self.registry: dict[str, dict] = {}

    def register_os(self, name: str, capabilities: list[str],
                    max_classification: str = "internal"):
        self.queues[name] = asyncio.Queue()
        self.registry[name] = {
            "capabilities": capabilities,
            "max_classification": max_classification,
        }

    async def send(self, message: FederationMessage):
        """Send with data classification enforcement at boundary."""
        receiver = self.registry.get(message.to_os, {})
        levels = {"public": 0, "internal": 1, "confidential": 2}
        if levels.get(message.data_classification, 0) > \
           levels.get(receiver.get("max_classification", "internal"), 1):
            raise PermissionError(
                f"Data '{message.data_classification}' exceeds "
                f"'{message.to_os}' clearance"
            )
        self.audit_log.append(message)
        await self.queues[message.to_os].put(message)

    async def receive(self, os_name: str) -> FederationMessage:
        return await self.queues[os_name].get()

    def find_os_for(self, capability: str) -> str | None:
        for name, info in self.registry.items():
            if capability in info["capabilities"]:
                return name
        return None

Federation Plugin for SK Agents

Each OS exposes a federation plugin that allows its agents to request help from other OSs:

# plugins/federation_plugin.py
import asyncio
import json
import uuid
from typing import Annotated

from semantic_kernel.functions import kernel_function

from federation.bus import FederationMessage, MessageType

class FederationPlugin:
    """Allows agents to request capabilities from other OSs."""

    def __init__(self, bus, this_os: str):
        self.bus = bus
        self.this_os = this_os

    @kernel_function(description="Request a capability from another OS.")
    async def request_capability(
        self,
        capability: Annotated[str, "Capability to request"],
        description: Annotated[str, "Description of what is needed"],
        priority: Annotated[str, "Priority: critical, high, medium, low"] = "medium",
    ) -> Annotated[str, "Response from the other OS"]:
        target = self.bus.find_os_for(capability)
        if not target:
            return f"No OS provides capability: {capability}"

        correlation_id = str(uuid.uuid4())
        await self.bus.send(FederationMessage(
            id=str(uuid.uuid4()),
            correlation_id=correlation_id,
            from_os=self.this_os,
            to_os=target,
            type=MessageType.WORK_REQUEST,
            priority=priority,
            payload={"capability": capability, "description": description},
        ))

        result = await asyncio.wait_for(
            self.bus.receive(self.this_os), timeout=300
        )
        return json.dumps(result.payload)

    @kernel_function(description="Discover available capabilities across all OSs.")
    async def discover_capabilities(
        self,
    ) -> Annotated[str, "Available capabilities as JSON"]:
        return json.dumps({
            name: info["capabilities"]
            for name, info in self.bus.registry.items()
        })

Cross-OS Workflow: Bug Resolution

# federation/workflows/bug_resolution.py
"""
Cross-OS workflow using SK agents with federation plugin.
Support OS → Coding OS → Knowledge OS → Support OS
"""
import json

from semantic_kernel.agents import ChatCompletionAgent
from semantic_kernel.agents.runtime import InProcessRuntime
from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion

from federation.audit import log_correlation_trace
from federation.bus import FederationBus
from plugins.federation_plugin import FederationPlugin

async def bug_resolution(bus: FederationBus, ticket: dict):
    """Orchestrate bug resolution across three OSs."""
    service = AzureChatCompletion(
        deployment_name="gpt-4.1",
        endpoint="https://your-endpoint.openai.azure.com/",
    )

    # Coordinator agent with federation capability
    coordinator = ChatCompletionAgent(
        service=service,
        name="Coordinator",
        instructions="""You coordinate bug resolution across OSs.
1. Ask support-os to triage the ticket
2. Ask coding-os to investigate and fix the bug
3. Ask knowledge-os to update documentation
4. Ask support-os to notify the customer
Use the federation plugin to communicate with each OS.
Redact customer PII before sending to coding-os or knowledge-os.""",
        plugins=[FederationPlugin(bus, "orchestrator")],
    )

    runtime = InProcessRuntime()
    await runtime.start()

    response = await coordinator.get_response(
        messages=f"Bug ticket: {json.dumps(ticket)}"
    )

    await runtime.stop_when_idle()

    # Log cross-OS audit trail. The federation plugin generates a fresh
    # correlation_id per request, so collect every message this coordinator
    # routed rather than filtering on a single correlation_id.
    trace = [m for m in bus.audit_log
             if "orchestrator" in (m.from_os, m.to_os)]
    log_correlation_trace(trace)

    return str(response)

Cross-OS Audit

# federation/audit.py
from federation.bus import FederationMessage

def log_correlation_trace(messages: list[FederationMessage]) -> dict:
    """Build an end-to-end audit trail across OS boundaries."""
    return {
        "started_at": messages[0].timestamp if messages else None,
        "completed_at": messages[-1].timestamp if messages else None,
        "os_involved": sorted({m.from_os for m in messages} |
                              {m.to_os for m in messages}),
        "message_count": len(messages),
        "events": [
            {
                "timestamp": m.timestamp,
                "from": m.from_os,
                "to": m.to_os,
                "type": m.type.value,
                "action": m.payload.get("capability", m.payload.get("action")),
            }
            for m in messages
        ],
    }

Key patterns: federation bus (message routing with data classification enforcement), federation plugin (@kernel_function enabling agents to call across OS boundaries), coordinator agent (single agent with federation capability orchestrating the cross-OS workflow), cross-OS audit trail (correlation IDs linking events across independent systems).


Try it yourself: The complete Multi-OS coordination — @coordinator agent with 4 domain delegates, federation governance instructions, and tutorial — is available at implementations/multi-os/.

GitHub Copilot and Claude Code as Agentic Operating Systems

The Agentic OS model is not a theoretical abstraction waiting for implementation. Two of the most widely adopted AI coding assistants — GitHub Copilot and Anthropic’s Claude Code — have independently converged on architectural patterns that mirror the layers, abstractions, and design patterns described throughout this book. Neither was designed by reading this model. Both arrived at similar structures because the problems they solve — coordinating autonomous work across tools, memory, governance, and human collaboration — demand it.

This chapter maps each system to the Agentic OS architecture, then contrasts and compares them through the lens of the design patterns catalogued in Part III.

GitHub Copilot

GitHub Copilot has evolved from an inline code completion tool into a full agentic system. In its current form — particularly in Copilot Chat’s agent mode and the Copilot Coding Agent — it exhibits all the core layers of an Agentic OS.

Cognitive Kernel

Copilot’s cognitive kernel is its agent mode orchestration layer. When a user issues a request like “fix the failing tests in this module,” the system does not simply generate code. It enters a loop:

  1. Perceive: Gather context from the open file, workspace structure, terminal output, and diagnostic errors.
  2. Interpret: Classify the request — is this a fix, a feature, a refactor, a question?
  3. Plan: Determine a sequence of actions — read the test file, read the implementation, identify the failure, propose a fix, apply it, run the tests.
  4. Execute: Invoke tools (file reads, terminal commands, code edits) in sequence.
  5. Monitor: Check the results — did the tests pass? Did new errors appear?
  6. Adapt: If tests still fail, analyze the new error and adjust the approach.

This is the Kernel Loop (perceive → interpret → plan → delegate → monitor → consolidate → adapt) described in Chapter 8. Copilot iterates autonomously, retrying with modified strategies when its first attempt does not produce a passing test suite.

The Intent Router pattern is visible in how Copilot classifies incoming requests. A simple question (“what does this function do?”) is handled inline with no tool invocation. A complex request (“add authentication to this API”) triggers multi-step planning with file discovery, code generation, and verification. The system routes requests by complexity, mapping each class of request to an appropriate execution strategy.

The Planner-Executor Split manifests in the separation between Copilot’s reasoning about what to do and its execution of individual tool calls. The planning phase produces a visible sequence of intended steps. The execution phase carries them out through discrete tool invocations.

The Reflective Retry pattern is one of Copilot’s most visible kernel behaviors. When a code edit introduces a compilation error or a test failure, the system does not blindly retry the same edit. It reads the error output, diagnoses whether the failure is a syntax issue, a missing import, a type mismatch, or a logic error, and modifies its approach accordingly.

Process Fabric

Copilot implements the process fabric through its subagent architecture. In VS Code’s agent mode, the system can delegate work to specialized agents:

  • @workspace: Searches and reasons across the entire codebase.
  • Custom agents (defined via .agent.md files): Scoped to specific tasks with their own instructions, tool restrictions, and context.
  • The Copilot Coding Agent: An autonomous worker that runs in a GitHub Actions environment, operating on its own branch with full tool access but isolated from the main codebase.

Each of these maps to the Subagent as Process pattern. They have bounded context (the agent receives only relevant files and instructions), scoped capabilities (each agent’s tool access is explicitly configured), and defined lifecycle (the agent completes its task and returns a result).

The Context Sandbox pattern appears in how Copilot assembles context for each agent invocation. Rather than dumping the entire codebase into the prompt, the system curates a focused context package: the relevant files, the current errors, the user’s instruction, and applicable configuration from .github/copilot-instructions.md and .instructions.md files. Irrelevant history and unrelated files are excluded.

The Scoped Worker Contract pattern manifests through the .agent.md configuration files. These files define what an agent does, what tools it can use, and what constraints it operates under — a formal contract between the user (as kernel) and the agent (as worker).

The Copilot Coding Agent implements the Ephemeral Worker pattern directly. When assigned a GitHub issue, it spins up a fresh environment, works on a feature branch, creates a pull request, and terminates. No state persists between invocations.

Memory Plane

Copilot’s memory architecture implements a tiered model:

  • Working memory: The current conversation context, open files, and terminal state. Small, fast, ephemeral — lost when the session ends.
  • Episodic memory: Copilot’s persistent memory feature stores key decisions, user preferences, and project-specific conventions across sessions. These are compressed summaries, not full transcripts.
  • Semantic memory: The codebase itself, indexed by Copilot’s workspace indexing. This is the project’s long-term knowledge, searchable by semantic and structural queries.
  • Convention memory: Instructions defined in .github/copilot-instructions.md, .instructions.md, and skill files (.md files in the prompts folder). These encode project-specific knowledge that persists across sessions and users.

The Layered Memory pattern is directly visible. Working memory is hot and ephemeral. Convention memory is warm and curated. The codebase index is cool and comprehensive.

The Memory on Demand pattern appears in how Copilot retrieves context. Rather than preloading the entire codebase, the agent issues targeted searches — finding files by name, searching for symbols, reading specific line ranges — pulling information into working memory only when needed.

The Pointer Memory pattern is present in Copilot’s handling of large codebases. Rather than loading full file contents, the system often works with file paths, symbol references, and structural summaries, fetching full content only for the specific sections it needs to read or modify.

Operator Fabric

Every tool Copilot can invoke is an operator in the Agentic OS sense:

  • File operations: Read, write, search — scoped to the workspace.
  • Terminal commands: Execute shell commands with output capture.
  • Code diagnostics: Access compiler errors, linter warnings, and test results.
  • Git operations: Stage, commit, create branches, read diffs.
  • Web search: Fetch documentation or search for solutions.
  • MCP servers: External tools exposed through the Model Context Protocol.

The Tool as Operator pattern is fully realized. Each tool has typed inputs, typed outputs, and is invoked through a uniform interface. The system decides which tools to call based on the current task.

The Operator Registry pattern appears in how Copilot discovers available tools. The system maintains a catalog of built-in tools, MCP server capabilities, and VS Code extension-provided tools. Agents can be configured to access only a subset of this registry through tool restrictions in .agent.md files.

The Skill over Operators pattern is implemented through Copilot’s skill and prompt file system. A skill file (stored in the user’s prompts folder) composes multiple operators into a higher-level workflow — a reusable recipe for a specific type of task. The #prompt: reference in a chat invocation loads the skill’s instructions, guiding the agent through a prescribed sequence of operator invocations.

MCP integration implements the Operator Isolation pattern. MCP servers run as separate processes, each with its own error boundary. A failing MCP server does not crash Copilot — the system reports the tool failure and continues with alternative approaches.

Governance Plane

Copilot implements governance at multiple levels:

  • Capability scoping: Custom agents can be restricted to specific tools. The Copilot Coding Agent operates on a feature branch, never directly on main.
  • Human approval gates: The Coding Agent creates a pull request rather than merging directly. The pull request is the Permission Gate — execution pauses for human authorization before the irreversible action of merging.
  • Content filtering: Copilot applies content policies that prevent generation of harmful, insecure, or policy-violating code.
  • Audit trail: Every action the Coding Agent takes is visible in the PR timeline — file changes, tool invocations, reasoning traces.
  • Organization policies: Enterprise administrators can configure allowed models, enabled features, and content exclusions.

The Risk-Tiered Execution pattern is implicit. Simple code completions (low risk) execute immediately. Agent-mode edits (moderate risk) are applied to the editor where the user can review. Autonomous Coding Agent work (high risk) goes through a pull request with CI checks and human review.

The Least Privilege Agent pattern is enforced structurally. The Coding Agent runs in a sandboxed environment with only the permissions needed for its task. It cannot access secrets, production systems, or repositories beyond its scope.

Claude Code

Anthropic’s Claude Code takes a different architectural approach to the same problem space. Where Copilot integrates deeply into a graphical IDE, Claude Code operates as a terminal-native agentic system — an autonomous coding agent that lives in the command line.

Cognitive Kernel

Claude Code’s kernel loop is explicit and visible in the terminal. Every interaction follows a transparent cycle:

  1. Perceive: Read the user’s request and gather relevant context from the project.
  2. Interpret: Determine what kind of task this is and what approach to take.
  3. Plan: Reason about the steps needed, often displaying the plan to the user.
  4. Execute: Invoke tools — file reads, writes, shell commands, searches — in a reasoned sequence.
  5. Monitor: Check results after each action — did the file write succeed? Did the command produce expected output?
  6. Consolidate: Synthesize results into a coherent response or continue iterating.

The Intent Router pattern operates in Claude Code through its classification of request complexity. A question about code is answered directly. A request to modify code triggers a multi-step workflow with file reads, edits, and verification. A complex feature request produces structured planning with multiple tool invocations.

The Execution Loop Supervisor pattern is visible in Claude Code’s resource management. The system tracks token usage and provides cost feedback. Sessions have practical boundaries — the system does not run indefinitely.

The Reflective Retry pattern is a core behavior. When a shell command fails, Claude Code reads the error output, diagnoses the root cause, and adjusts. If a test fails after a code change, it reads the failure output, reasons about the cause, and modifies the code — often through multiple iterations until the tests pass.

Process Fabric

Claude Code’s process model differs significantly from Copilot’s. Rather than delegating to named subagents, Claude Code can spawn subagents — child instances of itself that execute focused tasks in parallel:

  • Each subagent receives a specific task (“search the codebase for all uses of this function”) with a curated context.
  • Subagents operate in their own context window, isolated from the parent’s full conversation history.
  • Results are returned to the parent, which consolidates them.

This implements the Subagent as Process pattern directly. Each subagent has bounded context (only its task and relevant input), a clear lifecycle (spawn, execute, return, terminate), and isolation (its reasoning does not pollute the parent’s context).

The Parallel Specialist Swarm pattern appears when Claude Code fans out multiple searches or analyses concurrently. Rather than sequentially checking each file, it can spawn parallel subagents to explore different parts of the codebase simultaneously.

The Context Sandbox pattern is fundamental to Claude Code’s architecture. Each subagent interaction starts with a clean context, receiving only the task description and explicitly provided information. The parent agent curates what each child sees.

Claude Code also supports custom slash commands defined in .claude/commands/ directories. These are parameterized templates that encode specific workflows — analogous to the Scoped Worker Contract pattern, where the command definition specifies what the agent should do, with what inputs, under what constraints.

Memory Plane

Claude Code implements a sophisticated multi-tiered memory system:

  • Working memory: The active conversation context. Grows over the session, managed by the system’s context window handling.
  • Project memory (CLAUDE.md): A file at the project root containing persistent project-level conventions, build instructions, coding standards, and key architectural decisions. Read automatically at the start of every session.
  • Directory memory (CLAUDE.md in subdirectories): Scoped memory for specific parts of the codebase — module-level conventions, local patterns, and directory-specific instructions. Loaded when the agent works in that directory.
  • User memory (~/.claude/CLAUDE.md): Personal preferences and conventions that apply across all projects — formatting preferences, workflow habits, language choices.

This is a clean implementation of the Layered Memory pattern. User memory is the broadest and most persistent tier. Project memory is scoped to the repository. Directory memory is scoped to a module. Working memory is ephemeral.
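The tier resolution can be sketched as a broadest-to-narrowest file lookup; this is an illustration of the hierarchy, not Claude Code's actual discovery logic:

```python
from pathlib import Path

def load_memory_layers(project_root: Path, work_dir: Path) -> list[str]:
    """Collect CLAUDE.md content from broadest to narrowest scope."""
    candidates = [
        Path.home() / ".claude" / "CLAUDE.md",    # user memory
        project_root / "CLAUDE.md",               # project memory
        work_dir / "CLAUDE.md",                   # directory memory
    ]
    return [p.read_text() for p in candidates if p.is_file()]
```

Narrower layers come later in the list, so they can refine or override the broader ones.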

The Memory on Demand pattern is how Claude Code manages larger codebases. It does not preload the entire project. Instead, it reads files, searches for symbols, and greps for patterns as needed — pulling information into working memory on demand.

The Operational State Board pattern is visible in how Claude Code tracks the state of multi-step tasks. It maintains a mental model of what has been done, what remains, and what the current blockers are, updating this model after each action.

The memory hierarchy also demonstrates the Compression Pipeline pattern at a human-curated level. Rather than storing full conversation transcripts, CLAUDE.md files contain compressed, curated summaries of the essential information — conventions, decisions, and patterns distilled from experience.

Operator Fabric

Claude Code’s tool set is its operator fabric:

  • File operations: Read, write, and edit files with precise string-matching edits.
  • Shell commands: Execute arbitrary terminal commands with output capture.
  • Search tools: Grep, glob, and structural search across the codebase.
  • Web fetch: Retrieve content from URLs for documentation or API references.
  • MCP servers: External tools integrated through the Model Context Protocol.
  • Notebook operations: Create, edit, and execute Jupyter notebook cells.

The Tool as Operator pattern applies uniformly. Every tool has a typed interface with explicit parameters and structured returns.

The Operator Registry pattern manifests through MCP configuration. Claude Code discovers available tools from MCP server declarations in .claude/settings.json and project-level configuration. The system presents available tools in its interface and selects appropriate ones based on the task.

The Composable Operator Chain pattern appears in Claude Code’s natural workflow. A typical operation chains: search for files → read relevant file → edit file → run tests → read test output → fix if needed. Each step’s output feeds the next step’s input, forming an observable pipeline.
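The pipeline shape can be sketched as function composition with a trace; real operators would be tool calls (search, read, edit, test), and the trace is what keeps the chain observable:

```python
def chain(*operators):
    """Compose operators so each step's output feeds the next step's input."""
    def run(value, trace=None):
        for op in operators:
            value = op(value)
            if trace is not None:
                trace.append((op.__name__, value))   # observable pipeline
        return value
    return run
```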

The Operator Isolation pattern is enforced through Claude Code’s permission system. Certain tools (file writes, shell commands) require user approval before execution. Tools that fail produce structured error messages rather than crashing the session.

Governance Plane

Claude Code implements governance through a layered permission and policy system:

  • Permission tiers: Tools are classified by risk. File reads and searches execute freely. File writes and shell commands require user approval (unless pre-approved in configuration). Some operations are blocked entirely.
  • Allowed/denied tool lists: Project-level configuration can restrict which tools Claude Code may use, implementing the Capability-Based Access pattern.
  • Pre-approved commands: Specific shell commands can be approved in advance (e.g., npm test, python -m pytest), while others require per-invocation approval. This is the Risk-Tiered Execution pattern applied at the operator level.
  • Human escalation: When Claude Code encounters ambiguity, permission boundaries, or high-risk decisions, it asks the user — packaging the context, options, and its recommendation for a human decision.
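The command tiering can be sketched as a simple policy check; the policy shapes below are hypothetical, while the real settings live in Claude Code's configuration files:

```python
# Hypothetical policy data; real policies come from project/user settings.
PRE_APPROVED = {"npm test", "python -m pytest"}
BLOCKED_PREFIXES = ("sudo ", "rm -rf /")

def classify_command(cmd: str) -> str:
    """Tier a shell command: auto-run, ask the user, or refuse outright."""
    if any(cmd.startswith(p) for p in BLOCKED_PREFIXES):
        return "blocked"
    if cmd in PRE_APPROVED:
        return "auto"
    return "needs-approval"
```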

The Permission Gate pattern is embedded in Claude Code’s tool invocation model. Every write operation and command execution passes through an approval checkpoint. The user can approve, deny, or modify before execution proceeds.

The Auditable Action pattern is implemented through the session transcript. Every tool invocation, its parameters, and its output are recorded in the conversation history, creating a complete audit trail of every action taken.

The Least Privilege Agent pattern is the default. Claude Code starts with limited permissions and requests escalation as needed. Users grant specific approvals rather than blanket access.

Contrast and Comparison

Both GitHub Copilot and Claude Code are Agentic Operating Systems for software development. Both implement the core layers. But they make fundamentally different architectural choices that lead to different strengths, tradeoffs, and user experiences.

Cognitive Kernel: IDE-Integrated vs. Terminal-Native

| Dimension | GitHub Copilot | Claude Code |
|---|---|---|
| Environment | Embedded in VS Code / Visual Studio / JetBrains | Terminal-native, editor-agnostic |
| Kernel visibility | Semi-transparent; reasoning is partially visible in agent mode | Fully transparent; every reasoning step and tool call is visible in the terminal |
| Intent routing granularity | Multi-modal: inline completions, chat, agent mode, Coding Agent — four distinct execution strategies for different complexity levels | Single mode with adaptive depth: the same interface handles everything from questions to multi-file refactors |
| Iteration model | Agent mode iterates within the editor; Coding Agent iterates asynchronously via PR | Iterates interactively in the terminal with user oversight at each step |

Pattern analysis: Both implement the Intent Router and Reflective Retry patterns, but Copilot’s router maps to four distinct execution modes (completion → chat → agent → Coding Agent), while Claude Code implements a single adaptive mode that scales its approach based on task complexity. Copilot’s model is closer to the Risk-Tiered Execution pattern applied at the kernel level — different risk tiers get different execution environments. Claude Code’s model is closer to a unified kernel with Staged Autonomy — the same kernel, with autonomy expanding as the user grants permissions.

Process Fabric: Named Specialists vs. Dynamic Subagents

| Dimension | GitHub Copilot | Claude Code |
|---|---|---|
| Worker model | Named, pre-configured agents (@workspace, custom .agent.md agents, Coding Agent) with explicit role definitions | Dynamic subagents spawned on-demand with task-specific instructions |
| Specialization | Static: each agent has a fixed identity, instruction set, and tool access | Dynamic: subagents are instantiated with context relevant to the current task |
| Isolation | Strong isolation in Coding Agent (separate environment, own branch); moderate isolation in custom agents (scoped tools) | Context-level isolation: each subagent has its own context window but shares the same runtime |
| Parallelism | Coding Agent runs asynchronously; custom agents invoked sequentially in chat | Subagents can be dispatched in parallel for concurrent exploration |

Pattern analysis: Copilot’s named agents map to the Reusable Worker Archetypes pattern — pre-defined templates (coder, tester, reviewer) that are instantiated consistently. Claude Code’s dynamic subagents map more directly to the Ephemeral Worker pattern — spawned for a specific task, given focused context, results collected, worker discarded. Copilot invests in archetype definition upfront; Claude Code invests in runtime context curation.

The Coding Agent is Copilot’s strongest implementation of the Subagent as Process pattern — a truly isolated process with its own environment, branch, and lifecycle, connected to the main system only through the PR interface.

Memory Plane: Multi-Source Configuration vs. File-Based Hierarchy

| Dimension | GitHub Copilot | Claude Code |
|---|---|---|
| Persistent memory | Copilot Memory (cloud-stored preferences), .github/copilot-instructions.md, .instructions.md files, skill files | CLAUDE.md files at user, project, and directory levels |
| Memory hierarchy | Organization → repository → directory → file → session | User → project → directory → session |
| Memory authoring | Multiple file types with different scopes and YAML frontmatter configuration | Single file format (CLAUDE.md) at different directory levels |
| Codebase knowledge | Workspace indexing with semantic search | On-demand search (grep, glob, file reads) |

Pattern analysis: Both implement Layered Memory, but with different tier boundaries. Copilot adds an organization tier (enterprise-level policies that cascade into all repositories) and uses rich indexing for semantic memory. Claude Code’s memory model is simpler and file-centric — the CLAUDE.md hierarchy is easy to understand, version-control, and share, but it does not include organizational-level memory.

Copilot’s workspace indexing is a stronger implementation of the Semantic Memory tier — pre-indexed, searchable by meaning. Claude Code’s on-demand search is more aligned with the Memory on Demand pattern — no pre-indexing, lower overhead, but potentially slower retrieval on very large codebases.

The Convention Memory concept from the Coding OS case study maps differently: Copilot distributes conventions across instruction files, prompt files, and skill definitions. Claude Code consolidates them into CLAUDE.md. The tradeoff is flexibility versus simplicity.

Operator Fabric: Extensible Platform vs. Direct Tool Access

| Dimension | GitHub Copilot | Claude Code |
|---|---|---|
| Built-in tools | File ops, terminal, diagnostics, search, code actions, VS Code API | File ops, terminal, search, web fetch, notebook ops |
| Extension model | VS Code extensions, MCP servers, GitHub Apps | MCP servers |
| Tool discovery | Automatic via VS Code extension API and MCP server registration | Configuration-based via .claude/settings.json and MCP config |
| IDE integration | Deep: inlined code suggestions, error decorations, diff previews, diagnostic access | Minimal: operates alongside the editor via terminal |

Pattern analysis: Copilot’s operator fabric is significantly larger due to VS Code’s extension ecosystem. Any VS Code extension can expose tools to Copilot, making the Operator Registry pattern a platform-level feature. Claude Code’s operator set is leaner but more uniform — every tool works the same way regardless of source.

Both support MCP servers, implementing Operator Isolation through process-level separation. But Copilot also implements Operator Adapters at scale — the VS Code extension API is an adapter layer that normalizes thousands of heterogeneous extensions into a uniform tool interface.

The Skill over Operators pattern appears in both systems but with different emphasis. Copilot’s skill/prompt files define multi-step workflows with rich metadata. Claude Code’s custom slash commands serve a similar purpose but with simpler structure — a parameterized prompt template rather than a full skill definition.

Governance Plane: Enterprise Governance vs. User-Centric Control

| Dimension | GitHub Copilot | Claude Code |
|---|---|---|
| Permission model | Organization policies → repository settings → user preferences | Per-session permissions with pre-approved command lists |
| Approval flow | Coding Agent: PR-based approval. Agent mode: inline user review | Per-tool-invocation approval with optional pre-approval |
| Audit trail | PR timeline, Copilot logs, organization audit logs | Session transcript |
| Trust boundaries | Organization-level content exclusions, model restrictions, feature toggles | Project-level tool restrictions, user-level settings |
| Enterprise features | SSO, seat management, usage analytics, IP indemnity, content exclusion policies | Configuration-file-based project settings |

Pattern analysis: This is the sharpest architectural divergence. Copilot implements the Governance Plane as a full enterprise system with organizational policy inheritance, centralized management, and compliance features. This maps directly to the Capability-Based Access and Signed Intent patterns at an organizational scale — policies flow from the organization through repositories to individual agents.

Claude Code implements governance as user-centric control. The individual developer is the governance authority. Permission decisions are local and immediate — approve this command, deny that file write. This maps to the Permission Gate and Human Escalation patterns at the individual interaction level.

The Copilot model is stronger for Governed Extensibility — new capabilities (extensions, agents, MCP servers) go through organizational approval flows before they become available. Claude Code’s model is more aligned with Staged Autonomy — the user progressively grants trust through pre-approved commands and permission settings.

Comparative Pattern Coverage

The following table maps the design patterns from Parts III and IV against their implementation in each system:

| Pattern | Copilot | Claude Code |
|---|---|---|
| Intent Router | Multi-modal (completions, chat, agent, Coding Agent) | Single adaptive mode |
| Planner-Executor Split | Visible in agent mode step display | Visible in terminal reasoning trace |
| Reflective Retry | Autonomous iteration with error analysis | Interactive iteration with visible diagnosis |
| Execution Loop Supervisor | Implicit via session and budget limits | Token tracking and cost reporting |
| Subagent as Process | Coding Agent (strong); custom agents (moderate) | Dynamic subagents with context isolation |
| Context Sandbox | Curated context from workspace indexing and instruction files | Task-specific context packages for subagents |
| Ephemeral Worker | Coding Agent: spins up, works, creates PR, terminates | Subagents: spawn, execute, return, terminate |
| Scoped Worker Contract | .agent.md files with tool restrictions | Custom slash commands with parameterized templates |
| Parallel Specialist Swarm | Limited; Coding Agent is single-agent | Subagent parallelism for concurrent exploration |
| Layered Memory | Multi-source (cloud memory, instruction files, workspace index) | File hierarchy (CLAUDE.md at multiple levels) |
| Memory on Demand | Semantic workspace search + targeted file reads | Grep, glob, and file reads on demand |
| Pointer Memory | File references and symbol navigation | File paths and line-range reads |
| Tool as Operator | Uniform tool interface via VS Code extension API | Uniform tool interface via built-in tools and MCP |
| Operator Registry | VS Code extensions + MCP servers (large catalog) | MCP servers (focused catalog) |
| Skill over Operators | Prompt/skill files with metadata | Custom slash commands |
| Operator Isolation | MCP process isolation + extension host isolation | MCP process isolation + permission gates |
| Capability-Based Access | Organization → repo → agent tool scoping | Project-level tool allow/deny lists |
| Permission Gate | PR-based approval for Coding Agent | Per-invocation tool approval |
| Least Privilege Agent | Scoped tool access per agent; sandbox for Coding Agent | Default-deny with explicit approval |
| Human Escalation | PR review; inline accept/reject in agent mode | Interactive approval in terminal |
| Auditable Action | PR timeline + organization audit logs | Session transcript |
| Risk-Tiered Execution | Four execution tiers (completion → chat → agent → Coding Agent) | Two tiers (auto-approved vs. requires-approval) |
| Staged Autonomy | Trust varies by execution mode | User progressively pre-approves commands |
| Reusable Worker Archetypes | Pre-defined agent roles via .agent.md | Not explicit; subagents are task-defined |
| Domain-Specific Agentic OS | Specialized for software development within an IDE | Specialized for software development via terminal |
| Governed Extensibility | Organization-level extension and feature management | Project-level configuration |

Synthesis

GitHub Copilot and Claude Code validate the Agentic OS model from opposite ends of the design spectrum:

Copilot is an enterprise-grade, IDE-embedded Agentic OS. It prioritizes deep integration with the developer’s visual environment, organizational governance, and a rich extension ecosystem. Its architecture favors the patterns that support platform-scale operation: Operator Registry, Capability-Based Access, Governed Extensibility, Risk-Tiered Execution, and Reusable Worker Archetypes. The tradeoff is system complexity — multiple configuration surfaces, multiple execution modes, and organizational policy layers.

Claude Code is a developer-centric, terminal-native Agentic OS. It prioritizes transparency, simplicity, and direct user control. Its architecture favors the patterns that support individual effectiveness: Ephemeral Worker, Context Sandbox, Memory on Demand, Permission Gate, Staged Autonomy, and Human Escalation. The tradeoff is that organizational governance and IDE integration are minimal.

Both prove the same thesis: building effective AI coding assistants requires the structures this book describes — a cognitive kernel that plans and adapts, a process fabric that isolates and coordinates work, a memory plane that manages knowledge across tiers, an operator fabric that provides governed access to tools, and a governance plane that enforces policy without killing throughput.

The convergence is not coincidental. These are not features chosen from a menu. They are the necessary structures that emerge when you try to build a system that autonomously acts on intent in a complex, tool-rich, high-stakes domain. The Agentic OS model did not predict Copilot and Claude Code specifically — it describes the architectural forces that made them inevitable.

The question for practitioners: Which pattern profile matches your needs? If you operate in an enterprise with organizational policies, multi-team repositories, and compliance requirements, the Copilot model — with its deep IDE integration, organizational governance, and platform extensibility — is the natural fit. If you prioritize transparency, direct control, and terminal-native workflows, Claude Code’s simpler architecture — with its file-based memory hierarchy, per-invocation approval, and visible reasoning — may be more effective. In either case, you are using an Agentic OS. The patterns are the same. The emphasis differs.

From Software Engineering to Intent Engineering

Software engineering is the discipline of turning requirements into working systems. It has matured over decades — from ad hoc coding to structured programming, from waterfall to agile, from monoliths to microservices. Each transition reflected a deeper understanding of what makes software succeed.

We are at another transition. The systems we are building no longer just execute code. They interpret intent, make decisions, and act autonomously. The discipline that builds these systems is not software engineering as we know it. It is something new.

Call it intent engineering.

The Shift in What We Build

In traditional software engineering, the human does the thinking and the computer does the executing. The programmer’s job is to translate a solution into a language the machine can follow. Every conditional, every loop, every data structure is an explicit instruction.

In agentic systems, the human expresses a goal and the system figures out how to achieve it. The engineer’s job shifts from writing instructions to designing the conditions under which good decisions emerge.

This is a fundamental change in the unit of work:

| | Software Engineering | Intent Engineering |
|---|---|---|
| Input | Requirements specification | Intent and constraints |
| Output | Deterministic program | Adaptive system |
| Design focus | Algorithms and data structures | Decision architectures |
| Quality measure | Correctness (does it do what the spec says?) | Alignment (does it do what was meant?) |
| Failure mode | Bugs (incorrect execution) | Misalignment (correct execution of wrong intent) |
| Testing | Deterministic assertions | Behavioral evaluation |
| Maintenance | Fix code | Evolve policies, skills, and memory |

What Intent Engineers Do

An intent engineer does not primarily write code — though code is part of the work. An intent engineer designs, builds, and maintains the systems that make agentic behavior reliable.

Design the Cognitive Architecture

How should the kernel interpret requests? What decomposition strategies apply to this domain? How deep should planning go? These are architectural decisions, but they are not about databases and message queues. They are about reasoning structures.

The intent engineer decides:

  • When the system should plan vs. act directly.
  • How much autonomy each task type warrants.
  • What context is needed for reliable decision-making.
  • Where the boundaries between agents should fall.

Craft Skills and Instructions

Skills are the domain knowledge that makes an agentic system competent. Writing a good skill — one that produces consistently high-quality results — is a design discipline.

It requires:

  • Deep domain understanding (what does “good” look like in this domain?).
  • Clarity of expression (can the model follow these instructions reliably?).
  • Empirical validation (do these instructions produce better results than alternatives?).
  • Iterative refinement (where do the instructions fail, and how can they be improved?).

This is not prompt engineering in the sense of clever tricks to get a model to do something. It is systematic design of the knowledge and strategy layer that guides model behavior.

Design Governance Policies

What should the system be allowed to do? What should it never do? What should require human approval? These questions are not afterthoughts — they are primary design decisions.

The intent engineer designs:

  • Risk classification schemes for actions.
  • Autonomy levels for different contexts.
  • Escalation flows for uncertain situations.
  • Audit requirements for accountability.

Build Evaluation Frameworks

How do you know the system is working well? In traditional software, you write unit tests. In agentic systems, evaluation is harder because correct behavior is often a matter of judgment, not a boolean.

The intent engineer builds:

  • Benchmark suites that test system behavior across representative scenarios.
  • Quality rubrics that score outputs on multiple dimensions (correctness, completeness, style, safety).
  • Regression detection that catches degradation in system performance over time.
  • A/B testing frameworks that compare system variants.
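A multi-dimensional rubric of the kind described can be sketched as a set of named checker functions; real rubrics would use model-graded or human judgments rather than the toy string checks here:

```python
def score_output(output: str, rubric: dict) -> dict:
    """Score an output on every rubric dimension (each checker returns 0..1)."""
    return {dim: check(output) for dim, check in rubric.items()}

def passes(scores: dict, floor: float = 0.7) -> bool:
    """A behavioral gate: every dimension must clear the floor."""
    return all(s >= floor for s in scores.values())
```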

Manage the Memory Lifecycle

What should the system remember? For how long? How should memories be organized, validated, and retired? The intent engineer designs the memory architecture and the processes that keep it healthy.

Skills of the Intent Engineer

Intent engineering draws from multiple existing disciplines but combines them in new ways:

From Software Engineering

  • Systems thinking: understanding component interactions and emergent behavior.
  • Interface design: defining clean boundaries between subsystems.
  • Testing discipline: systematic verification of behavior.
  • Operational awareness: building systems that can be monitored and debugged.

From Product Design

  • User empathy: understanding what operators actually need vs. what they say.
  • Interaction design: crafting how humans and systems collaborate.
  • Iterative design: building, testing, learning, refining.

From Cognitive Science

  • Mental models: understanding how the system’s reasoning works and fails.
  • Decision theory: designing environments where good decisions are likely.
  • Bias awareness: recognizing systematic reasoning failures and designing around them.

From Policy and Governance

  • Risk assessment: classifying and managing operational risk.
  • Compliance design: building systems that meet regulatory requirements by construction.
  • Accountability structures: ensuring actions can be traced and explained.

New Skills

  • Behavioral debugging: When the system produces a wrong result, diagnosing why — not in the code, but in the reasoning. Was the context wrong? Was the instruction ambiguous? Was the plan flawed? Was the governance too loose?
  • Instruction design: Writing instructions that produce reliable behavior across diverse inputs. This is harder than it sounds — natural language is ambiguous, and models are sensitive to phrasing.
  • Alignment verification: Confirming that the system’s actions match the operator’s intent, not just their words. This requires understanding what the operator meant, not just what they said.

The Intent Engineering Process

Intent engineering has its own development lifecycle:

```mermaid
flowchart LR
  IM[1. Intent\nModeling] --> AD[2. Architecture\nDesign]
  AD --> SD[3. Skill\nDevelopment]
  SD --> BT[4. Behavioral\nTesting]
  BT --> DM[5. Deployment &\nMonitoring]
  DM --> CR[6. Continuous\nRefinement]
  CR -.->|iterate| IM
```

1. Intent Modeling

Before building anything, model the intents the system must handle. What do operators ask for? What do they mean? What do they expect? What constraints are implicit?

This is the requirements phase, but the requirements are not features — they are goals, with all their ambiguity.

2. Architecture Design

Design the cognitive architecture: kernel behavior, decomposition strategies, worker types, memory structure, governance policies. This is the blueprint for decision-making, not for computation.

3. Skill Development

Build, test, and refine the skills that make the system competent in its domain. Each skill goes through cycles of design, testing, evaluation, and refinement.

4. Behavioral Testing

Test the system’s behavior across a wide range of scenarios. Not just “does it produce the right output” but “does it behave appropriately” — handling ambiguity, managing uncertainty, escalating when necessary, and staying within governance boundaries.

5. Deployment and Monitoring

Deploy the system with monitoring for behavioral quality. Track not just uptime and latency, but decision quality, alignment accuracy, and governance compliance.

6. Continuous Refinement

Use operational data to improve the system. Update skills based on failure analysis. Refine governance policies based on incident patterns. Expand memory based on recurring needs.

The Profession

Intent engineering is not a role that can be filled by a single person. It is a discipline practiced by teams that combine technical skill, domain knowledge, and design sensibility.

Today, this discipline is practiced informally — by prompt engineers, AI engineers, and product designers who are inventing the practice as they go. Tomorrow, it will be as structured as software engineering, with its own principles, patterns, certifications, and body of knowledge.

This book is one attempt to lay the foundation for that discipline.

The transition from software engineering to intent engineering is not a replacement — software engineering remains essential. It is an expansion. We are adding a new layer to the stack of how humans build useful systems. Software engineering builds the machine. Intent engineering teaches it to reason.

Designing Agency Responsibly

We are building systems that act. Not systems that respond, not systems that suggest — systems that do things in the world. They write code that runs in production. They send messages to customers. They make decisions that affect people’s work and lives. This is agency, and agency demands responsibility.

This chapter is not about AI ethics in the abstract. It is about the concrete design decisions that determine whether an agentic system is trustworthy.

What Agency Means

Agency is the capacity to act independently in pursuit of a goal. A thermostat has minimal agency — it acts (turns on the heater) independently (without human intervention) in pursuit of a goal (reaching the set temperature). An Agentic OS has significantly more: it interprets goals, plans multi-step strategies, chooses among alternatives, and adapts to feedback.

More agency means more capability. It also means more ways to cause harm.

The harm is rarely dramatic. Agentic systems are unlikely to “go rogue” in cinematic fashion. The real risks are mundane:

  • A system that optimizes for speed and deploys untested code.
  • A system that resolves support tickets by giving customers wrong information.
  • A system that automates away a decision that required human judgment.
  • A system that perpetuates a bias present in its training data or memory.
  • A system that leaks sensitive information by including it in a context window sent to a third-party model.

These are not hypothetical risks. They are engineering failures that happen when agency is granted without adequate design discipline.

Principles for Responsible Agency

1. Autonomy Must Be Earned, Not Assumed

No agentic system should start with full autonomy. The staged autonomy model (Chapter 24) is not just a safety mechanism — it is a trust-building protocol. The system begins with narrow autonomy, demonstrates reliability, and earns broader independence over time.

Trust escalation must be:

  • Observable: The operator can see what the system has done and how well.
  • Gradual: Autonomy increases in small, verifiable steps.
  • Revocable: Trust can be reduced at any time if the system’s performance degrades.
  • Scoped: Higher autonomy in one domain does not automatically grant it in another.
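These four properties can be made concrete in a small trust ledger; the domain names, level scale, and step-size rule are illustrative assumptions, not a prescribed mechanism:

```python
class TrustLedger:
    """Scoped, gradual, revocable autonomy grants."""

    def __init__(self):
        self.levels = {}                      # domain -> autonomy level 0..3

    def grant(self, domain, level):
        # Gradual: refuse jumps of more than one level at a time.
        current = self.levels.get(domain, 0)
        if level > current + 1:
            raise ValueError("autonomy must increase in small steps")
        self.levels[domain] = level

    def revoke(self, domain):
        self.levels[domain] = 0               # revocable at any time

    def allowed(self, domain, required):
        return self.levels.get(domain, 0) >= required   # scoped per domain
```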

2. Actions Must Be Explainable

An agentic system that cannot explain why it did something is an opaque risk. Every significant action must be traceable to:

  • The intent that motivated it.
  • The plan that included it.
  • The policy that permitted it.
  • The evidence that justified it.

This does not mean the system must produce a philosophical justification for every file read. It means the audit trail must be sufficient for a human to reconstruct the reasoning chain after the fact.

Explainability has a design cost — logging, structured plans, decision metadata. This cost is non-negotiable. A system that is fast but unexplainable is a system that will be shut down after its first serious error.

3. Harm Boundaries Must Be Hard

Governance policies come in two categories: preferences and boundaries.

  • Preferences are guidelines the system should follow but may deviate from when justified. “Keep pull requests under 300 lines” is a preference.
  • Boundaries are rules the system must never violate, regardless of context. “Never expose customer personal data in logs” is a boundary.

Hard boundaries must be enforced architecturally, not just instructionally. Telling a language model “never do X” is not enforcement — it is a suggestion. Enforcement means the tool layer physically cannot perform the action, or the governance middleware blocks it before execution.

Design hard boundaries for:

  • Data privacy violations.
  • Unauthorized access to systems.
  • Actions that bypass approval workflows.
  • Modifications to safety-critical systems without verification.
  • Resource consumption beyond defined limits.
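Architectural enforcement of a hard boundary can be sketched as middleware that runs before the tool does; the `ssn=` check is a toy stand-in for a real PII detector:

```python
class BoundaryViolation(Exception):
    pass

def redact_check(log_line: str) -> None:
    """Hypothetical hard rule: no customer personal data in logs."""
    if "ssn=" in log_line:                    # stand-in for a PII detector
        raise BoundaryViolation("customer data in logs")

def governed_log(write, line):
    """Middleware sketch: a violated boundary blocks execution entirely.

    The model never gets to 'decide' here; the action physically cannot run.
    """
    redact_check(line)
    write(line)
```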

4. The Human Must Remain in Control

Agency is delegated, not transferred. The human operator is always the ultimate authority. This means:

  • Kill switch: The operator can halt all system activity immediately.
  • Override: The operator can override any system decision.
  • Audit: The operator can review any action the system has taken.
  • Reconfiguration: The operator can change the system’s autonomy levels, policies, and boundaries at any time.

“The human is in control” is easy to say and hard to implement at scale. When an Agentic OS has dozens of active workers processing requests in parallel, what does “control” look like? It means:

  • Progress dashboards that show what is happening now.
  • Alert systems that surface anomalies in real time.
  • Approval queues that present decisions cleanly and allow batch processing.
  • Configuration interfaces that make policy changes immediate and system-wide.

Control that requires the operator to monitor every action defeats the purpose of agency. Control must be structural — built into the architecture so that the system’s default behavior is safe, and the operator intervenes only for exceptions.

5. Failure Must Be Visible

An agentic system that fails silently is more dangerous than one that fails loudly. When something goes wrong, the system must:

  • Detect the failure (through checking, validation, monitoring).
  • Report the failure (to the operator, in the audit log, through alerting).
  • Contain the failure (through circuit breakers, transaction rollback, sandboxing).
  • Learn from the failure (through memory updates, policy refinements).

The worst failure mode is one where the system does the wrong thing and everyone believes it did the right thing. This is why post-action validation, result verification, and human review points exist — not because the system is unreliable, but because even reliable systems occasionally fail, and undetected failures compound.

Design Patterns for Responsibility

Graduated Response

Instead of binary decisions (act or don’t act), use graduated responses:

  • Confidence > 95%: Act autonomously.
  • Confidence 70-95%: Act but flag for review.
  • Confidence 40-70%: Propose action, wait for approval.
  • Confidence < 40%: Report uncertainty, ask for guidance.

Thresholds are calibrated per domain. Production database changes might require 99% confidence for autonomous action. Documentation updates might require only 60%.
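A minimal sketch of this mapping, with hypothetical per-domain thresholds (the numbers and names below are assumptions to be calibrated, not recommendations):

```python
# Illustrative graduated-response mapping. Thresholds are per-domain
# assumptions; a real deployment calibrates them from observed outcomes.

DOMAIN_THRESHOLDS = {
    # (act autonomously, act-and-flag, propose) lower bounds per domain
    "production_db": (0.99, 0.95, 0.70),
    "documentation": (0.60, 0.50, 0.30),
    "default": (0.95, 0.70, 0.40),
}

def response_mode(confidence, domain="default"):
    """Map a confidence estimate to an action mode for the given domain."""
    act, flag, propose = DOMAIN_THRESHOLDS.get(domain, DOMAIN_THRESHOLDS["default"])
    if confidence > act:
        return "act_autonomously"
    if confidence > flag:
        return "act_and_flag"
    if confidence > propose:
        return "propose_and_wait"
    return "ask_for_guidance"
```

The same confidence produces different behavior in different domains: 0.97 acts autonomously by default but only acts-and-flags for production database changes.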

Red Team Process

For high-stakes systems, include a dedicated adversarial worker that reviews plans before execution:

  • “What could go wrong with this plan?”
  • “What assumptions are we making?”
  • “What is the worst-case outcome?”

The red team worker does not block execution for routine tasks. It activates when the risk assessment exceeds a threshold.

Consent Verification

For actions that affect people (sending messages, modifying accounts, making commitments), the system verifies that the operator has consented to the type of action, not just the specific instance:

  • “You have authorized this system to send customer communications. This action sends an email to 450 customers about a service disruption. Proceed?”

The verification message is specific enough for informed consent — it includes the scope, the audience, and the impact.
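One way to sketch the type-versus-instance distinction (ConsentRegistry and its methods are hypothetical names, not a real API): consent is granted once per action type, and each instance still produces a verification message carrying scope, audience, and impact.

```python
# Hypothetical sketch: consent per action *type*, confirmation per instance.

class ConsentRegistry:
    def __init__(self):
        self._granted = set()

    def grant(self, action_type):
        """Operator authorizes a category of action."""
        self._granted.add(action_type)

    def prompt_for(self, action_type, scope, audience, impact):
        """Build the instance-level verification message, or refuse."""
        if action_type not in self._granted:
            raise PermissionError(f"no consent for action type: {action_type}")
        return (
            f"You have authorized this system to {action_type}. "
            f"This action {scope} to {audience} about {impact}. Proceed?"
        )
```

With the registry granted "send customer communications", the prompt for a 450-recipient email reproduces the verification message quoted above.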

Impact Logging

Beyond audit trails for compliance, maintain impact logs that track the real-world consequences of actions:

  • “Deployed v3.2.1 → Error rate decreased from 2.1% to 0.3% → 4 related support cases resolved.”
  • “Sent pricing update email → 12 replies received, 3 negative, 9 neutral → No escalations.”

Impact logs close the feedback loop between actions and outcomes, enabling the system to calibrate its confidence and the operator to evaluate the system’s value.
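An impact log entry can be as simple as a record that chains an action to its observed outcomes. The schema below is an illustrative assumption, not a fixed standard:

```python
# Illustrative impact-log entry; the schema is an assumption.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ImpactEntry:
    action: str                                         # what the system did
    outcomes: list[str] = field(default_factory=list)   # observed consequences
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def summary(self) -> str:
        """Render the action-to-outcome chain for operator review."""
        return " → ".join([self.action, *self.outcomes])

entry = ImpactEntry(
    action="Deployed v3.2.1",
    outcomes=["Error rate decreased from 2.1% to 0.3%",
              "4 related support cases resolved"],
)
```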

The Ethics of Delegation

When you delegate work to an agentic system, you are not delegating responsibility. The human who authorizes the system to act remains responsible for its actions. This has implications:

  • Operators must understand what they are authorizing. The system must present its capabilities and limitations clearly. An operator who does not understand the system’s behavior cannot meaningfully authorize it.
  • Organizations must define accountability structures. Who is responsible when the system makes a mistake? The operator who authorized it? The intent engineer who designed it? The organization that deployed it? These questions must be answered before the system is deployed, not after an incident.
  • The system must not obscure its nature. When an agentic system communicates with humans (customers, colleagues, external parties), the recipients should know they are interacting with an automated system unless there is a compelling and disclosed reason otherwise.

Building Trust

Trust is the currency of agency. A system with high trust operates with high autonomy, delivering maximum value. A system with no trust operates with no autonomy, delivering no value. Everything in this book — the governance plane, the staged autonomy model, the audit trail, the explainability requirements — is in service of one goal: building and maintaining trust.

Trust is built slowly and lost quickly. One unexplained failure, one policy violation, one data leak can reset trust to zero. This asymmetry is why responsible design is not a feature to be added — it is the foundation on which everything else depends.

Design for trust. Build for transparency. Default to safety. Escalate to humans. And never, ever deploy an agentic system that you would not want to explain to the person affected by its actions.

The Future of Operational Intelligence

This book began with a shift: from building programs that execute instructions to building systems that pursue goals. It described an architecture — the Agentic OS — that makes this shift systematic. It cataloged patterns that make the architecture reusable. It walked through domains where the architecture produces value. And it outlined the engineering discipline — intent engineering — that makes it all work.

This final chapter looks forward. Not to predict — prediction in this field has a miserable track record — but to identify the trajectories that are already in motion and the questions they raise.

Trajectory 1: The Disappearing Interface

Today, agentic systems have explicit interfaces: chat windows, CLI commands, API endpoints. The operator tells the system what to do.

The trajectory points toward ambient agency: systems that observe, anticipate, and act without being asked. The Coding OS notices a test is flaky and fixes it before you see the failure. The Knowledge OS detects that a document conflicts with a recent code change and resolves the inconsistency. The Support OS sees a spike in related tickets and proactively drafts a status page update.

This is not science fiction. Each of these examples is a straightforward application of the patterns in this book: monitoring (process fabric), detection (kernel), action (workers), and governance (approval for non-trivial actions).

The design question is not can we build this, but should it act without being asked? The staged autonomy model provides the answer: earn the right to proactive behavior through demonstrated reliability. Systems that have consistently fixed flaky tests correctly can be trusted to do so proactively. Systems with a poor track record cannot.

The interface does not disappear entirely. It transforms from a command interface to a supervision interface — the operator monitors, adjusts, and approves rather than instructs.

Trajectory 2: Organizational Intelligence

Today, agentic systems serve individuals or small teams. The Coding OS helps a developer. The Research OS helps an analyst. The coordination between systems, as described in Chapter 34, is nascent.

The trajectory points toward organizational intelligence: networks of Agentic OSs that collectively embody an organization’s operational capability. The engineering department’s Coding OS, the product team’s Research OS, the HR department’s Knowledge OS, and the finance team’s Compliance OS — all federated, all coordinated, all governed by organizational policies.

At this scale, the Agentic OS model shows its deepest value. An organization’s intelligence is not the sum of its individual tools — it is the coordination between them. The federation patterns, governance hierarchies, and shared memory planes described in this book are the infrastructure for this coordination.

The design challenge at this scale is not technical but organizational. Who governs the meta-OS? How are conflicts between departmental policies resolved? How is organizational learning captured and distributed? These are questions of organizational design expressed through system architecture.

Trajectory 3: Cross-Organization Collaboration

Beyond organizational intelligence lies inter-organizational collaboration. Your company’s Procurement OS negotiates with a supplier’s Sales OS. A hospital’s Clinical OS consults a pharmaceutical company’s Drug Information OS. A government agency’s Compliance OS audits a company’s Financial OS.

This trajectory requires solving problems that do not yet have good solutions:

  • Trust across boundaries: How does OS A trust that OS B is behaving honestly? Cryptographic attestation, auditable computation, and reputation systems are candidate approaches.
  • Semantic interoperability: How do OSs from different organizations understand each other’s intents? Standard ontologies, negotiation protocols, and translation layers are needed.
  • Regulatory compliance: When two OSs from different jurisdictions collaborate, which regulations apply? The governance plane must handle multi-jurisdictional policy evaluation.

This trajectory is the furthest from reality but the most transformative. It would fundamentally change how organizations interact — from human-mediated negotiation to system-mediated coordination with human oversight.

Trajectory 4: Learning Systems

Today’s agentic systems learn within a session (adapting plans based on feedback) and across sessions (storing memories for future use). But the learning is shallow: the system remembers what worked, not why it worked.

The trajectory points toward deep learning at the system level:

  • Strategy learning: “When dealing with microservice decomposition, start with data boundaries, not functional boundaries — this produces cleaner APIs in 73% of cases.” The system does not just record the strategy; it derives it from accumulated experience.
  • Calibration learning: “My confidence estimates are 15% too high for database migration tasks. Adjust.” The system learns to know what it does not know.
  • Preference learning: “This team values readability over performance in non-critical paths.” The system infers preferences from feedback patterns, not explicit configuration.
  • Failure pattern learning: “When a test fails after a dependency update, the root cause is usually a breaking change in the dependency’s API, not a bug in our code.” The system builds causal models of failures.

These learning capabilities transform the Agentic OS from a system that executes with accumulated knowledge to one that develops expertise. The difference is significant — expertise includes knowing when the rules do not apply, when to deviate from standard practice, and when to ask for help.

Trajectory 5: Composable Intelligence

Today, building an Agentic OS requires significant custom development — even with the patterns and building blocks described in this book. Each deployment is a bespoke system.

The trajectory points toward composable intelligence: a marketplace of interoperable components — kernels, process fabrics, memory systems, governance engines, skills, tools — that can be assembled into domain-specific operating systems with minimal custom work.

Imagine:

agentic-os init --template=coding-os
agentic-os add skill python-backend
agentic-os add skill react-frontend
agentic-os add tool github
agentic-os add tool jira
agentic-os add policy soc2-compliance
agentic-os configure governance --staged-autonomy
agentic-os deploy

This is the “Linux distribution” model applied to agentic systems. A common kernel with domain-specific packages. Standard interfaces between components. A package ecosystem with community contributions.

We are far from this vision, but every pattern in this book — standardized interfaces, pluggable components, declarative policies, skill packages — is a step toward it.

Open Questions

These trajectories raise questions that the field must answer:

How Do We Measure Alignment?

We can measure whether code compiles. We can measure whether tests pass. How do we measure whether a system’s actions align with its operator’s intent? Alignment is partly a technical problem (better evaluation frameworks) and partly a philosophical one (what does “intent” mean when it is underspecified?).

How Do We Handle Compounding Errors?

A single error in a single action is manageable. But agentic systems execute chains of actions, each building on the previous. A small error in step 3 may compound into a catastrophic error by step 30. How do we detect compounding errors before they compound? The checking phase of the execution loop helps, but it is not sufficient when the error is subtle.

How Do We Govern Systems That Govern Themselves?

The governance plane enforces policies. But who governs the governance plane? As agentic systems become more autonomous, the policies that govern them must become more sophisticated. At some point, the governance system itself may need agentic capabilities — a meta-governance OS. This recursion has no obvious stopping point.

How Do We Distribute Agency Fairly?

Agentic systems amplify capability. Those with access to powerful Agentic OSs will be dramatically more productive than those without. How do we ensure this amplification does not deepen existing inequalities? This is not a technology question — it is a social question — but the technology’s designers must consider it.

How Do We Preserve Human Skill?

When an agentic system handles tasks that humans used to do, the humans’ skills in that area may atrophy. A developer who never debugs because the Coding OS does it automatically becomes a developer who cannot debug when the system fails. How do we maintain human capability alongside system capability?

What Remains Constant

Amid all this change, some things remain constant:

The OS analogy holds. As agentic systems grow in complexity, the need for the abstractions described in this book — process management, memory management, governance, scheduling, isolation — only increases. The analogy is not a metaphor that will be outgrown. It is a structural insight that becomes more relevant as the systems become more capable.

Architecture matters. The difference between a well-architected agentic system and an ad hoc one will grow, not shrink. As capabilities increase, the systems without governance will be the ones that cause incidents. The systems without memory will repeat mistakes. The systems without proper boundaries will leak data.

Humans remain essential. The goal of the Agentic OS is not to replace human judgment but to amplify it. Humans set the intent, define the boundaries, evaluate the outcomes, and adjust the system. The human is not a bottleneck to be removed — the human is the purpose the system serves.

Closing

We are in the early days of a transformation as significant as the invention of the operating system itself. The first computers had no operating systems — programs were loaded manually, one at a time, with no isolation, no scheduling, no abstraction. The operating system made computers useful by managing complexity.

The first agentic systems have no operating systems. They are prompts chained together, with no governance, no memory management, no process isolation, no principled scheduling. They work, barely, for simple tasks. They fail unpredictably for complex ones.

The Agentic OS is the operating system for intelligence. It manages complexity so that the intelligence can focus on what matters: understanding intent, solving problems, and producing results.

The architecture is clear. The patterns are identified. The building blocks are available. What remains is the work — the engineering discipline to build these systems well, the governance wisdom to deploy them responsibly, and the vision to imagine what becomes possible when operational intelligence is not a novelty but infrastructure.

That infrastructure is what this book has described. Now build it.

Appendix A: Mapping to Today’s Stack

This appendix bridges the Agentic OS architecture to the tools, frameworks, and SDKs available today. The field moves fast — specific versions and APIs will change — but the mapping from abstract architecture to concrete technology categories is durable.

The goal is not to recommend a specific stack but to show how each architectural layer maps to real implementation choices, so you can evaluate your own tools against the model.

Cognitive Kernel

The kernel — intent routing, planning, decomposition, scheduling — maps to orchestration frameworks.

Frameworks

| Framework | Kernel Capabilities | Best For |
| --- | --- | --- |
| LangGraph (LangChain) | Stateful graph-based orchestration, conditional routing, tool calling, human-in-the-loop | Complex multi-step workflows with branching logic and state persistence |
| Semantic Kernel (Microsoft) | Planner with automatic function composition, plugin model, multi-model orchestration | .NET/Python systems needing planning over a plugin ecosystem |
| AutoGen (Microsoft) | Multi-agent conversation patterns, group chat orchestration, code execution | Research and prototyping of multi-agent systems |
| CrewAI | Role-based agent teams, task delegation, sequential and parallel workflows | Team-of-agents scenarios with defined roles (researcher, writer, reviewer) |
| OpenAI Assistants API | Built-in threads, tool use, code interpreter, file handling | Single-agent applications leveraging OpenAI’s managed infrastructure |
| Amazon Bedrock Agents | Managed agent orchestration, action groups, knowledge bases | AWS-native deployments needing managed agent infrastructure |
| Google ADK (Agent Development Kit) | Multi-agent systems, tool use, orchestration | Google Cloud-native agent systems |

Implementation Guidance

Start with a single orchestration framework. The kernel abstraction does not require building a custom orchestrator from scratch. The reference implementations in Part VI use the Microsoft Agent Framework (Semantic Kernel) — it provides ChatCompletionAgent for individual workers, orchestration patterns (Sequential, Concurrent, Handoff, GroupChat) for coordination, and plugins (@kernel_function) for tool integration.

The kernel loop (perceive → interpret → plan → delegate → monitor → consolidate → adapt) maps to orchestration patterns. In Semantic Kernel, the orchestration type determines the coordination model:

flowchart LR
  subgraph AOS["Agentic OS Concept"]
    IR[Intent Router]
    PL[Planner]
    WK[Workers]
    Pipe[Pipeline]
    Fan[Fan-Out / Fan-In]
    AR[Adversarial Review]
    DR[Dynamic Routing]
    TO[Tools / Operators]
    GV[Governance]
  end
  subgraph SK["Semantic Kernel"]
    HO[HandoffOrchestration]
    PA[Agent with planning instructions]
    CCA[ChatCompletionAgent + scoped plugins]
    SO[SequentialOrchestration]
    CO[ConcurrentOrchestration]
    GCO[GroupChatOrchestration]
    HO2[HandoffOrchestration]
    PLG["Plugins (@kernel_function) + MCP"]
    FF["Function filters (on_function_invocation)"]
  end
  IR --> HO
  PL --> PA
  WK --> CCA
  Pipe --> SO
  Fan --> CO
  AR --> GCO
  DR --> HO2
  TO --> PLG
  GV --> FF

Process Fabric

The process fabric — worker lifecycle, sandboxing, isolation — maps to agent runtimes and execution environments.

Approaches

| Approach | Isolation Level | Use When |
| --- | --- | --- |
| In-process agents (LangGraph nodes, Semantic Kernel functions) | None — shared memory space | Trusted workers, low security requirements, maximum speed |
| Containerized workers (Docker, Kubernetes Jobs) | Process-level isolation | Workers need distinct dependencies, security boundaries, or resource limits |
| Serverless functions (AWS Lambda, Azure Functions) | Function-level isolation | Short-lived, stateless workers with burst scaling needs |
| Code sandboxes (E2B, Modal, Docker-in-Docker) | Full sandboxing | Workers execute untrusted code; security is critical |
| MCP servers (Model Context Protocol) | Service-level isolation | Workers connect to external tools through a standardized protocol |

Worker Contracts in Practice

The scoped worker contract maps to how you configure an agent’s system prompt, tools, and constraints:

# Semantic Kernel example (illustrative; the exact agent-creation API varies by SK version)
worker = kernel.create_agent(
    name="code_reviewer",
    instructions="Review the code diff for quality, security, and style issues.",
    tools=[file_read, git_diff, comment_create],  # scoped capabilities
    max_tokens=4000,                              # resource envelope
    temperature=0.1,                              # determinism preference
)
# LangGraph example
def code_review_node(state):
    """Scoped worker contract: reviews code with specific tools."""
    scoped_model = model.bind_tools([file_read, git_diff])  # capability scoping
    return scoped_model.invoke(
        [SystemMessage(content=REVIEW_INSTRUCTIONS), *state["messages"]]
    )

Memory Plane

The memory plane — working, episodic, semantic, procedural memory — maps to storage and retrieval systems.

Technology Mapping

| Memory Tier | Technology Options | Key Considerations |
| --- | --- | --- |
| Working Memory | In-context (prompt), state objects (LangGraph state, thread messages) | Limited by context window; assemble carefully |
| Episodic Memory | PostgreSQL, MongoDB, Redis (structured event storage) | Schema should capture: event, timestamp, outcome, metadata |
| Semantic Memory | Vector databases (Pinecone, Weaviate, Qdrant, Chroma, pgvector) | Embedding model choice affects retrieval quality; chunk size matters |
| Procedural Memory | Document stores, skill registries, instruction databases | Version controlled; retrievable by task type |
| Cross-session State | Redis, DynamoDB, PostgreSQL with JSONB | Must survive process restarts; keyed by session/user/project |

Retrieval-Augmented Generation (RAG)

The Memory on Demand pattern maps directly to RAG. Implementation choices:

  • Embedding model: Use a model matched to your domain. OpenAI text-embedding-3-large, Cohere embed-v4, or open-source alternatives (BGE, E5).
  • Chunking strategy: Chunk by semantic unit (paragraph, function, section), not by fixed token count. Overlap chunks by 10-20% for context continuity.
  • Retrieval: Hybrid search (vector similarity + keyword BM25) outperforms either alone. Tools like LangChain retrievers, LlamaIndex, or direct vector DB queries implement this.
  • Reranking: After initial retrieval, rerank results with a cross-encoder (Cohere Rerank, Jina Reranker) to improve precision before inserting into the context window.
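The chunking guidance above can be sketched in a few lines. This toy version counts words rather than model tokens and splits blindly rather than by semantic unit, so treat it as a shape, not an implementation:

```python
# Toy overlapping chunker. Real systems split on semantic units
# (paragraphs, functions, sections) and count model tokens, not words.

def chunk_with_overlap(words, chunk_size=100, overlap_ratio=0.15):
    """Return word-count chunks whose tails repeat at the head of the next chunk."""
    overlap = int(chunk_size * overlap_ratio)  # 10-20% overlap for continuity
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break
    return chunks
```

With a 100-word chunk and 15% overlap, the last 15 words of each chunk reappear at the start of the next, so a sentence split across a boundary is retrievable from either side.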

Memory in Multi-Agent Systems

When multiple agents need shared memory, use an Operational State Board backed by a shared data store:

# Shared state via LangGraph
from typing import TypedDict

from langgraph.graph import StateGraph

class TaskState(TypedDict):
    plan: list[str]
    completed: list[str]
    findings: dict[str, str]
    blockers: list[str]

Governance Plane

The governance plane — policies, permissions, audit, approval gates — maps to guardrails, observability, and authorization systems.

Technology Mapping

| Governance Function | Technology Options |
| --- | --- |
| Policy enforcement (input/output) | Guardrails AI, NeMo Guardrails, custom middleware |
| Content filtering | Azure AI Content Safety, OpenAI Moderation API, Lakera Guard |
| Permission management | OPA (Open Policy Agent), Cedar (AWS), custom RBAC |
| Approval workflows | Slack/Teams integrations, custom approval UIs, human-in-the-loop nodes in LangGraph |
| Audit logging | Structured logging (OpenTelemetry), LangSmith, Arize Phoenix, Langfuse |
| Cost tracking | LLM provider dashboards, custom token counters, LangSmith cost tracking |
| Observability | LangSmith, Langfuse, Arize Phoenix, Weights & Biases Weave, OpenLLMetry |

Implementation: Governance as Middleware

The governance middleware pattern maps to interceptors or callbacks that wrap tool and model calls:

# Guardrails as middleware (pseudo-code)
class GovernanceMiddleware:
    def before_tool_call(self, tool, args, worker_context):
        # Capability check
        if tool.name not in worker_context.allowed_tools:
            raise PermissionDenied(f"Worker lacks capability: {tool.name}")
        # Risk check
        if tool.risk_level == "high" and not worker_context.has_approval:
            return request_human_approval(tool, args)
        # Budget check
        if worker_context.budget_remaining <= 0:
            raise BudgetExhausted()
        # Audit
        log_action(tool, args, worker_context)

    def after_tool_call(self, tool, args, result, worker_context):
        # Output validation
        validate_output(result, tool.output_schema)
        # Audit
        log_result(tool, result, worker_context)

Human-in-the-Loop

Approval gates map to interrupt nodes in orchestration frameworks:

  • LangGraph: interrupt_before / interrupt_after on graph nodes. The graph pauses, persists state, and resumes after human approval.
  • Semantic Kernel: Filters and function invocation handlers that can pause execution.
  • Custom: Webhook-based approval flows that pause a task and resume on callback.
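The custom webhook variant can be sketched as a pause/resume queue (ApprovalQueue and its methods are hypothetical names): the task persists its state under a token, the token is embedded in the approval link, and the webhook callback resumes or discards the task.

```python
# Hypothetical sketch of a webhook-based approval gate.
import uuid

class ApprovalQueue:
    def __init__(self):
        self._pending = {}  # token -> saved task state

    def pause(self, task_state):
        """Persist the task and return a token for the approval link."""
        token = str(uuid.uuid4())
        self._pending[token] = task_state
        return token

    def resume(self, token, approved):
        """Called by the webhook; return the saved state, or None if rejected."""
        state = self._pending.pop(token)
        return state if approved else None
```

In production the pending map would live in durable storage (the same store that holds cross-session state), so an approval can arrive hours later and still resume the task.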

Tool & Skill Layer

Tools map to function calling, MCP servers, and API integrations.

Model Context Protocol (MCP)

MCP is emerging as the standard protocol for connecting agents to tools. Key properties:

  • Standardized interface: Tools expose capabilities through a uniform protocol (JSON-RPC over stdio or HTTP).
  • Discovery: Agents discover available tools through the MCP server’s capability listing.
  • Isolation: Each MCP server runs as an independent process with its own permissions.
  • Ecosystem: Growing registry of MCP servers for common integrations (GitHub, databases, file systems, web search).

MCP maps directly to the Operator Fabric’s tool registry pattern: tools declare their inputs, outputs, and capabilities; the kernel discovers and selects them at runtime.

Function Calling

All major model providers support function calling (tool use):

  • OpenAI: tools parameter with JSON schema definitions.
  • Anthropic: tools parameter with input schema.
  • Google (Gemini): Function declarations with parameter schemas.
  • Azure OpenAI: Same as OpenAI, with enterprise security layers.

Function calling is the atomic mechanism. MCP and orchestration frameworks build higher-level abstractions on top of it.

Model Provider Layer

The model provider abstraction maps to LLM APIs and routing layers.

Multi-Model Strategy

| Task Type | Model Tier | Examples |
| --- | --- | --- |
| Classification, routing | Small/fast | GPT-4.1 mini, Claude Haiku, Gemini Flash |
| Code generation, analysis | Medium | GPT-4.1, Claude Sonnet, Gemini Pro |
| Complex reasoning, planning | Large | Claude Opus, o3, Gemini Ultra |
| Embeddings | Embedding-specific | text-embedding-3-large, Cohere embed-v4 |

Model Routers

Tools for implementing the model provider abstraction:

  • LiteLLM: Unified API across 100+ LLM providers with fallback, load balancing, and cost tracking.
  • OpenRouter: Multi-provider routing with automatic failover.
  • Custom routing: Select model based on task type, cost budget, and required capabilities.
# Model selection based on task (pseudo-code)
def select_model(task_type, budget):
    if task_type == "classification":
        return "gpt-4.1-mini"  # fast, cheap
    elif task_type == "planning" and budget.allows("premium"):
        return "claude-opus-4"  # best reasoning
    elif task_type == "code_generation":
        return "claude-sonnet-4"  # strong code, moderate cost
    else:
        return "gpt-4.1"  # good default

Putting It Together: A Starter Architecture

For a team building their first Agentic OS, here is a practical starting stack:

| Layer | Starting Choice | Why |
| --- | --- | --- |
| Kernel | Semantic Kernel (Python) | Agent Framework with built-in orchestration patterns (Sequential, Handoff, GroupChat), plugin model, and multi-model support |
| Process Fabric | ChatCompletionAgent + E2B for code execution | Each agent is a scoped worker with its own plugins; E2B for sandboxed code execution |
| Memory | pgvector (semantic) + PostgreSQL (episodic) | Single database for both vector and structured storage |
| Governance | SK function filters + Langfuse (observability) | Filters enforce policies at every function call; Langfuse for tracing and cost tracking |
| Tools | SK Plugins (@kernel_function) + MCP servers | Plugins for local tools; MCP servers for isolated/external tools |
| Models | Azure OpenAI (via SK connectors) | Native SK integration; multi-model via service selection |

This stack can be deployed as a single application initially and decomposed into services as scale demands.

What This Mapping Is Not

This appendix maps architecture to tools, not tools to architecture. Do not pick a tool and design the architecture around it. Design the architecture from the principles in this book, then select tools that implement each layer.

Tools change yearly. The architecture endures. If you find yourself locked into a specific framework’s patterns, you have coupled too tightly. The reference architecture’s layer boundaries exist precisely so that you can replace any tool without rebuilding the system.

Appendix B: Platform Landscape and Governance Standards

The agentic systems landscape is evolving rapidly. This appendix maps the current ecosystem — platforms, standards, and governance frameworks — to help practitioners orient their architectural decisions within the broader industry context.

This appendix will age faster than any other part of this book. Use it as a snapshot of the landscape at the time of writing and as a framework for evaluating new entries as they appear.

Agent Platforms

The market has stratified into distinct categories:

Foundation Model Providers with Agent Capabilities

These companies provide the underlying models and are adding agent infrastructure directly to their APIs.

| Provider | Agent Offering | Strengths | Limitations |
| --- | --- | --- | --- |
| OpenAI | Assistants API, GPT Actions, Function Calling | Mature API, code interpreter, file handling, managed threads | Vendor lock-in; limited orchestration flexibility |
| Anthropic | Claude tool use, computer use, extended thinking | Strong reasoning, large context (200K+), careful safety design | No managed agent infrastructure; BYO orchestration |
| Google | Gemini + ADK (Agent Development Kit), Vertex AI Agents | Multi-modal, long context (2M), tight GCP integration | Ecosystem still maturing |
| Amazon | Bedrock Agents, action groups, knowledge bases | Multi-model support, AWS integration, managed infrastructure | AWS-centric; less flexibility for multi-cloud |
| Microsoft | Azure AI Agent Service, Copilot Studio | Enterprise integration (M365, Dynamics), Semantic Kernel | Complex licensing; enterprise-focused |

Orchestration Frameworks

These are the open-source and commercial frameworks for building agent systems.

| Framework | Architecture | Community | Production Readiness |
| --- | --- | --- | --- |
| LangGraph | Graph-based state machines | Large (LangChain ecosystem) | High — used in production by many companies |
| Semantic Kernel | Plugin-based with planners | Growing (Microsoft backing) | High — production-grade with enterprise support |
| AutoGen | Conversation-based multi-agent | Active research community | Medium — strong for research, evolving for production |
| CrewAI | Role-based agent teams | Growing rapidly | Medium — maturing quickly |
| LlamaIndex | Data-focused agent workflows | Large | High for RAG-centric applications |
| Haystack | Pipeline-based NLP/agent workflows | Established | High — production-tested |

Agent Infrastructure

These platforms provide the runtime infrastructure for agent systems.

| Platform | Focus | Key Capability |
| --- | --- | --- |
| LangSmith | Observability, testing, evaluation | End-to-end tracing, prompt playground, dataset management |
| Langfuse | Open-source LLM observability | Self-hostable, cost tracking, prompt management |
| Arize Phoenix | LLM observability and evaluation | Traces, evaluations, embedding analysis |
| E2B | Code sandboxing | Secure code execution environments for agents |
| Modal | Serverless compute | GPU-enabled serverless for agent workloads |
| Weights & Biases Weave | Experiment tracking | LLM application monitoring and evaluation |

Emerging Standards

Model Context Protocol (MCP)

Origin: Anthropic (open-sourced November 2024)
Status: Rapidly adopted across the industry
Purpose: Standardized protocol for connecting AI models to external tools and data sources

MCP is the most significant standardization effort in the agent tooling space. It maps directly to the Operator Fabric’s tool registry and tool invocation patterns:

  • Resources: Expose data to the agent (files, database records, API responses).
  • Tools: Expose actions the agent can take (create file, run query, send message).
  • Prompts: Expose reusable prompt templates.
  • Sampling: Allow servers to request LLM completions from the host.

Architectural significance: MCP decouples tool implementation from agent implementation. A tool built as an MCP server works with any MCP-compatible agent, regardless of the orchestration framework. This is the Operator Adapter pattern implemented as an industry standard.
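To make the decoupling concrete, here is a minimal sketch of the Operator Adapter idea in plain Python. The `ToolDescriptor` and `ToolRegistry` names are illustrative, not part of MCP itself: the point is that a protocol-neutral tool description (as an MCP server might expose) can be registered once and invoked by any agent that talks to the registry.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ToolDescriptor:
    """Protocol-neutral tool description, as an MCP server might expose it."""
    name: str
    description: str
    input_schema: dict  # JSON Schema for the tool's arguments

class ToolRegistry:
    """Minimal Operator Fabric registry: adapters register tools by name."""
    def __init__(self) -> None:
        self._tools: dict[str, tuple[ToolDescriptor, Callable[..., Any]]] = {}

    def register(self, desc: ToolDescriptor, handler: Callable[..., Any]) -> None:
        self._tools[desc.name] = (desc, handler)

    def invoke(self, name: str, **kwargs: Any) -> Any:
        desc, handler = self._tools[name]
        # Enforce the declared schema's required fields before dispatch.
        missing = [k for k in desc.input_schema.get("required", []) if k not in kwargs]
        if missing:
            raise ValueError(f"missing required arguments: {missing}")
        return handler(**kwargs)

# Registering a tool once makes it available to any agent that talks
# to the registry, regardless of orchestration framework.
registry = ToolRegistry()
registry.register(
    ToolDescriptor(
        name="read_file",
        description="Read a file from the workspace",
        input_schema={"type": "object", "required": ["path"]},
    ),
    handler=lambda path: f"<contents of {path}>",
)
print(registry.invoke("read_file", path="README.md"))
```

An MCP adapter would populate `ToolDescriptor` from the server's advertised tool list; the agent side never needs to know which protocol delivered the tool.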

OpenAI Function Calling Schema

Status: De facto standard adopted by most providers
Purpose: Standard format for declaring tool schemas that models can invoke

Most model providers have converged on a JSON Schema-based format for function calling. This near-standard enables portable tool definitions:

{
  "name": "search_codebase",
  "description": "Search the codebase for files matching a pattern",
  "parameters": {
    "type": "object",
    "properties": {
      "query": { "type": "string", "description": "Search pattern" },
      "max_results": { "type": "integer", "default": 10 }
    },
    "required": ["query"]
  }
}
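On the receiving side, the host validates model-proposed arguments against this schema before dispatching the call. A production system would use a full JSON Schema library; this hypothetical sketch only checks required keys, the basic types used above, and fills declared defaults.

```python
# Map the JSON Schema type names used above to Python types.
TYPE_MAP = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def validate_call(schema: dict, args: dict) -> dict:
    """Check args against a function-calling schema; return validated args."""
    params = schema["parameters"]
    props = params.get("properties", {})
    for key in params.get("required", []):
        if key not in args:
            raise ValueError(f"missing required parameter: {key}")
    validated = {}
    for key, value in args.items():
        spec = props.get(key)
        if spec is None:
            raise ValueError(f"unknown parameter: {key}")
        if not isinstance(value, TYPE_MAP[spec["type"]]):
            raise TypeError(f"{key}: expected {spec['type']}")
        validated[key] = value
    for key, spec in props.items():  # apply declared defaults
        if key not in validated and "default" in spec:
            validated[key] = spec["default"]
    return validated

schema = {
    "name": "search_codebase",
    "description": "Search the codebase for files matching a pattern",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search pattern"},
            "max_results": {"type": "integer", "default": 10},
        },
        "required": ["query"],
    },
}
print(validate_call(schema, {"query": "TODO"}))
# {'query': 'TODO', 'max_results': 10}
```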

Agent-to-Agent Protocols

Status: Early stage
Purpose: Standardized communication between independent agent systems

Several proposals are emerging for agent-to-agent communication:

  • Google A2A (Agent-to-Agent): Protocol for agent interoperability, discovery, and task delegation between independent agent systems.
  • AGNTCY / ACP (Agent Communication Protocol): Open standards initiative for inter-agent messaging.

These map to the Multi-OS Coordination patterns (Chapter 34) — federation bus, capability discovery, and cross-OS messaging. The standards are nascent, but the architectural patterns are stable.

Governance and Safety Standards

Regulatory Landscape

| Regulation / Framework | Jurisdiction | Agent-Relevant Requirements |
|---|---|---|
| EU AI Act | European Union | Risk classification for AI systems; high-risk systems require conformity assessment, human oversight, transparency, and record-keeping |
| NIST AI RMF | United States | Risk management framework: govern, map, measure, manage. Voluntary but influential |
| ISO/IEC 42001 | International | AI management system standard. Certification for responsible AI practices |
| Executive Order 14110 | United States | Requirements for AI safety testing, red-teaming, and reporting for frontier models |
| Singapore AI Governance Framework | Singapore | Principles-based governance with practical implementation guidance |

How the Agentic OS Maps to Regulatory Requirements

| Regulatory Requirement | Agentic OS Component |
|---|---|
| Human oversight | Permission Gates, Human Escalation, Staged Autonomy |
| Transparency | Execution Journal, Auditable Action, Active Plan Board |
| Risk management | Risk-Tiered Execution, Policy-Aware Scheduler, Governance Plane |
| Record-keeping | Audit logging in the Governance Plane, Execution Journal |
| Robustness | Failure Containment, Recovery Process, Checkpoints and Rollback |
| Data governance | Memory Plane scoping, Capability-Based Access, data classification at boundaries |

The Agentic OS architecture is not designed for compliance with any specific regulation. It is designed around principles — governance, transparency, accountability, isolation — that happen to align well with what regulators require. This is not coincidence; well-engineered systems and well-designed regulations both derive from the same insight: autonomous systems need structure.

Industry Safety Frameworks

| Framework | Focus | Relevance |
|---|---|---|
| OWASP Top 10 for LLM Applications | Security vulnerabilities specific to LLM-powered applications | Directly applicable: prompt injection, data leakage, excessive agency, insecure plugins |
| MLCommons AI Safety Benchmarks | Standardized safety evaluations for AI models | Useful for evaluating model providers in the Model Provider Layer |
| Anthropic Responsible Scaling Policy | Framework for scaling AI capabilities safely | Informs Staged Autonomy and Risk-Tiered Execution patterns |
| MITRE ATLAS | Adversarial threat landscape for AI systems | Threat modeling for the Governance Plane |

Practical Governance Checklist

For teams deploying agentic systems, a minimum governance implementation:

  • Audit trail: Every agent action is logged with timestamp, inputs, outputs, and authorization context.
  • Human-in-the-loop: Irreversible actions require human approval. The system can halt on demand.
  • Cost controls: Per-task and per-session budgets with automatic cutoff.
  • Capability scoping: Each worker has explicit tool permissions. No worker has unconstrained access.
  • Output validation: Generated outputs (code, communications, data modifications) are validated before delivery.
  • Incident response: A process exists for investigating and responding to agent misbehavior.
  • Data boundaries: Sensitive data is classified and scoped. Data does not leak across security boundaries.
  • Model evaluation: Regular evaluation of model outputs against quality and safety benchmarks.

Evaluation and Testing Ecosystem

Evaluation Frameworks

| Framework | Purpose |
|---|---|
| LMSYS Chatbot Arena | Crowd-sourced model comparison via blind pairwise evaluation |
| HELM (Stanford) | Holistic evaluation of language models across scenarios and metrics |
| SWE-bench | Evaluating agent capability on real-world software engineering tasks |
| AgentBench | Cross-environment benchmark for agent capabilities |
| Inspect AI (UK AISI) | Framework for evaluating AI system capabilities and safety properties |

Testing Agentic Systems

Traditional software testing assumes deterministic behavior. Agentic systems require additional testing strategies:

  • Behavioral benchmarks: Curated test suites that evaluate the system’s behavior across representative scenarios, scored on multiple dimensions (correctness, safety, efficiency).
  • Regression detection: Compare system outputs before and after changes. Flag significant behavioral differences using LLM-as-judge evaluations.
  • Red teaming: Adversarial testing where evaluators attempt to cause the system to violate its governance policies, leak data, or produce harmful outputs.
  • Simulation testing: Run the system against simulated environments and users to test behavior at scale without real-world consequences.
  • Cost benchmarking: Track tokens consumed, latency, and monetary cost per task type. Detect efficiency regressions.
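The cost-benchmarking strategy is straightforward to sketch. This hypothetical `CostBenchmark` records dollar cost per task type, freezes a baseline, and flags task types whose average cost later drifts past a threshold; the 1.2× threshold and all names are illustrative.

```python
from collections import defaultdict
from statistics import mean

class CostBenchmark:
    """Track cost per task type and flag efficiency regressions
    against a recorded baseline."""

    def __init__(self, regression_threshold: float = 1.2) -> None:
        self.samples: dict[str, list[float]] = defaultdict(list)
        self.baseline: dict[str, float] = {}
        self.threshold = regression_threshold

    def record(self, task_type: str, tokens: int, latency_s: float,
               cost_usd: float) -> None:
        # Tokens and latency would get the same treatment; cost is shown here.
        self.samples[task_type].append(cost_usd)

    def freeze_baseline(self) -> None:
        """Snapshot current averages as the baseline, then start fresh."""
        self.baseline = {t: mean(c) for t, c in self.samples.items()}
        self.samples.clear()

    def regressions(self) -> list[str]:
        """Task types whose average cost exceeds baseline * threshold."""
        flagged = []
        for task_type, costs in self.samples.items():
            base = self.baseline.get(task_type)
            if base and mean(costs) > base * self.threshold:
                flagged.append(task_type)
        return flagged

bench = CostBenchmark()
bench.record("summarize", tokens=1200, latency_s=2.1, cost_usd=0.010)
bench.freeze_baseline()
bench.record("summarize", tokens=2600, latency_s=4.0, cost_usd=0.015)
print(bench.regressions())  # ['summarize'] — 50% over baseline
```

Running this check in CI after every prompt or model change turns cost regressions from a billing surprise into a failed build.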

The number of tools, frameworks, and standards is overwhelming and growing. Three principles help navigate it:

  1. Architecture over tools. Choose your architecture first (this book provides one). Then select tools that implement each layer. Do not let a tool’s capabilities define your architecture.

  2. Standards over proprietary. Where standards exist (MCP for tools, OpenTelemetry for observability, JSON Schema for function calling), prefer them. They reduce lock-in and increase composability.

  3. Governance from day one. Do not treat governance as a phase-two concern. Audit logging, cost controls, and capability scoping are cheap to implement early and expensive to retrofit. The regulatory landscape is tightening, not loosening.

The landscape will look different in a year. The principles will not.