Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Support OS

Support is a domain defined by urgency, repetition, and empathy. Customers have problems they need solved now. Many of those problems are variations of the same underlying issues. And every interaction happens against the backdrop of a relationship — a customer’s history, their frustration level, their value to the business.

A Support OS turns these characteristics from challenges into advantages.

The Domain

Customer support has properties that map well to the Agentic OS model:

  • High volume, high repetition: A significant percentage of support requests are variations of known issues. An OS with memory can recognize and resolve them faster each time.
  • Clear resolution criteria: A support case is open or closed. The customer’s problem is solved or it is not. This gives the system unambiguous feedback.
  • Rich context: Customer data, product state, account history, past interactions, known issues, documentation — there is abundant context to inform resolution.
  • Escalation hierarchy: Simple issues are resolved by frontline support. Complex issues escalate to specialists. Critical issues escalate to engineering. This maps naturally to the staged autonomy model.
  • Time sensitivity: Support latency directly impacts customer satisfaction. Speed matters, but accuracy matters more — a fast wrong answer is worse than a slightly slower correct one.

Architecture

Cognitive Kernel

The Support OS kernel handles intents like:

  • “I can’t log in” → Authentication troubleshooting workflow.
  • “My data is missing” → Data integrity investigation.
  • “How do I configure X?” → Documentation retrieval and guided walkthrough.
  • “Your service is down” → Incident correlation and status communication.
  • “I want a refund” → Policy evaluation and fulfillment.

The kernel classifies support requests by:

  • Category: Account, billing, technical, feature request, complaint.
  • Urgency: Service down (critical), functionality broken (high), inconvenience (medium), question (low).
  • Complexity: Known issue with known fix (simple), known issue with variable fix (moderate), unknown issue (complex).
  • Sentiment: Frustrated, neutral, satisfied. This affects communication style, not resolution strategy.

Process Fabric

Support workers:

  • Triage Agent: Classifies the request, checks for known issues, gathers initial context. Fast — completes in seconds.
  • Resolver: Applies known solutions to known problems. Has access to the knowledge base, troubleshooting scripts, and account tools.
  • Investigator: Diagnoses unknown problems. Has access to logs, system state, and diagnostic tools. Operates with more autonomy and time.
  • Communicator: Crafts customer-facing responses. Adapts tone and detail level to the customer’s context and sentiment.
  • Escalator: When a case exceeds the system’s capability, prepares the escalation package — summary, investigation results, customer context — for a human agent.

Memory Plane

Support-specific memory:

  • Customer memory: Each customer’s history — past issues, resolutions, preferences, sentiment trends. “This customer had a billing issue last month that took three interactions to resolve. Handle with extra care.”
  • Issue knowledge base: Known issues, their symptoms, root causes, and resolutions. Indexed by symptoms for fast matching. “Error code 4012 → API rate limit exceeded → suggest upgrading plan or implementing backoff.”
  • Resolution patterns: What worked and what did not. “For login issues on mobile, resetting the session token resolves 78% of cases. Password reset resolves another 15%.”
  • Product state: Current system status, known outages, recent deployments, feature flags. “The payment service was deployed 2 hours ago — check for related issues.”

Governance

Support-specific policies:

  • Data access scoping: Workers can view customer data relevant to the issue but cannot export it, share it across cases, or retain it beyond the interaction.
  • Action limits: The system can reset sessions and resend verification emails autonomously. It cannot modify billing, issue refunds above a threshold, or access personal data without specific authorization.
  • Escalation triggers: Cases exceeding N minutes without resolution auto-escalate. Cases involving data loss auto-escalate. Cases from enterprise accounts get priority routing.
  • Tone policies: Responses must be professional, empathetic, and solution-oriented. No blame, no jargon, no conditional language that promises outcomes (“this will definitely fix it”).
  • Privacy compliance: All interactions are GDPR/CCPA-compliant. Customer data is not used for training. Conversations are retained per retention policy.

Workflow: Technical Support Case

flowchart TD
  I[1. Intake\nCustomer submits issue] --> T[2. Triage\nClassify, check context,\nknown issues, system state]
  T -->|match found| R[3. Resolution Attempt\nApply known fix]
  T -->|no match| INV[Investigate\nUnknown issue workflow]
  R --> V[4. Verification\nConfirm fix worked]
  V -->|success| C[5. Communication\nCraft customer response]
  V -->|failed| INV
  C --> L[6. Learning\nUpdate memory & patterns]

1. Intake

A customer submits: “My dashboard has been showing ‘loading’ for the past hour. I’ve tried refreshing and clearing cache.”

2. Triage

The triage agent processes in parallel:

  • Classify: Technical issue, high urgency, moderate complexity.
  • Customer context: Enterprise account, 2 years, 3 prior tickets (all resolved), no current sentiment flags.
  • Known issue check: Search the knowledge base for “dashboard loading” symptoms. Match found: Known issue #847 — dashboard timeout when data exceeds 10K rows, deployed fix in v3.2.1, but some accounts may need cache invalidation.
  • System state check: No current outages. Dashboard service healthy. But: a data migration ran 3 hours ago that may have affected row counts.

3. Resolution Attempt

The resolver has a match with high confidence:

  • Known issue #847 applies.
  • The customer’s account has 12K data rows (confirmed via account tools).
  • Cache invalidation is the likely fix.

The resolver executes the cache invalidation (Level 0 autonomy — this is a safe, reversible operation) and prepares a verification step.

4. Verification

The resolver checks: Is the dashboard now loading? It queries the customer’s session state. Dashboard loaded successfully in 2.3 seconds.

5. Communication

The communicator crafts the response:

“Hi [Name], I’ve identified and resolved the issue with your dashboard. It was caused by a data processing update that affected the dashboard cache for accounts with large datasets like yours. I’ve cleared the affected cache, and your dashboard should now load normally.

If you experience this again, please let us know — we’re monitoring to ensure it doesn’t recur.

Is there anything else I can help with?“

Note what happens here: the communicator does not say “we ran cache invalidation on your account” (too technical). It does not say “it was a known bug” (undermines confidence). It gives enough explanation to be satisfying, frames the resolution positively, and invites continued engagement.

6. Learning

The system updates its memory:

  • Customer memory: “Dashboard loading issue resolved via cache invalidation. Related to data migration. Time to resolution: 47 seconds.”
  • Resolution patterns: Known issue #847 resolution success rate updated.
  • Product state: “Data migration affected N accounts with large datasets. Cache invalidation required.”

Workflow: Unknown Issue

When the triage agent finds no known issue match, the workflow shifts:

1. Investigation

The investigator gets the case with a broader toolkit:

  • Access to application logs for the customer’s account.
  • Access to system metrics around the reported time.
  • Access to recent deployment history.
  • Access to similar past cases (even if they are not exact matches).

The investigator forms hypotheses and tests them:

  1. Check logs for errors → Found: timeout on database query at 14:23.
  2. Check database performance → Found: slow query on the analytics table.
  3. Check recent schema changes → Found: missing index added in last migration but not applied to this shard.

2. Escalation Decision

The investigator has identified the root cause (missing index), but applying the fix (running the migration on the affected shard) exceeds the system’s autonomy level. This is a Level 3 action — it requires human engineering approval.

3. Escalation Package

The escalator prepares a handoff for the engineering team:

  • Summary: Customer dashboard timeout caused by missing database index on shard 7.
  • Evidence: Log timestamps, slow query plan, migration history showing shard 7 was skipped.
  • Proposed fix: Run pending migration on shard 7. Estimated impact: 2 minutes of read-only mode on the shard.
  • Customer context: Enterprise account, high priority. Customer informed that engineering is investigating.
  • Suggested customer response: Draft communication ready for review.

The human engineer gets everything needed to act — diagnosis, evidence, proposed fix, and customer context — in one package. Their job is to verify and approve, not to re-investigate.

The Human-AI Handoff

The Support OS is designed around seamless human-AI collaboration:

  • AI handles: Triage, known issue resolution, data gathering, response drafting.
  • Humans handle: Novel problems, judgment calls, policy exceptions, relationship-sensitive situations.
  • The handoff includes: Full context, investigation results, customer history, and draft communications. The human agent never starts from zero.

The system tracks which cases humans handle and why. Over time, if 80% of human handoffs for a particular issue type result in the same resolution, that resolution becomes a known pattern and the system handles it autonomously.

Metrics

  • First-contact resolution rate: Cases resolved without human involvement.
  • Mean time to resolution: From intake to confirmed resolution.
  • Escalation rate: Percentage of cases requiring human involvement, trending over time.
  • Customer satisfaction: Post-interaction ratings correlated with resolution type (AI vs. human vs. hybrid).
  • Knowledge base growth: New known issues added per week, resolution pattern accuracy.

What Makes This an OS, Not a Chatbot

A support chatbot matches keywords to canned responses. A Support OS resolves problems: it triages, investigates, applies fixes, verifies results, communicates appropriately, learns from outcomes, and knows when to involve a human.

The OS model provides what chatbots lack: memory across interactions (the customer’s history), process management (investigation workflows), governance (data access policies, escalation rules), and learning (resolution pattern improvement). These are not chatbot features — they are system properties that emerge from the OS architecture.

Reference Implementation

The Support OS uses Semantic Kernel with a HandoffOrchestration — the natural pattern for support workflows where control transfers between triage, resolution, and escalation agents based on context.

Plugins: Support Tools

# plugins/support_tools.py
from typing import Annotated
from semantic_kernel.functions import kernel_function

class SupportPlugin:
    """Support operations with data governance built in."""

    @kernel_function(description="Search knowledge base for matching known issues.")
    async def search_known_issues(
        self, symptoms: Annotated[str, "Symptom description"]
    ) -> Annotated[str, "Matching known issues as JSON"]:
        results = await kb_vector_store.similarity_search(symptoms, k=5)
        return json.dumps([{
            "id": r.metadata["issue_id"],
            "title": r.metadata["title"],
            "resolution": r.metadata["resolution"],
            "confidence": r.metadata["score"],
        } for r in results])

    @kernel_function(description="Get customer context for a ticket (no PII export).")
    async def get_customer_context(
        self, ticket_id: Annotated[str, "Ticket ID"]
    ) -> Annotated[str, "Customer context as JSON"]:
        customer = await db.get_customer_for_ticket(ticket_id)
        # Governance: PII fields are NOT included
        return json.dumps({
            "plan": customer.plan,
            "tenure_months": customer.tenure_months,
            "past_tickets_count": len(customer.tickets),
            "recent_tickets": [
                {"date": str(t.created), "category": t.category}
                for t in customer.tickets[-5:]
            ],
        })

    @kernel_function(description="Invalidate cache for a customer account.")
    async def invalidate_cache(
        self,
        account_id: Annotated[str, "Account ID"],
        cache_type: Annotated[str, "Cache type to invalidate"],
    ) -> Annotated[str, "Confirmation"]:
        await cache_service.invalidate(account_id, cache_type)
        return f"Cache '{cache_type}' invalidated for account {account_id}"

    @kernel_function(description="Search application logs for a customer.")
    async def search_logs(
        self,
        query: Annotated[str, "Log search query"],
        account_id: Annotated[str, "Account ID"],
        hours: Annotated[int, "Hours to search back"] = 24,
    ) -> Annotated[str, "Matching log entries as JSON"]:
        logs = await log_service.search(
            query=query, account_id=account_id,
            start_time=datetime.now() - timedelta(hours=hours),
        )
        return json.dumps([
            {"timestamp": str(l.ts), "level": l.level, "message": l.message[:200]}
            for l in logs[:50]
        ])

Agents: Support Specialists

# agents/support_agents.py
from semantic_kernel.agents import ChatCompletionAgent
from plugins.support_tools import SupportPlugin

def create_triage_agent(service) -> ChatCompletionAgent:
    return ChatCompletionAgent(
        service=service,
        name="Triage",
        instructions="""You are a support triage specialist.
For each ticket: classify category and urgency, load customer context,
check for matching known issues.
If a known issue matches with >80% confidence, hand off to Resolver.
If the issue is critical or unknown, hand off to Investigator.
Always include customer context in your handoff.""",
        plugins=[SupportPlugin()],
    )

def create_resolver_agent(service) -> ChatCompletionAgent:
    return ChatCompletionAgent(
        service=service,
        name="Resolver",
        instructions="""You are a support resolver. Apply known fixes.
Execute safe, reversible operations (cache invalidation, session reset).
After applying a fix, verify it worked.
If the fix fails, hand off to Investigator.""",
        plugins=[SupportPlugin()],
    )

def create_investigator_agent(service) -> ChatCompletionAgent:
    return ChatCompletionAgent(
        service=service,
        name="Investigator",
        instructions="""You are a support investigator. Diagnose unknown issues.
Search logs, check system metrics, form hypotheses and test them.
If you identify the root cause and a fix is within your capabilities, fix it.
If the fix requires engineering intervention, hand off to Escalator.""",
        plugins=[SupportPlugin()],
    )

def create_communicator_agent(service) -> ChatCompletionAgent:
    return ChatCompletionAgent(
        service=service,
        name="Communicator",
        instructions="""You are a customer communication specialist.
Write empathetic, clear, solution-focused responses.
No jargon. No blame. Adapt tone to customer sentiment.
Include what happened, what was done, and next steps.""",
    )

Kernel: Handoff Orchestration

The support workflow uses HandoffOrchestration — control passes dynamically between agents based on the situation:

# agents/kernel.py
import asyncio
from semantic_kernel.agents import HandoffOrchestration, ChatCompletionAgent
from semantic_kernel.agents.runtime import InProcessRuntime
from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion

async def handle_support_ticket(ticket_message: str) -> str:
    """
    Handoff orchestration: Triage → Resolver or Investigator → Communicator.
    The HandoffOrchestration pattern maps to the Support OS's
    dynamic routing based on issue classification.
    """
    service = AzureChatCompletion(
        deployment_name="gpt-4.1",
        endpoint="https://your-endpoint.openai.azure.com/",
    )

    triage = create_triage_agent(service)
    resolver = create_resolver_agent(service)
    investigator = create_investigator_agent(service)
    communicator = create_communicator_agent(service)

    # Handoff orchestration: agents transfer control dynamically
    orchestration = HandoffOrchestration(
        members=[triage, resolver, investigator, communicator],
    )

    runtime = InProcessRuntime()
    await runtime.start()

    result = await orchestration.invoke(
        task=f"Support ticket: {ticket_message}",
        runtime=runtime,
    )
    output = await result.get()

    await runtime.stop_when_idle()
    return output

Key patterns demonstrated: handoff orchestration (dynamic routing between triage/resolver/investigator), plugins with data governance (customer PII scoped out of responses), capability separation (resolver can execute fixes, investigator can search logs, communicator only writes text), and semantic search over known issues (vector store as episodic memory).


Try it yourself: The complete Support OS — agents (@triage, @resolver, @investigator, @communicator), skills (/triage-and-resolve, /escalation), escalation templates, and tutorial — is available at implementations/support-os/.