engineering · architecture

How We Let AI Agents Debug Our Live Electron App

We built a 138-line HTTP proxy that exposes 480 IPC channels to AI agents — and it changed how we build software.

Callipso Team · March 1, 2026 · 11 min read


100% of Callipso's code is written by AI agents. Not most of it. All of it. The human designs, reviews, and approves — the agents implement, debug, and verify.

This setup only works if the agents can actually see what the app is doing. An AI that writes code but cannot inspect runtime state is flying blind. It ships something, hopes it works, and asks the human "did that fix it?" That is not a workflow. That is a coin flip.

So we built a system that gives AI agents full workshop access to the running Electron app. We call it the dev harness. It runs on port 3010, it is disabled in production, and it changed everything about how we build Callipso.

The Problem: Electron Apps Are Black Boxes

Electron has a split-brain architecture. The main process (Node.js) and the renderer process (Chromium) communicate through IPC channels. When an AI agent writes a new handler, it has no idea whether the handler actually works until a human opens the app and clicks something.

Callipso has 480 IPC channels across 430+ handlers. Terminals, voice routing, session management, 3D visualization, recording pipelines — all coordinated through IPC. A bug in any handler can silently break features that seem unrelated.

The traditional debugging loop looks like this:

Agent writes code → Agent asks human to test → Human clicks around →
Human reports what happened → Agent guesses what went wrong → Repeat

Every iteration costs minutes. Multiply that by hundreds of handlers and the process becomes unsustainable.

The Solution: 138 Lines of IPC Proxy

The core of the dev harness is a thin HTTP proxy that auto-exposes every IPC channel as an HTTP endpoint. The entire proxy is 138 lines of TypeScript.

POST /ipc/invoke/store:getState      → calls the real handler, returns the result
POST /ipc/invoke/get-app-version     → same pattern, any channel
POST /ipc/send/save-hotkeys          → fire-and-forget variant
GET  /ipc/channels                   → lists all 480 channels
GET  /ipc/stats                      → timing and error stats per handler

When a new IPC channel is added to our channel registry, it instantly appears as an HTTP endpoint. Zero maintenance. Zero configuration. The proxy reads from the same channel definitions that the preload script uses, so it is always in sync.
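The auto-exposure idea can be sketched in a few lines: derive the route table from the registry instead of maintaining it by hand. This is an illustration, not Callipso's actual code; the registry contents and the buildRoutes helper are hypothetical stand-ins for what channelDefinitions.ts provides.

```typescript
// Sketch: derive HTTP routes from a channel registry, assuming a registry
// shaped roughly like the post's channelDefinitions.ts. Names are illustrative.
type ChannelKind = 'invoke' | 'send';

const channelDefinitions: Record<string, ChannelKind> = {
  'store:getState': 'invoke',
  'get-app-version': 'invoke',
  'save-hotkeys': 'send',
};

// Because routes are derived from the registry, a new channel appears as an
// endpoint with no extra code: the proxy is a mirror, not a copy.
function buildRoutes(defs: Record<string, ChannelKind>): Map<string, string> {
  const routes = new Map<string, string>();
  for (const [channel, kind] of Object.entries(defs)) {
    routes.set(`/ipc/${kind}/${channel}`, channel);
  }
  return routes;
}

const routes = buildRoutes(channelDefinitions);
console.log(routes.has('/ipc/invoke/store:getState')); // true
console.log(routes.has('/ipc/send/save-hotkeys'));     // true
```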

The critical design decision: the proxy calls through the real IPC pipeline. It does not mock anything. An HTTP request to /ipc/invoke/store:getState triggers this full chain:

HTTP request (port 3010)
  → executeJavaScript in renderer
    → window.api.invoke('store:getState')
      → preload validates channel
        → ipcRenderer.invoke()
          → ipcMain.handle()
            → actual handler code
              → response envelope

This means every call through the dev harness tests the entire stack: serialization, preload validation, handler logic, and response formatting. Bugs caught here are real bugs, not artifacts of a test environment.
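The bridge from HTTP into the renderer can be sketched as string generation plus Electron's webContents.executeJavaScript. The buildInvokeScript helper below is a hypothetical illustration of the pattern, kept pure so the Electron call itself stays in a comment.

```typescript
// Sketch: turn an HTTP request into a renderer-side invoke. The generated
// snippet runs inside the renderer, passes through the preload's validated
// window.api.invoke, and resolves with the real handler's response.
function buildInvokeScript(channel: string, args: unknown[]): string {
  // JSON.stringify safely embeds the channel name and arguments as literals.
  return `window.api.invoke(${JSON.stringify(channel)}, ...${JSON.stringify(args)})`;
}

// In the Electron main process this would be used roughly as:
//   const result = await win.webContents.executeJavaScript(
//     buildInvokeScript(channel, args));
console.log(buildInvokeScript('store:getState', []));
```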

AI-Debuggable Response Envelopes

Early on, our handlers returned raw values. An empty array [] could mean "no results found" or "the subsystem has not initialized yet." An AI agent cannot tell the difference.

We adopted a response envelope that solves this:

```typescript
// Success
{ ok: true, data: [...] }

// Failure — with recovery instructions
{
  ok: false,
  code: 'NOT_INITIALIZED',
  detail: 'ArchiveManager not available — app may still be starting up',
  recovery: 'Wait 2s and retry. Check /dev/state-full for init status.',
  data: []
}
```

The recovery field is the key innovation. It tells the AI agent exactly what to do next. Not "something went wrong" — but "check this specific endpoint, then retry." The agent can self-correct without human intervention.

Every handler in Callipso now uses this envelope. The instrumentedHandleEnvelope wrapper automatically catches errors and formats them as structured failures. The seven standard error codes (NOT_INITIALIZED, NOT_FOUND, VALIDATION, UNAUTHORIZED, RUNTIME, CONFLICT, LIMIT_EXCEEDED) are consistent across the entire codebase.

Stytch published a principle that captures this well: "If an AI agent cannot figure out how your API works, neither can your users." We agree. Verbose, structured errors are not just for AI — they make the entire system more debuggable for everyone.

State Aggregation: The Full Picture in One Call

Beyond individual IPC channels, the dev harness provides aggregation endpoints that combine multiple data sources:

GET /dev/state-full         → 15 sources in one response
GET /dev/store-state        → terminal store + session maps
GET /dev/renderer-state     → sync state + module health
GET /dev/coordinator-state  → all coordinator internals
GET /dev/logs?limit=50      → recent logs with filters

/dev/state-full returns overlay state, voice router state, poller health, active IDE, STT mode, worktree list, per-adapter stats, session maps, module health, and DI container tokens — all in a single JSON response. An AI agent can take one snapshot and understand the entire system state.
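The aggregation pattern behind an endpoint like /dev/state-full can be sketched as a fan-out over named readers where one failed source cannot poison the snapshot. The source names and stateFull helper below are illustrative assumptions, not Callipso's real internals.

```typescript
// Sketch: read many state sources concurrently, report failures inline.
type SourceReader = () => Promise<unknown>;

// Illustrative sources; the real endpoint pulls 15 of these.
const sources: Record<string, SourceReader> = {
  overlay: async () => ({ visible: false }),
  voiceRouter: async () => ({ mode: 'push-to-talk' }),
  pollers: async () => { throw new Error('poller registry not ready'); },
};

// One failing reader is recorded as { ok: false } instead of rejecting
// the whole snapshot, so the agent still gets the full picture.
async function stateFull(readers: Record<string, SourceReader>) {
  const entries = await Promise.all(
    Object.entries(readers).map(async ([name, read]) => {
      try {
        return [name, { ok: true, data: await read() }] as const;
      } catch (err) {
        return [name, { ok: false, error: String(err) }] as const;
      }
    }),
  );
  return Object.fromEntries(entries);
}

stateFull(sources).then((snap) => console.log(JSON.stringify(snap, null, 2)));
```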

There is also /test/execute, which runs arbitrary JavaScript in the renderer context:

```bash
curl -X POST http://localhost:3010/test/execute \
  -H "Content-Type: application/json" \
  -d '{"js": "document.querySelectorAll(\".terminal-list-item\").length"}'
```

This endpoint is more important than it looks. In software engineering terminology, it enables invasive in-process introspection — the injected code runs inside the live application's memory space, with full access to every variable, every DOM element, every internal object. The agent is not observing the app from outside. It is reaching into the running process.

Two Testing Patterns: In-Process vs External

The dev harness actually enables two distinct debugging patterns, each with established terminology in the industry.

Pattern 1: Invasive In-Process Introspection

The agent writes a JavaScript snippet, sends it to /test/execute, and it executes inside the running Electron renderer. The code shares memory with the app. It can inspect internal state that no API endpoint exposes — closure variables, DOM structure, module-level caches, anything.

```bash
# This code runs INSIDE the live app's process
curl -X POST http://localhost:3010/test/execute \
  -d '{"js": "(function() { return { terminalCount: document.querySelectorAll(\".terminal-list-item\").length, activeTab: document.querySelector(\".tab-btn.active\")?.dataset.tab } })()"}'
```

Microsoft Research calls this interactive debugging. Their Debug2Fix paper (February 2026) showed that weaker AI models with in-process debugging tools matched or exceeded stronger models without them. A separate study, debug-gym, found 30-182% improvement on real bug-fixing benchmarks when AI agents could evaluate expressions inside running processes. The tool matters more than the model.

The academic term for the loop is Inspect-Perturb-Validate (from the InspectCoder paper): inspect internal state, change something, verify the result — all in-process.

Pattern 2: Synthetic Probing

The agent writes a test script as a separate file, runs it as its own process, and the script hits the dev harness over HTTP. The test code never enters the app's memory. It can only see what the endpoints expose.

```bash
# This code runs OUTSIDE the app, probing it over HTTP
node /tmp/test-app-consistency.js  # hits port 3010 endpoints
```

Martin Fowler calls this out-of-process component testing. In the observability world, it is called synthetic monitoring — proactive tests that simulate interactions and validate responses.

Why Both Matter

| Dimension | In-Process Introspection | Synthetic Probing |
| --- | --- | --- |
| What it sees | Everything — all internal state | Only what endpoints expose |
| Where it runs | Inside the app's memory | In a separate process |
| Diagnostic power | Can evaluate any expression | Can only observe inputs/outputs |
| Safety | Can cause side effects | Cannot corrupt internal state |
| Survives refactoring | No — coupled to internals | Yes — tests the external contract |
| Best for | Diagnosing root causes | Catching regressions |

The combination of both is called grey-box testing — partial internal access plus external verification. Our dev harness supports both patterns through a single system, which is uncommon.

Emergent Agent Behavior

Something unexpected emerged from this architecture. During complex multi-step tasks, the AI agent started proactively creating verification snippets and executing them against the live app between implementation steps — without being asked to.

The agent would write a JavaScript test to a temporary file, inject it into the running app via /test/execute (in-process introspection), read the result, adjust its approach, and continue. It wove verification into its reasoning process, treating the dev harness as a thinking tool rather than a testing tool.

This was not designed. We built the endpoints, the structured responses, the aggregation views — and the behavior emerged. The agent found the most effective way to use what was available.

Microsoft's harness engineering research supports this observation: agents with access to interactive debugging develop their own verification strategies. The harness does not need to prescribe a workflow. It needs to provide sufficient observability surface, and the agent figures out the rest.

Dev Harness vs MCP Server vs API

These terms are easy to conflate. Here is how they relate:

An API server is any HTTP server that accepts requests and returns responses. Callipso's Express server on port 3000 is an API server. The dev harness on port 3010 is also an API server. Curl, JSON, REST — standard HTTP.

An MCP server (Model Context Protocol) is a newer standard designed specifically for AI agent communication. It adds three capabilities on top of HTTP: tool discovery (the agent asks "what can you do?"), resources (read-only data feeds), and prompt templates. MCP clients like Claude Desktop or Cursor can connect to any MCP server and automatically discover its capabilities.

A dev harness is our term for the port 3010 server. It is an API server that exposes app internals for debugging. Not a standard term — just ours.

The relationship between them is layered:

MCP Server     (standardized discovery + descriptions)
    ↓ wraps
HTTP API       (the actual endpoints)
    ↓ calls
App Internals  (IPC handlers, state managers, etc.)

MCP servers typically wrap existing HTTP APIs to make them discoverable by AI tools. For our use case — Claude Code debugging via terminal and curl — the HTTP dev harness is more direct. An MCP layer would add value if we wanted other AI tools to introspect the running app without writing curl commands.

The dev harness's /ipc/channels endpoint already provides a form of tool discovery — it lists all 480 channels with their types and handler stats. It is functionally similar to MCP's tools/list, just over plain HTTP.

Zero Maintenance Cost

This is the part that sounds too good to be true, so it is worth explaining precisely.

The IPC proxy does not maintain its own list of channels. It reads from channelDefinitions.ts — the same file the app's preload script uses to validate IPC calls. The proxy is a mirror, not a copy. When a developer adds a new IPC channel to channelDefinitions.ts, the proxy picks it up automatically. No second file to update. No configuration. No registration step.

The app has grown from roughly 400 channels to over 500. The proxy has not been edited once since it was written. It is 138 lines and it will still be 138 lines when we reach 1,000 channels.

Auto-Discovery: The Agent Reads the Menu

The first thing an AI agent can do when it connects to the dev harness is ask "what can you do?"

```bash
curl http://localhost:3010/ipc/channels
# Returns: { count: 506, channels: ["store:getState", "get-app-version", ...] }
```

That single call returns every available channel — names and types. The agent does not need documentation. It does not need to have seen these channels before. It discovers them at runtime, picks the ones relevant to its task, and calls them.

This is the same pattern used by MCP (tools/list), by Claude Code's skill system, and by OpenAPI's schema endpoint. The underlying principle is runtime tool discovery — the agent asks the system what is available rather than relying on pre-written docs that can go stale.

Disposable Tests, Not Maintained Tests

The verification tests the agent creates are throwaway. They are written to /tmp/, executed once, and discarded. They are never committed to the codebase. There is nothing to keep in sync.

This works because:

  1. The endpoints are the source of truth. The agent does not test against hardcoded expectations. It asks the live app "what is your state right now?" and reasons about the answer.
  2. Tests are generated fresh each time. If the app changes, the agent generates a different test. If a channel is renamed, the agent discovers the new name via /ipc/channels.
  3. Discovery replaces documentation. Instead of reading docs about what channels exist, the agent calls the discovery endpoint and gets the current list.

Compare this to traditional test suites where test-terminal-state.spec.ts is committed, and every change to the terminal store requires updating that test file. That is maintenance cost. The dev harness approach has none of it — the agent generates probes on the fly and throws them away.
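A disposable probe of this kind can be sketched as a tiny script: discover the channels over HTTP, then check one invariant. This is a hypothetical example, not one of our actual probes; the fetch function is injected so the same logic can be exercised without a live app on port 3010.

```typescript
// Sketch of a throwaway probe an agent might write to /tmp/ and run once.
type Fetcher = (url: string) => Promise<{ json(): Promise<any> }>;

async function probeChannelCount(baseUrl: string, fetchFn: Fetcher): Promise<boolean> {
  // Step 1, discovery: ask the running app what channels exist right now.
  const res = await fetchFn(`${baseUrl}/ipc/channels`);
  const { count, channels } = await res.json();
  // Step 2, invariant: the advertised count matches the list length.
  return channels.length === count;
}

// Against a real app this would be:
//   probeChannelCount('http://localhost:3010', fetch)
// Here a fake fetcher stands in for the dev harness.
const fakeFetch: Fetcher = async () => ({
  json: async () => ({ count: 2, channels: ['store:getState', 'get-app-version'] }),
});
probeChannelCount('http://localhost:3010', fakeFetch).then((ok) => console.log(ok));
```

Because the probe asks the live app rather than encoding expectations, it is safe to delete after one run; the next probe is regenerated from whatever the discovery endpoint reports then.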

Scaling Characteristics

More channels do not add maintenance. Going from 100 to 500 IPC channels required zero changes to the proxy. The overhead is one O(1) set lookup per request.

Aggregation endpoints scale better than raw channels. /dev/state-full pulling 15 sources in one call is more useful than 15 individual IPC invocations. The agent gets the full picture without correlating multiple responses.

Handler instrumentation is automatic. Every handler wrapped with instrumentedHandleEnvelope gets timing percentiles (p50, p95, p99), error rates, and call counts — discoverable from /ipc/stats with no manual setup.
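The per-handler stats can be sketched with a small accumulator. This is an illustration only: the HandlerStats class is hypothetical, and it uses the simple nearest-rank percentile method, which may differ from the real instrumentation.

```typescript
// Sketch: per-handler timing and error stats of the kind /ipc/stats reports.
class HandlerStats {
  private durations: number[] = [];
  errors = 0;

  record(ms: number, failed = false): void {
    this.durations.push(ms);
    if (failed) this.errors++;
  }

  // Nearest-rank percentile over recorded durations.
  percentile(p: number): number {
    const sorted = [...this.durations].sort((a, b) => a - b);
    const rank = Math.ceil((p / 100) * sorted.length) - 1;
    return sorted[Math.max(0, rank)];
  }

  summary() {
    return {
      calls: this.durations.length,
      errors: this.errors,
      p50: this.percentile(50),
      p95: this.percentile(95),
      p99: this.percentile(99),
    };
  }
}

const stats = new HandlerStats();
[5, 7, 9, 11, 200].forEach((ms) => stats.record(ms));
console.log(stats.summary()); // p95 and p99 expose the slow outlier
```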

The practical limit is not the number of channels — it is the quality of the response envelopes. A handler that returns {ok: false, code: 'NOT_INITIALIZED', recovery: 'check /dev/state-full'} is worth ten handlers that return raw null.

What We Learned

  1. The harness matters more than the model. Martin Fowler coined the term "harness engineering" based on OpenAI's experience: agent effectiveness is bounded by what it can see and verify. A smarter model with no observability is less effective than a simpler model with full app introspection. Debug2Fix proved this empirically — the debugging tools outweighed the model's raw capability.

  2. In-process introspection is the high-leverage pattern. Synthetic probing (hitting HTTP endpoints) is useful, but invasive in-process introspection (injecting code into the live app) is where the step change happens. The ability to evaluate arbitrary expressions inside the running process gives the agent diagnostic power that no external API can match.

  3. Design for agent consumption. Verbose error messages with explicit recovery paths. Structured responses with discriminated unions. Machine-readable state endpoints. Stytch's principle applies: "If an AI agent cannot figure out how your API works, neither can your users."

  4. Real pipelines, not mocks. The dev harness calls through the actual IPC stack. Bugs caught here are real bugs. Mocks give false confidence.

  5. Grey-box testing emerges naturally. When you give agents both in-process and external tools, they combine them organically — inspecting internals to diagnose, then verifying through the external interface. You do not need to prescribe this. Provide the surface and the agent figures out the workflow.

  6. Dev-only is non-negotiable. The dev harness exposes everything — arbitrary JS execution, full state dumps, IPC handler introspection. This is a development tool. It is completely disabled in production builds (!app.isPackaged).

The entire dev harness — IPC proxy, state aggregation, logging, control endpoints — is under 650 lines of TypeScript. It gives AI agents the same level of access that a human developer gets from Chrome DevTools, but through an interface agents can actually use: HTTP and JSON.

We think this is the future of AI-assisted development. Not smarter models, but better harnesses.
