engineering, voice-routing, architecture

Voice Routing: A Technical Deep Dive

How Callipso routes voice commands from clipboard to terminal in under 200ms — the state machine, adapter architecture, and hotkey pipeline.

Callipso Team | January 20, 2026 | 8 min read

The core of Callipso is deceptively simple: read text from the clipboard, send it to a terminal. But making that reliable across five different terminal hosts, with sub-200ms latency, while handling edge cases like stale clipboard content and concurrent sessions — that is where the engineering lives.

This post walks through the voice routing pipeline, from the moment you press a hotkey to the moment your words appear in a terminal.

The Pipeline

Text flows along the path below, and Callipso handles it in four stages:

STT Engine → Clipboard → Callipso VoiceRouter → Terminal Adapter → Terminal

Callipso is deliberately STT-agnostic. It does not care whether you use SuperWhisper, Whisper.cpp, macOS Dictation, Wispr Flow, or our built-in Parakeet engine. As long as your STT writes transcribed text to the system clipboard, Callipso can route it.

Stage 1: Hotkey Detection

When you press the configured hotkey (default: Cmd+Shift+V), Callipso's global hotkey listener fires. This is registered via Electron's globalShortcut API, which captures the keystroke even when Callipso is not in focus.
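
A minimal sketch of the registration step, not Callipso's actual code. Electron's globalShortcut.register(accelerator, callback) returns false when the shortcut cannot be claimed, so the result is worth checking. The ShortcutRegistrar interface, registerVoiceHotkey, and onVoiceHotkey are hypothetical names; the interface mirrors the one Electron method used here so the sketch runs without Electron. In a real app you would pass Electron's globalShortcut once app.whenReady() has resolved.

```typescript
// Structural subset of Electron's globalShortcut module (assumption: only
// register() is needed here). Pass `globalShortcut` itself in a real app.
interface ShortcutRegistrar {
  register(accelerator: string, callback: () => void): boolean;
}

// The default hotkey from the post, written as an Electron accelerator string.
const DEFAULT_ACCELERATOR = "Command+Shift+V";

// Hypothetical helper: registers the hotkey and reports whether it succeeded.
function registerVoiceHotkey(
  registrar: ShortcutRegistrar,
  onVoiceHotkey: () => void,
  accelerator: string = DEFAULT_ACCELERATOR,
): boolean {
  const ok = registrar.register(accelerator, onVoiceHotkey);
  if (!ok) {
    // register() returns false if the shortcut could not be claimed.
    console.warn(`Could not register ${accelerator}; another app may own it.`);
  }
  return ok;
}
```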

The hotkey triggers the VoiceRouter, which is implemented as a finite state machine with three states:

  • Idle — Waiting for hotkey press
  • Reading — Clipboard has been read, determining target
  • Routing — Sending text to the resolved terminal

Each state transition is logged and observable through the dev harness, making debugging straightforward.
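
The three states above can be sketched as a small transition table. This is an illustration under assumptions: the event names (hotkey, targetResolved, delivered, aborted) and the class name VoiceRouterFsm are invented for the example, and the post does not say which transitions the real VoiceRouter allows.

```typescript
type RouterState = "idle" | "reading" | "routing";
// Event names are assumptions made for this sketch.
type RouterEvent = "hotkey" | "targetResolved" | "delivered" | "aborted";

// Allowed transitions; any event not listed for a state is rejected.
const TRANSITIONS: Record<RouterState, Partial<Record<RouterEvent, RouterState>>> = {
  idle: { hotkey: "reading" },
  reading: { targetResolved: "routing", aborted: "idle" },
  routing: { delivered: "idle", aborted: "idle" },
};

class VoiceRouterFsm {
  private current: RouterState = "idle";
  private readonly log: (line: string) => void;

  constructor(log: (line: string) => void = console.log) {
    this.log = log;
  }

  get state(): RouterState {
    return this.current;
  }

  // Applies an event. Returns false and leaves the state unchanged
  // when the event is not legal in the current state.
  dispatch(event: RouterEvent): boolean {
    const next = TRANSITIONS[this.current][event];
    if (next === undefined) {
      this.log(`rejected ${event} in ${this.current}`);
      return false;
    }
    // Every transition is logged, matching the observability goal above.
    this.log(`${this.current} -> ${next} (${event})`);
    this.current = next;
    return true;
  }
}
```

Keeping the legal transitions in one table means an out-of-order event (for example, a delivery report while idle) is rejected and logged rather than silently corrupting the state.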

Stage 2: Clipboard Read

The VoiceRouter reads the system clipboard via Electron's clipboard.readText(). But it does not blindly send whatever is there. Several checks run first:

  • Staleness detection: If the clipboard content is identical to what was routed last time, it is likely stale. Callipso can be configured to skip or warn in this case.
  • Content validation: Empty strings, extremely long strings, and binary content are filtered out.
  • Timestamp correlation: When Parakeet (the built-in STT) is active, Callipso compares the clipboard timestamp with the last STT event to ensure freshness.
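
The first two checks can be sketched as a pure function. The names checkClipboard and ClipboardVerdict, the 10000-character limit, and the control-character heuristic for "binary" are all assumptions for illustration; the post does not specify them. The sketch always skips stale content (the warn option is left out), and timestamp correlation is omitted because it depends on the STT engine.

```typescript
type ClipboardVerdict =
  | { ok: true; text: string }
  | { ok: false; reason: "empty" | "too-long" | "binary" | "stale" };

// Assumed limit; the post does not say what counts as "extremely long".
const MAX_CHARS = 10000;

// Heuristic (assumption): NUL and other non-whitespace control characters
// suggest the clipboard holds binary data rather than dictated text.
const CONTROL_CHARS = /[\u0000-\u0008\u000B\u000C\u000E-\u001F]/;

// `raw` is what clipboard.readText() returned; `lastRouted` is the text
// sent on the previous successful route, or null if there was none.
function checkClipboard(raw: string, lastRouted: string | null): ClipboardVerdict {
  const text = raw.trim();
  if (text.length === 0) return { ok: false, reason: "empty" };
  if (text.length > MAX_CHARS) return { ok: false, reason: "too-long" };
  if (CONTROL_CHARS.test(text)) return { ok: false, reason: "binary" };
  if (text === lastRouted) return { ok: false, reason: "stale" };
  return { ok: true, text };
}
```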

Stage 3: Target Resolution

This is where things get interesting. Callipso needs to figure out which terminal should receive the text. The resolution strategy depends on your configuration:

Direct routing

The simplest mode. You designate a specific terminal (by IDE and index), and all voice input goes there.

Smart routing (Claude Code)

When routing to Claude Code, Callipso queries the SessionStore — a bidirectional map of SessionId to TerminalUUID. It finds the most appropriate session based on:

  1. Idle sessions — Sessions waiting for input get priority
  2. Most recent — Among idle sessions, the most recently active wins
  3. Auto-create — If no idle session exists, Callipso can spawn a new one

This is powered by the TerminalStoreManager, which continuously polls each IDE adapter for terminal state. The polling runs on a per-adapter cadence (typically 1-2 seconds) and uses diff-based updates to minimize overhead.
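
The priority rules for smart routing can be sketched as a single selection function. SessionInfo, Resolution, and resolveSession are hypothetical names, and the fields are assumptions about what a polled session record might contain.

```typescript
// Assumed shape of a polled session record.
interface SessionInfo {
  sessionId: string;
  idle: boolean;        // true when the session is waiting for input
  lastActiveMs: number; // epoch milliseconds of the last activity
}

type Resolution =
  | { kind: "existing"; sessionId: string }
  | { kind: "create" }
  | { kind: "none" };

function resolveSession(
  sessions: readonly SessionInfo[],
  autoCreate: boolean,
): Resolution {
  // Rules 1 and 2: consider only idle sessions, keep the most recently active.
  let best: SessionInfo | undefined;
  for (const s of sessions) {
    if (!s.idle) continue;
    if (best === undefined || s.lastActiveMs > best.lastActiveMs) best = s;
  }
  if (best !== undefined) return { kind: "existing", sessionId: best.sessionId };
  // Rule 3: no idle session, so fall back to auto-create if enabled.
  return autoCreate ? { kind: "create" } : { kind: "none" };
}
```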

Focus-follows-voice

A mode where Callipso routes to whatever terminal is currently focused. Simple but effective for single-terminal workflows.

Stage 4: Terminal Delivery

Once the target terminal is resolved, the text is dispatched through the appropriate Application Adapter. Each supported IDE has its own adapter:

  • VS Code / Cursor / Windsurf — Uses the Callipso IDE extension, which exposes a local HTTP API on ports 3001-3003. The adapter sends a POST request with the terminal ID and text.
  • Terminal.app / iTerm2 — Uses AppleScript via osascript to send keystrokes to the terminal process.
  • Warp — Uses AppleScript with Warp-specific window targeting.
  • Claude Desktop — Routes through a dedicated coordinator that manages Claude Desktop's unique terminal model.

All adapters inherit from BaseAppAdapter, which provides common logic for terminal discovery, state polling, and the convertTerminals() normalization step. Each adapter only needs to implement the platform-specific pieces — typically sendText(), focusTerminal(), and the polling query.
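
A rough sketch of how that split might look. Only BaseAppAdapter, convertTerminals(), sendText(), and focusTerminal() are names from the post; everything else (TerminalInfo, pollRaw, toTerminal, IdeExtensionAdapter, PostJson, and the endpoint paths /send-text, /focus, and /terminals) is hypothetical, since the extension's HTTP API is not documented here. The HTTP call is injected so the example runs without a live extension.

```typescript
interface TerminalInfo { id: string; title: string }

abstract class BaseAppAdapter<Raw> {
  // Platform-specific pieces each adapter supplies.
  abstract sendText(terminalId: string, text: string): Promise<void>;
  abstract focusTerminal(terminalId: string): Promise<void>;
  protected abstract pollRaw(): Promise<Raw[]>;
  protected abstract toTerminal(raw: Raw): TerminalInfo;

  // Shared logic: poll the host, then normalize into a common shape.
  async convertTerminals(): Promise<TerminalInfo[]> {
    return (await this.pollRaw()).map((raw) => this.toTerminal(raw));
  }
}

// Injected transport (assumption); a real adapter might wrap fetch().
type PostJson = (url: string, body: object) => Promise<unknown>;
// Hypothetical wire format returned by the extension.
type RawIdeTerminal = { terminalId: string; name: string };

class IdeExtensionAdapter extends BaseAppAdapter<RawIdeTerminal> {
  private readonly baseUrl: string;
  private readonly postJson: PostJson;

  constructor(port: number, postJson: PostJson) {
    super();
    this.baseUrl = `http://127.0.0.1:${port}`; // e.g. a port in 3001-3003
    this.postJson = postJson;
  }

  async sendText(terminalId: string, text: string): Promise<void> {
    await this.postJson(`${this.baseUrl}/send-text`, { terminalId, text });
  }

  async focusTerminal(terminalId: string): Promise<void> {
    await this.postJson(`${this.baseUrl}/focus`, { terminalId });
  }

  protected async pollRaw(): Promise<RawIdeTerminal[]> {
    return (await this.postJson(`${this.baseUrl}/terminals`, {})) as RawIdeTerminal[];
  }

  protected toTerminal(raw: RawIdeTerminal): TerminalInfo {
    return { id: raw.terminalId, title: raw.name };
  }
}
```

An AppleScript-based adapter would extend the same base class and shell out to osascript inside sendText() instead of posting JSON.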

Latency Budget

The entire pipeline needs to complete in under 200ms to feel instantaneous. Here is the typical breakdown when the target is reached through the IDE extension:

Stage                          Time
Hotkey detection               ~5ms
Clipboard read + validation    ~10ms
Target resolution              ~15ms
HTTP to IDE extension          ~30ms
Extension processes command    ~20ms
Total                          ~80ms

The budget leaves plenty of headroom. Even in worst-case scenarios (cold adapter, session lookup miss), the pipeline stays well under 200ms.
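
To make the headroom concrete, the arithmetic can be checked in a few lines. The stage timings are the typical figures from the table; the constant and function names are made up for the example.

```typescript
// Typical per-stage timings from the table above, in milliseconds.
const STAGE_MS: ReadonlyArray<readonly [string, number]> = [
  ["Hotkey detection", 5],
  ["Clipboard read + validation", 10],
  ["Target resolution", 15],
  ["HTTP to IDE extension", 30],
  ["Extension processes command", 20],
];

const BUDGET_MS = 200;

function totalMs(stages: ReadonlyArray<readonly [string, number]>): number {
  return stages.reduce((sum, [, ms]) => sum + ms, 0);
}

const used = totalMs(STAGE_MS);    // 5 + 10 + 15 + 30 + 20 = 80
const headroom = BUDGET_MS - used; // 200 - 80 = 120
console.log(`used ${used}ms, headroom ${headroom}ms`);
```

The two HTTP-related rows account for 50 of the 80 milliseconds, so the network hop to the extension is the first place to look if the typical path slows down.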

Session Routing and Cross-Pollination

One of the trickiest bugs we encountered was "cross-pollination" — voice text arriving in the wrong Claude Code session when multiple sessions are running across different IDE windows.

The root cause was using SSE_PORT as the session identifier. In Cursor, all terminals in the same window share the same SSE port, making it impossible to distinguish between sessions.

The fix was to introduce SessionId — a UUID generated per Claude Code session via the hooks system. Each session registers its SessionId with the SessionStore, which maintains a bidirectional map to the terminal's TerminalUUID. Routing now uses SessionId for disambiguation, completely eliminating cross-pollination.
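
A bidirectional map of this kind can be sketched with two plain Map objects kept in sync. This is an illustration, not the real SessionStore: the method names are hypothetical, and the rule of evicting an old pairing on re-registration is an assumption.

```typescript
class SessionStore {
  private readonly bySession = new Map<string, string>();  // SessionId -> TerminalUUID
  private readonly byTerminal = new Map<string, string>(); // TerminalUUID -> SessionId

  // Registers a pair, evicting any previous pairing on either side
  // so the two maps never disagree.
  register(sessionId: string, terminalUuid: string): void {
    const oldTerminal = this.bySession.get(sessionId);
    if (oldTerminal !== undefined) this.byTerminal.delete(oldTerminal);
    const oldSession = this.byTerminal.get(terminalUuid);
    if (oldSession !== undefined) this.bySession.delete(oldSession);
    this.bySession.set(sessionId, terminalUuid);
    this.byTerminal.set(terminalUuid, sessionId);
  }

  terminalFor(sessionId: string): string | undefined {
    return this.bySession.get(sessionId);
  }

  sessionFor(terminalUuid: string): string | undefined {
    return this.byTerminal.get(terminalUuid);
  }

  unregister(sessionId: string): void {
    const terminal = this.bySession.get(sessionId);
    if (terminal !== undefined) this.byTerminal.delete(terminal);
    this.bySession.delete(sessionId);
  }
}
```

Because each SessionId maps to exactly one terminal and each terminal to at most one session, two sessions in the same window can no longer be confused the way they were when a shared port served as the key.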

What We Learned

Building a reliable voice routing system taught us several lessons:

  1. Never trust the clipboard — It is a shared resource. Always validate freshness and content.
  2. Adapter abstraction pays off — Each terminal host has wildly different APIs. A clean adapter interface keeps the core routing logic simple.
  3. Bidirectional maps are worth the complexity — The SessionStore could have been a simple key-value map, but bidirectional lookup (session-to-terminal and terminal-to-session) eliminated an entire class of routing bugs.
  4. Observability is not optional — Every state transition in the VoiceRouter is logged. When something goes wrong, the answer is always in the logs.

The voice routing pipeline is the heart of Callipso. Everything else — the overlay UI, the Space visualizer, the forum — exists to support this core loop: speak, press, execute.
