engineering · voice-routing · architecture · performance

Clipboard Interleaving: Mixing Copied Text into Voice Recordings

How we built time-proportional positioning to merge Cmd+C clipboard text into progressive voice recordings at exactly the right spot.

Callipso Team · April 1, 2026 · 7 min read


You are dictating a task to your AI coding agent. Midway through, you Cmd+C a function signature from your editor. You keep talking. When you stop recording, the clipboard text should appear in your transcript at the moment you copied it, not appended at the end.

We built this into Callipso's progressive recording pipeline. Copied text merges into your voice transcript at the correct chronological position, based on elapsed recording time.

The Problem

Callipso records voice in progressive 15-second chunks. Each chunk is sent to the STT backend (CoreML or Parakeet), transcribed, and appended to the running transcript. Clipboard events do not fit this model — they arrive instantly, have no duration, and belong to no chunk.

A naive approach would append clipboard text at the end or insert it at the nearest chunk boundary. Both lose the chronological relationship between what you said and what you copied. The agent receiving this text would see a disconnected symbol name instead of a naturally embedded reference.

The Architecture

Cmd+C keypress (any app)
  → VoiceRouter.onCmdC() marks timestamp
  → Clipboard change detected within 100ms
  → VoiceRouter emits 'cmd-c-update' with text
  → EventBridge forwards via IPC to recording window
  → Recording window captures elapsedTimeSec
  → progressiveChunksUI stores { text, elapsedTimeSec, beforeChunkIndex }
  → On finalization: reconstructDisplayText() merges at correct word position

Detection. The VoiceRouter watches for Cmd+C keypresses and clipboard changes. If a clipboard change arrives within 100ms of a Cmd+C press, it is classified as a manual copy — preventing the system from re-ingesting its own STT output.
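A minimal sketch of that classification window (identifiers like `lastCmdCAt` and `onClipboardChange` are illustrative, not Callipso's actual API):

```javascript
// Sketch of the 100ms classification window described above.
// Names here are assumptions for illustration only.
const CMD_C_WINDOW_MS = 100;
let lastCmdCAt = 0;

function onCmdC() {
  lastCmdCAt = Date.now();
}

function onClipboardChange(text) {
  // Treat the change as a manual copy only if it lands within 100ms
  // of a Cmd+C press; otherwise ignore it (e.g. the system's own
  // programmatic clipboard writes, such as STT output).
  if (Date.now() - lastCmdCAt <= CMD_C_WINDOW_MS) {
    return { type: 'cmd-c-update', text };
  }
  return null;
}
```

The key property is that clipboard changes with no recent Cmd+C press are dropped, which is what stops the pipeline from re-ingesting its own output.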

Timestamping. The recording window captures elapsedTimeSec (seconds since recording started) at the moment the clipboard event arrives. This is the only timing signal we need.

Storage. Each insertion is stored as { text, elapsedTimeSec, beforeChunkIndex } and held until finalization.

Time-Proportional Positioning

At finalization, for each completed chunk we find all clipboard insertions within that chunk's time range and split the transcribed text at the proportional word position:

fraction = (elapsedTimeSec - chunkStartSec) / chunkDurationSec
splitWordIndex = round(fraction * words.length)
javascript
// insertionsInChunk is sorted by elapsedTimeSec (see "Rapid successive copies").
for (const ins of insertionsInChunk) {
    const fraction = (ins.elapsedTimeSec - chunkStartSec) / chunkDurationSec;
    const splitWordIndex = Math.min(
        Math.round(fraction * words.length),
        words.length
    );

    // Emit the words spoken before this insertion, then the clipboard text.
    if (splitWordIndex > lastWordIndex) {
        parts.push(words.slice(lastWordIndex, splitWordIndex).join(' '));
    }
    parts.push(clipText); // ins.text, wrapped per the separator mode below
    lastWordIndex = splitWordIndex;
}

// Append any words spoken after the final insertion in this chunk.
if (lastWordIndex < words.length) {
    parts.push(words.slice(lastWordIndex).join(' '));
}

We split on word boundaries rather than characters because STT output is not character-stable (punctuation and capitalization vary between runs), and clipboard text is semantically a word-level insertion — you want it between words, not splitting one in half.
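To make the arithmetic concrete, here is the formula applied in isolation to a copy at t=7.2s into a 15-second chunk (a standalone sketch; `splitIndex` is a hypothetical helper, not the production function):

```javascript
// Standalone worked example of the proportional split.
function splitIndex(elapsedTimeSec, chunkStartSec, chunkDurationSec, words) {
  const fraction = (elapsedTimeSec - chunkStartSec) / chunkDurationSec;
  // Clamp so a late insertion never indexes past the last word.
  return Math.min(Math.round(fraction * words.length), words.length);
}

// 12 transcribed words, copy at t=7.2s into a 15s chunk:
// fraction = 7.2 / 15 = 0.48, round(0.48 * 12) = 6 -> insert before word 7.
const words = 'I need to refactor the function because it handles many edge cases'.split(' ');
const idx = splitIndex(7.2, 0, 15, words);
// idx === 6
```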

Seamless vs Newline Mode

Callipso offers two separator modes that control how clipboard text appears in the final transcript. The mode is stored per-recording and applied at merge time, so you can switch before finalizing.

Seamless mode inlines the clipboard text directly into the speech flow. The result reads as one continuous stream — ideal when the copied text is a symbol name or short reference that belongs inside a sentence.

Newline mode wraps the clipboard text with double newlines. This visually separates it from the surrounding speech — useful when the copied text is a multi-line code block, an error message, or anything that should stand apart.

Here is the same recording with both modes. The user says "I need to refactor," copies a function name at t=7.2s, then continues "because it handles too many edge cases."

Seamless mode output:

I need to refactor the calculateShippingCost function because it
handles too many edge cases

Newline mode output:

I need to refactor the

calculateShippingCost

function because it handles too many edge cases

And with a multi-line clipboard (the user copies an error stack at t=22s while describing the bug):

Seamless mode output:

The service crashes on startup with TypeError: Cannot read property
'port' of undefined at server.ts:42 at bootstrap.ts:15 and I think
it is because the config object is not initialized

Newline mode output:

The service crashes on startup with

TypeError: Cannot read property 'port' of undefined
    at server.ts:42
    at bootstrap.ts:15

and I think it is because the config object is not initialized

For short symbol names, seamless reads naturally. For multi-line content, newline mode preserves structure. The implementation is a single branch at merge time:

javascript
const clipText = (separatorMode === 'newlines')
    ? `\n\n${ins.text}\n\n`
    : ins.text;
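Applied to the short-symbol example above, the branch behaves like this (a standalone sketch; `mergeAt` is a hypothetical helper, and the real merge may normalize whitespace around the newline separators):

```javascript
// Sketch of the separator branch applied at a word-level split point.
// mergeAt is an illustrative helper, not Callipso's API.
function mergeAt(words, index, insText, separatorMode) {
  const clipText = (separatorMode === 'newlines')
    ? `\n\n${insText}\n\n`
    : insText;
  return [...words.slice(0, index), clipText, ...words.slice(index)].join(' ');
}

const words = 'I need to refactor the function'.split(' ');
console.log(mergeAt(words, 5, 'calculateShippingCost', 'seamless'));
// I need to refactor the calculateShippingCost function
```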

Edge Cases

Insertions after all chunks. Cmd+C during the "tail audio" window (after the last chunk was sent, before stop) falls after lastChunkEndSec. These are appended at the end — they genuinely belong there.

Rapid successive copies. Three Cmd+C's in two seconds each get distinct elapsedTimeSec values and are positioned independently. The sort-then-iterate approach handles this without special casing.

Chunk overlap dedup. Progressive recordings use a 2-second audio overlap between chunks to prevent word-splitting at boundaries. Deduplication runs before clipboard interleaving, so the two systems never interfere.
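The sort-then-iterate shape, together with the tail-audio case, can be sketched end to end (assumed data shapes, not the production code):

```javascript
// Sketch: order insertions chronologically, then split them into
// those positioned inside chunks and those appended after the last
// chunk (the "tail audio" case). Field names are assumptions.
function placeInsertions(insertions, lastChunkEndSec) {
  const sorted = [...insertions].sort((a, b) => a.elapsedTimeSec - b.elapsedTimeSec);
  return {
    inChunks: sorted.filter(i => i.elapsedTimeSec <= lastChunkEndSec),
    tail: sorted.filter(i => i.elapsedTimeSec > lastChunkEndSec),
  };
}

const { inChunks, tail } = placeInsertions(
  [{ text: 'b', elapsedTimeSec: 31 }, { text: 'a', elapsedTimeSec: 7.2 }],
  30 // last chunk ended at t=30s
);
// inChunks holds 'a' (t=7.2s); tail holds 'b' (t=31s), appended at the end.
```

Because sorting makes positions monotonic, rapid successive copies fall out of the same loop with no special casing.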

Observability

When clipboard interleaving breaks, a ring buffer debug module tracks every event through the full chain. The dev harness exposes it:

bash
curl http://localhost:3110/dev/clipboard-progressive | jq

Returns insertion count, chunk count, separator mode, and the last N clipboard events with elapsed times.
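A ring buffer of this shape is small enough to sketch (illustrative; the real module's fields and capacity may differ):

```javascript
// Illustrative fixed-capacity ring buffer for clipboard debug events.
// Field names and capacity are assumptions, not the real schema.
class ClipboardDebugRing {
  constructor(capacity = 50) {
    this.capacity = capacity;
    this.events = [];
  }
  push(event) {
    this.events.push({ ...event, at: Date.now() });
    if (this.events.length > this.capacity) this.events.shift(); // evict oldest
  }
  snapshot() {
    return { count: this.events.length, events: [...this.events] };
  }
}

const ring = new ClipboardDebugRing(3);
['a', 'b', 'c', 'd'].forEach(text => ring.push({ text }));
// Capacity 3: the oldest event ('a') has been evicted.
```

A fixed capacity keeps the debug module allocation-bounded no matter how long a recording runs.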

What We Learned

Elapsed time is the only reliable signal. We considered chunk indices, sample counts, and system timestamps. Elapsed time from recording start is monotonic, easy to capture, and maps cleanly to the chunk timeline.

Proportional positioning is good enough. The fraction-based approach assumes uniform speech rate within a chunk. For 15-second chunks, the error is at most 2-3 words. For giving an AI agent context, that is more than sufficient.

Capture at event time, merge at finalize time. Early prototypes inserted clipboard text immediately. This conflicted with chunk deduplication — the overlap pass sometimes removed clipboard text matching words in the overlap region. Separating capture from merge eliminated the interaction entirely.

What Is Next

Two areas. First, multi-modal context — screenshots, file paths, and editor selections, not just text. The timestamped-insertion model generalizes to any discrete event during recording. Second, word-level timestamps from CoreML for exact placement instead of proportional estimation.

For now, clipboard interleaving handles the common case: you are talking to your terminal, you grab a symbol name or error message, and it shows up in the right place.

Recording your voice and your clipboard as a single stream.
