Why Every Desktop Voice App Wastes 280ms in Your Browser

Every desktop speech-to-text app that delivers text to your browser hits the same wall. The transcription finishes and the text is ready. Then the app writes it to the clipboard and simulates Cmd+V. We measured this path end-to-end at a ~280ms median per send, and it is paid on every send.

We just shipped a Chrome extension that eliminates it entirely.

The Clipboard Tax

Here is the standard delivery path for voice text reaching a browser text field:

STT Engine → clipboard.writeText() → osascript Cmd+V → OS paste event → DOM update

That osascript call alone takes 80-150ms. Add the clipboard write, the trip through the OS event loop, and the browser's handling of the paste event, and you are looking at ~280ms of dead time (median over 10 trials) after inference has already completed. With an STT engine that transcribes in 237ms, the delivery overhead is longer than the inference itself.
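For reference, the clipboard path on the desktop side comes down to two serial steps. The sketch below is a minimal, hypothetical version: `ClipboardDeps`, `writeText`, `runCommand`, and `pasteViaClipboard` are names introduced here for illustration (in an Electron main process they would wrap `clipboard.writeText` and a child-process call to `osascript`), and the AppleScript one-liner is the common way to synthesize Cmd+V through System Events.

```typescript
// Hypothetical dependency interface; in Electron these would wrap
// clipboard.writeText() and a child_process call to osascript.
interface ClipboardDeps {
  writeText(text: string): void;
  runCommand(cmd: string, args: string[]): Promise<void>;
}

// AppleScript that asks System Events to press Cmd+V in the frontmost app.
const PASTE_SCRIPT =
  'tell application "System Events" to keystroke "v" using command down';

// The "clipboard tax" path: overwrite the clipboard, then spawn osascript to
// simulate the paste. The page only sees the text after the OS delivers the
// synthetic key event and the browser handles the resulting paste event.
async function pasteViaClipboard(text: string, deps: ClipboardDeps): Promise<void> {
  deps.writeText(text); // clobbers whatever the user had copied
  await deps.runCommand("osascript", ["-e", PASTE_SCRIPT]); // process spawn: 80-150ms in the measurements above
}
```

Both steps run strictly after inference, which is why their cost stacks on top of the 237ms transcription rather than overlapping with it.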

This is not a Callipso-specific problem. It is a platform limitation. Any macOS app that delivers text to a browser via the clipboard pays this tax, and there is no way around it unless you skip the clipboard entirely.

The Fast Path

Callipso Voice Bridge is a companion Chrome extension that receives transcribed text directly from the Callipso desktop app over a local WebSocket connection and inserts it into the focused element via document.execCommand('insertText').

STT Engine → WebSocket (ws://localhost:3000/chrome) → service worker → content script → execCommand('insertText')

No clipboard. No simulated keystrokes. No OS paste event. The text goes from our Electron process to the browser's DOM in 1-5ms.
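The wire format between the desktop app and the extension is not spelled out here. A minimal sketch, assuming a hypothetical JSON protocol in which the desktop app sends an `insert` message with an `id`, and the extension answers with an `ack` carrying the same `id` and a success flag. `InsertMessage`, `AckMessage`, `parseMessage`, and `encodeInsert` are illustrative names, not Callipso's actual API.

```typescript
// Hypothetical wire format between the desktop app and the extension.
interface InsertMessage { type: "insert"; id: number; text: string; }
interface AckMessage { type: "ack"; id: number; ok: boolean; }
type BridgeMessage = InsertMessage | AckMessage;

// Parse and validate an incoming frame; anything malformed becomes null,
// so a bad frame never reaches the DOM.
function parseMessage(raw: string): BridgeMessage | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const m = data as Record<string, unknown>;
  if (typeof m.id !== "number") return null;
  if (m.type === "insert" && typeof m.text === "string") {
    return { type: "insert", id: m.id, text: m.text };
  }
  if (m.type === "ack" && typeof m.ok === "boolean") {
    return { type: "ack", id: m.id, ok: m.ok };
  }
  return null;
}

// Desktop side: serialize a finished transcription for the socket.
function encodeInsert(id: number, text: string): string {
  const msg: InsertMessage = { type: "insert", id, text };
  return JSON.stringify(msg);
}
```

Matching acks to sends by `id` is what lets the desktop side tell a late or failed insertion apart from a successful one.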

The architecture is minimal:

| Component | Role |
| --- | --- |
| `ChromeExtensionBridge` (Electron) | WebSocket server on `/chrome` path, single-connection model |
| Service worker (extension) | WebSocket client, relays messages to active tab |
| Content script (extension) | `execCommand('insertText')` with shadow DOM traversal |
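The table says the Electron side uses a single-connection model but not what happens when a second client connects. A minimal sketch, assuming newest-connection-wins semantics (for example, after the extension's service worker restarts and reconnects); rejecting the newcomer would be an equally valid policy. `SocketLike` and `SingleConnectionBridge` are hypothetical names, and the socket interface covers only the `send`/`close` methods a WebSocket exposes.

```typescript
// Minimal structural type covering what we need from a WebSocket.
interface SocketLike {
  send(data: string): void;
  close(code?: number, reason?: string): void;
}

// Holds at most one extension connection (assumed policy: the newest
// connection replaces the previous one).
class SingleConnectionBridge {
  private current: SocketLike | null = null;

  attach(socket: SocketLike): void {
    if (this.current && this.current !== socket) {
      this.current.close(1000, "replaced by newer connection");
    }
    this.current = socket;
  }

  // Only clear the slot if the closing socket is the active one; a stale
  // socket closing late must not disconnect its replacement.
  detach(socket: SocketLike): void {
    if (this.current === socket) this.current = null;
  }

  get connected(): boolean {
    return this.current !== null;
  }

  // Returns false when no extension is connected, telling the caller to
  // use the clipboard fallback instead.
  send(data: string): boolean {
    if (!this.current) return false;
    this.current.send(data);
    return true;
  }
}
```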

If the extension is not connected, the insertion fails, or no acknowledgment arrives within 500ms, the system automatically falls back to the standard Cmd+V path. The ack timeout caps the cost of a missing or stalled extension at 500ms, so voice delivery is never blocked.
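The fallback rule can be sketched as a race between the extension's ack and a 500ms timer. This is a minimal illustration, not Callipso's code: `deliver`, `sendViaExtension`, and `pasteViaClipboard` are hypothetical, with the two callbacks standing in for "send over the WebSocket and await the ack" and the Cmd+V path.

```typescript
type Route = "extension" | "clipboard";

// A promise that resolves to false after `ms`, plus a way to cancel the timer.
function timeout(ms: number): { promise: Promise<boolean>; cancel: () => void } {
  let handle: ReturnType<typeof setTimeout> | undefined;
  const promise = new Promise<boolean>((resolve) => {
    handle = setTimeout(() => resolve(false), ms);
  });
  return { promise, cancel: () => clearTimeout(handle) };
}

// Try the extension first. If it is not connected (null), reports failure,
// throws, or does not ack within ackTimeoutMs, paste via the clipboard.
async function deliver(
  text: string,
  sendViaExtension: ((text: string) => Promise<boolean>) | null,
  pasteViaClipboard: (text: string) => Promise<void>,
  ackTimeoutMs = 500,
): Promise<Route> {
  if (sendViaExtension) {
    const timer = timeout(ackTimeoutMs);
    try {
      const ok = await Promise.race([sendViaExtension(text), timer.promise]);
      if (ok) return "extension";
    } catch {
      // A send error is treated the same as a failed insertion.
    } finally {
      timer.cancel();
    }
  }
  await pasteViaClipboard(text);
  return "clipboard";
}
```

This sketch does not deal with an ack that arrives after the timeout; a production version would need to cancel or ignore that late insertion so the text does not land twice.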

Does Anyone Else Do This?

We searched and found no other tool that bridges a desktop voice app to the browser over WebSocket. The closest alternatives:

| Tool | Approach | Limitation |
| --- | --- | --- |
| Voice In | Browser Web Speech API | STT runs in-browser, lower quality than local models |
| Voicy | Browser-based STT | Same; no desktop engine integration |
| Clipboard Inserter | Polls clipboard, auto-pastes | Crude, pollutes clipboard, no WebSocket |
| native-inserter | Chrome Native Messaging | Built for Japanese text extraction, not voice |

The browser-based STT extensions are limited to the Web Speech API or their own cloud models. They cannot tap into local CoreML or Parakeet inference running on your Neural Engine. Callipso runs STT locally with models that transcribe 8 seconds of speech in 237ms, and the extension handles the last mile.

What You Get

~55x faster delivery. 1-5ms instead of ~280ms, a 56x speedup even at the 5ms upper bound. The text appears the moment inference completes.

No clipboard pollution. Your clipboard stays exactly as it was. Copy something, dictate into a text field, then paste, and your original clipboard content is still there.

Works everywhere. GitHub issues, Google Docs, Gmail compose, Slack, Notion, Monaco editors, shadow DOM components. The content script traverses shadow roots and falls back gracefully for standard inputs and textareas.
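The content script itself is not shown here. A minimal sketch of the two pieces described, under stated assumptions: follow focus through open shadow roots via `shadowRoot.activeElement`, try `document.execCommand('insertText')` first, and if that returns `false`, fall back to `setRangeText` on plain inputs and textareas. `FocusNode`, `TextControl`, `deepActiveElement`, and `insertText` are hypothetical structural stand-ins so the logic can run outside a browser; in an extension you would pass `document.activeElement` and `(t) => document.execCommand("insertText", false, t)`, assuming the document-level command reaches the focused element.

```typescript
// Structural stand-ins for the DOM pieces we touch (hypothetical names).
interface FocusNode {
  shadowRoot?: { activeElement: FocusNode | null } | null;
}
interface TextControl extends FocusNode {
  value: string;
  selectionStart: number | null;
  selectionEnd: number | null;
  setRangeText(replacement: string, start: number, end: number, selectMode?: "select" | "start" | "end" | "preserve"): void;
  dispatchEvent(event: Event): boolean;
}

// Follow focus into nested open shadow roots to the innermost focused element.
function deepActiveElement(root: FocusNode | null): FocusNode | null {
  let el = root;
  while (el?.shadowRoot?.activeElement) {
    el = el.shadowRoot.activeElement;
  }
  return el;
}

function isTextControl(el: FocusNode): el is TextControl {
  return typeof (el as TextControl).setRangeText === "function";
}

// Prefer execCommand('insertText'), which goes through the browser's editing
// pipeline (undo history, input events); otherwise splice into the selection.
function insertText(active: FocusNode | null, text: string, execInsert: (text: string) => boolean): boolean {
  const target = deepActiveElement(active);
  if (!target) return false;
  if (execInsert(text)) return true;
  if (isTextControl(target)) {
    const start = target.selectionStart ?? target.value.length;
    const end = target.selectionEnd ?? start;
    target.setRangeText(text, start, end, "end"); // caret ends after the insertion
    target.dispatchEvent(new Event("input", { bubbles: true })); // let frameworks see the change
    return true;
  }
  return false;
}
```

Closed shadow roots expose `shadowRoot` as `null`, so this traversal stops at their host; that is an inherent limit of the approach, not of the sketch.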

Zero data off-machine. The WebSocket connection is ws://localhost:3000. No cloud relay, no analytics, no tracking. The extension's only job is to insert text it receives from your own desktop.
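Binding the server to 127.0.0.1 keeps other machines out, and the desktop side can additionally refuse peers that are not the extension. A minimal sketch under assumptions: `isAllowedPeer`, `LOOPBACK`, and `EXPECTED_ORIGIN` are hypothetical, and the extension ID is a placeholder. With the `ws` package you would call it from the `connection` handler using `req.socket.remoteAddress` and `req.headers.origin`.

```typescript
// Hypothetical placeholder; the real value is the published extension's ID.
const EXPECTED_ORIGIN = "chrome-extension://<your-extension-id>";

// IPv4 and IPv6 loopback forms a Node socket may report.
const LOOPBACK = new Set(["127.0.0.1", "::1", "::ffff:127.0.0.1"]);

// Accept only same-machine peers presenting the extension's origin. Ordinary
// web pages can also open ws://localhost connections, but their Origin header
// is their own site, so this check keeps arbitrary tabs off the bridge.
function isAllowedPeer(remoteAddress: string | undefined, origin: string | undefined): boolean {
  if (!remoteAddress || !LOOPBACK.has(remoteAddress)) return false;
  return origin === EXPECTED_ORIGIN;
}
```

Browsers set the Origin header themselves, so a page cannot forge it; a native process on the same machine could, which is outside what this check protects against.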

How Can I Get It?

Callipso Voice Bridge is live on the Chrome Web Store. Install it, launch Callipso, and voice text will route directly to your browser with no clipboard and no paste delay.

The 280ms clipboard tax was always an annoyance. Now it is optional.