
Same Model, 3x Faster: CoreML vs MLX for On-Device Speech Recognition

Same Parakeet 0.6B neural network. Two completely different execution paths. One runs through Python + HTTP + GPU. The other is a Swift binary on the Neural Engine. Here are the numbers.

Callipso Team · March 2, 2026 · 8 min read


Same neural network. Same weights. Same 6.34% word error rate. Same 25 languages. NVIDIA's Parakeet TDT 0.6B v3 model, running entirely on your Mac, with zero cloud dependency.

The difference is everything around the model.

One path goes through Python, HTTP, disk I/O, and the GPU. The other is a Swift binary that talks directly to the Neural Engine. We shipped both in Callipso and measured them on real recordings. The results are not subtle.

The MLX Pipeline: Count the Hops

When you record on the MLX backend, here is what happens:

MICROPHONE
  → AVAudioEngine captures 16kHz mono PCM
  → Audio written to WAV file in /tmp
  → WAV file read back, base64-encoded
  → HTTP POST to Python server (localhost:5001)
  → Python Flask server receives request
  → Decodes base64 → loads audio tensor
  → MLX inference on Apple Silicon GPU
  → JSON response over HTTP
  → Electron parses response
  → Text delivered to terminal

Count them. File write, file read, base64 encode, HTTP serialize, HTTP deserialize, base64 decode, tensor load. Seven hops between "audio captured" and "inference starts." Each one is fast. Together they add up.
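Those hops can be sketched in a few lines of stdlib Python. This is an illustration of the serialization ceremony, not Callipso's actual protocol; the field names and in-memory WAV (the real path touches /tmp) are assumptions:

```python
import base64
import io
import json
import wave

def serialize_chunk(pcm16: bytes, sample_rate: int = 16000) -> bytes:
    """Client side, hops 1-4: WAV framing, read-back, base64, HTTP body."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:               # hop 1: WAV write (a real file in the MLX path)
        wav.setnchannels(1)
        wav.setsampwidth(2)                         # 16-bit mono PCM
        wav.setframerate(sample_rate)
        wav.writeframes(pcm16)
    wav_bytes = buf.getvalue()                      # hop 2: read the file back
    b64 = base64.b64encode(wav_bytes).decode()      # hop 3: base64 encode (~33% size inflation)
    return json.dumps({"audio": b64}).encode()      # hop 4: serialize the HTTP request body

def deserialize_chunk(body: bytes) -> bytes:
    """Server side, hops 5-7: parse JSON, decode base64, recover PCM for the tensor load."""
    payload = json.loads(body)                      # hop 5: HTTP deserialize
    wav_bytes = base64.b64decode(payload["audio"])  # hop 6: base64 decode
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.readframes(wav.getnframes())     # hop 7: PCM ready for tensor conversion

# 100ms of silence survives the round trip byte-for-byte: the hops add
# latency and allocation, not information.
pcm = b"\x00\x00" * 1600
assert deserialize_chunk(serialize_chunk(pcm)) == pcm
```

The round trip is lossless by construction, which is exactly the point: every hop is pure overhead.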

The Python server itself is a permanent process. It cold-starts in ~7 seconds (loading the model into GPU memory), then stays resident at ~200MB RAM. A supervisor watches it — health checks, watchdog timers, restart logic. If it crashes, the supervisor spawns a new one and waits for it to warm up.
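The supervisor's job reduces to a loop like the following. This is a hedged sketch, not Callipso's actual supervisor: the entry point, health endpoint, and intervals are illustrative:

```python
import subprocess
import time
import urllib.request

SERVER_CMD = ["python3", "mlx_server.py"]    # hypothetical server entry point
HEALTH_URL = "http://localhost:5001/health"  # hypothetical health-check endpoint
WARMUP_SECONDS = 7                           # cold start: model load into GPU memory

def healthy() -> bool:
    """Probe the server; any connection error counts as unhealthy."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=1) as resp:
            return resp.status == 200
    except OSError:
        return False

def supervise() -> None:
    proc = subprocess.Popen(SERVER_CMD)
    time.sleep(WARMUP_SECONDS)               # wait out the ~7s cold start
    while True:
        if proc.poll() is not None or not healthy():
            proc.kill()                      # dead or wedged: replace it...
            proc = subprocess.Popen(SERVER_CMD)
            time.sleep(WARMUP_SECONDS)       # ...and pay the warm-up cost again
        time.sleep(5)                        # watchdog interval
```

Note the recovery cost: every restart re-pays the full model-load time before transcription is available again.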

The CoreML Pipeline: Count the Hops

MICROPHONE
  → AVAudioEngine captures at native sample rate
  → Resample to 16kHz mono (in-memory)
  → CoreML inference on Neural Engine
  → JSON line to stdout
  → Electron reads stdout
  → Text delivered to terminal

Audio stays in memory from capture to inference. No file. No HTTP. No serialization. No server process to manage.

The Swift binary is spawned on demand — there is no persistent server. Electron runs parakeet-coreml progressive-record, the binary captures audio and transcribes, emits JSON lines, and exits when done. No watchdog. No health checks. No restart logic. 6.4MB binary, in and out.
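The consuming side of a JSON-lines stdout protocol is correspondingly small. A sketch in Python (Callipso's actual reader lives in Electron; the event shapes follow the chunk/done format shown later in this post):

```python
import json
from typing import Iterable

def read_transcript(lines: Iterable[str]) -> str:
    """Assemble chunk events from the binary's stdout into the final text."""
    chunks: list[str] = []
    for line in lines:
        event = json.loads(line)
        if event["type"] == "chunk":
            chunks.append(event["text"])     # progressive: deliver each chunk as it arrives
        elif event["type"] == "done":
            # the binary exits after "done": nothing left to supervise
            return event.get("full_text", " ".join(chunks))
    return " ".join(chunks)                  # stream ended without a "done" event

stdout_lines = [
    '{"type":"chunk","index":0,"text":"fix the login bug"}',
    '{"type":"chunk","index":1,"text":"please terminal two"}',
    '{"type":"done","totalChunks":2,"full_text":"fix the login bug please terminal two"}',
]
assert read_transcript(stdout_lines) == "fix the login bug please terminal two"
```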

Why the Neural Engine Wins

Every Apple Silicon chip has three compute units: CPU, GPU, and ANE (Apple Neural Engine). They are physically separate silicon blocks on the same die.

The GPU is shared. Your IDE uses it. Your terminal uses it. Electron's compositor uses it. The 3D Space tab visualization uses it. When the MLX backend runs inference on the GPU, it competes with everything else for the same resource.

The Neural Engine is dedicated ML silicon. It sits idle until you speak. When CoreML dispatches the Parakeet model with computeUnits = .all, macOS routes the compute-heavy operations to the ANE. The GPU stays free. The CPU stays free. Power draw drops from ~5-10W (GPU inference) to ~1-2W (ANE inference).

This is not a software optimization. It is hardware arbitrage: using silicon that was sitting unused.

The Pipeline Comparison

| Dimension | MLX Backend | CoreML Backend |
|---|---|---|
| Language | Python 3 | Swift 5.9 |
| Runtime | MLX framework | CoreML + FluidAudio |
| Compute unit | GPU (shared) | Neural Engine (dedicated) |
| Server model | Persistent process (Flask) | Spawned on demand |
| Audio transport | WAV file → base64 → HTTP | In-memory PCM |
| Disk I/O | 1 WAV per chunk | Zero |
| HTTP round-trips | 5 per recording | Zero |
| Cold start | ~7,000ms (model load) | 0ms (binary spawns instantly) |
| RAM overhead | ~200MB (Python + model) | ~50MB (in-process) |
| Supervision | Watchdog + health checks + restart | None needed |
| Dependencies | Python, pip, MLX, Flask, ffmpeg | Single static binary |
| Binary size | ~50MB (Python env) | 6.4MB |

Real Timing Data

These numbers come from Callipso's built-in pipeline timing waterfall. Every recording generates a timing cascade that tracks each stage from hotkey press to text delivery. Here is one recording from today, an M1 Mac mini, CoreML backend:

HOTKEY PRESS ──────── 0ms
WINDOW SHOWN ──────── +0ms
AUDIO READY ────────── +36ms
RECORDER STARTED ──── +0ms
~~~ 9.6s speaking ~~~
STOP RECEIVED ──────── +9.6s
RECORDER STOPPED ──── +310ms
CHUNK 0 STT ────────── 258ms
FAST-PATH DELIVERY ── 25ms
TOTAL END-TO-END ──── 10.0s

The user spoke for 9.6 seconds. The CoreML backend transcribed the full chunk in 258ms. Fast-path delivery (binary stdout → Electron → terminal) added 25ms. Total overhead after the user stopped speaking: 593ms, most of which is the recorder stop latency, not inference.
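The overhead figure is just the sum of the post-speech stages, using the numbers from the cascade above:

```python
# Stages after the user stops speaking, in milliseconds (from the waterfall above)
recorder_stop = 310   # STOP RECEIVED -> RECORDER STOPPED
stt_inference = 258   # CHUNK 0 STT on the Neural Engine
delivery = 25         # fast-path: binary stdout -> Electron -> terminal

overhead_ms = recorder_stop + stt_inference + delivery
assert overhead_ms == 593

# The recorder stop alone is over half the total; inference is not the bottleneck.
assert recorder_stop / overhead_ms > 0.5
```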

Across 10 consecutive recordings on the same session:

| Metric | Value |
|---|---|
| Average STT inference | 265ms |
| Max STT inference | 334ms |
| Min STT inference | 212ms |
| Delivery path | Fast-path (all 10) |
| Delivery overhead | 20-30ms |

Every recording hit the fast-path — no fallback to clipboard, no HTTP upload, no intermediate file. Binary to Electron to terminal.

What About MLX?

The same model on the MLX backend, same hardware, same audio length:

| Metric | MLX | CoreML | Delta |
|---|---|---|---|
| 15s chunk inference | 638ms | ~97ms | 6.5x faster |
| First chunk (cold) | 1,934ms | ~97ms | 20x faster |
| Stop-to-text | 773ms | ~130ms | 6x faster |

The MLX numbers include HTTP round-trip time and file I/O. The CoreML numbers are pure inference plus stdout delivery. Some of the gap is the Neural Engine being faster. Most of it is eliminating the ceremony around the model.
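The deltas in the table follow directly from the raw timings; since the CoreML figures are approximate (~), the speedups are too:

```python
timings_ms = {
    # metric: (MLX, CoreML) -- CoreML values are approximate
    "15s chunk inference": (638, 97),
    "first chunk (cold)": (1934, 97),
    "stop-to-text": (773, 130),
}

for metric, (mlx, coreml) in timings_ms.items():
    print(f"{metric}: {mlx / coreml:.1f}x faster")

# Even with approximate CoreML figures, every metric clears roughly a 6x improvement.
assert all(mlx / coreml > 5.9 for mlx, coreml in timings_ms.values())
```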

What the Timing Cascade Does Not Show Yet

Honest gap: the timing waterfall does not tag which backend produced each recording. All 20 entries in today's session look structurally identical — same fields, same cascade format. You can infer the backend from the STT latency (sub-300ms is almost certainly CoreML), but there is no explicit label.

This is a gap to fix for proper A/B benchmarking. When we add a backend field to the timing entries, we can automate comparisons across hundreds of recordings instead of eyeballing latencies.

Same Accuracy, Fewer Moving Parts

Both backends run the same model: NVIDIA Parakeet TDT 0.6B v3. Same architecture. Same weights (converted to CoreML format for the Swift path). Same word error rate. Same 25-language support.

What CoreML eliminates is not accuracy — it is operational surface area:

  • No Python runtime. No pip, no virtualenv, no dependency conflicts.
  • No HTTP server. No Flask, no request parsing, no connection management.
  • No file I/O. No WAV serialization, no /tmp cleanup, no base64 encoding.
  • No server supervisor. No health checks, no watchdog timers, no restart logic.
  • No GPU contention. The Neural Engine is dedicated silicon.

Less code means fewer failure modes. The MLX backend has a supervisor that monitors the Python server and restarts it if it crashes. The CoreML backend does not need a supervisor because there is no server. The process starts, transcribes, and exits. If it fails, the next recording spawns a fresh one.

The Model Landscape

Three CoreML variants of Parakeet exist today:

| Model | Languages | WER | Parameters | Size |
|---|---|---|---|---|
| Parakeet CoreML v2 | English only | 6.05% | 0.6B | ~2.5GB |
| Parakeet CoreML v3 | 25 languages | 6.34% | 0.6B | ~2.5GB |
| Parakeet CoreML 110M | English only | Higher | 110M | ~400MB |

Callipso ships with v3 by default — same multilingual support as the MLX backend, slightly higher WER than the English-only v2. There is no small multilingual CoreML model yet. If you only need English and want the absolute best accuracy, v2 is marginally better.

All three models compile for your specific Neural Engine on first use (30-60 seconds). After that, every inference is under 100ms. The compiled model is cached permanently in ~/Library/Caches/.

Progressive Chunking Works on Both

This is the feature we care about most. Callipso's progressive chunking — where words appear in your terminal while you are still talking — works identically on both backends.

Every N seconds (configurable, default 15), the audio is cut and transcribed as an independent chunk. Each chunk gets a 2-second overlap buffer to prevent word splitting at boundaries. Results stream to the recording window in real time:

```json
{"type":"chunk","index":0,"text":"fix the login bug","confidence":0.94,"elapsed_ms":52}
{"type":"chunk","index":1,"text":"please terminal two","confidence":0.91,"elapsed_ms":48}
{"type":"done","totalChunks":2,"full_text":"fix the login bug please terminal two"}
```

The difference is in the plumbing. MLX chunks go through HTTP. CoreML chunks go through stdout. The recording window does not care — it receives the same progressive-audio-chunk IPC event either way.
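The chunk boundaries themselves are simple arithmetic. A sketch assuming the defaults above (15s chunks, 2s overlap); Callipso's actual cutter operates on live audio buffers, not precomputed durations:

```python
def chunk_bounds(total_s: float, chunk_s: float = 15.0, overlap_s: float = 2.0):
    """Yield (start, end) times in seconds. Each chunk after the first starts
    overlap_s before the cut, so a word straddling a boundary appears whole
    in the next chunk instead of being split."""
    bounds = []
    cut = 0.0
    while cut < total_s:
        end = min(cut + chunk_s, total_s)
        start = cut - overlap_s if cut > 0 else 0.0  # reach back past the cut
        bounds.append((max(0.0, start), end))
        cut = end
    return bounds

# A 35s recording becomes three chunks; chunks 1 and 2 reach back 2s.
assert chunk_bounds(35.0) == [(0.0, 15.0), (13.0, 30.0), (28.0, 35.0)]
```

Deduplicating the overlapping words at chunk boundaries is then a merge problem on the text side, independent of which backend produced the chunks.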

When to Use Which

Both backends are available in Callipso's Config tab. Switch with a radio button.

Use CoreML when:

  • You want the lowest latency (sub-300ms per chunk)
  • You are on macOS 14+ (Sonoma or later)
  • You want minimal resource usage (no persistent server)
  • You are running GPU-heavy workloads alongside Callipso (Space tab, IDE rendering)

Use MLX when:

  • You need the battle-tested path (months of production use)
  • You are on an older macOS version
  • You want the mature progressive pipeline with all edge cases handled
  • You are already running the Python environment for other tools

The model is the same. The accuracy is the same. The difference is how many layers sit between your voice and the Neural Engine.

Try It

CoreML Parakeet is available now. Open Config, select CoreML under STT Backend. The model downloads once (~2.5GB), compiles for your chip on first inference, and is cached permanently after that.

First inference: 30-60 seconds. Every inference after: under 100ms. No server to start. No Python to install. Just speak.
