Same Model, 3x Faster: CoreML vs MLX for On-Device Speech Recognition
Same Parakeet 0.6B neural network. Two completely different execution paths. One runs through Python + HTTP + GPU. The other is a Swift binary on the Neural Engine. Here are the numbers.
Same neural network. Same weights. Same 6.34% word error rate. Same 25 languages. NVIDIA's Parakeet TDT 0.6B v3 model, running entirely on your Mac, with zero cloud dependency.
The difference is everything around the model.
One path goes through Python, HTTP, disk I/O, and the GPU. The other is a Swift binary that talks directly to the Neural Engine. We shipped both in Callipso and measured them on real recordings. The results are not subtle.
The MLX Pipeline: Count the Hops
When you record on the MLX backend, here is what happens:
MICROPHONE
→ AVAudioEngine captures 16kHz mono PCM
→ Audio written to WAV file in /tmp
→ WAV file read back, base64-encoded
→ HTTP POST to Python server (localhost:5001)
→ Python Flask server receives request
→ Decodes base64 → loads audio tensor
→ MLX inference on Apple Silicon GPU
→ JSON response over HTTP
→ Electron parses response
→ Text delivered to terminal
Count them. File write, file read, base64 encode, HTTP serialize, HTTP deserialize, base64 decode, tensor load. Seven hops between "audio captured" and "inference starts." Each one is fast. Together they add up.
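The client side of that hop sequence can be sketched in a few lines. This is an illustration of the pattern, not Callipso's actual code: the field names and the WAV-in-/tmp detail are assumptions drawn from the diagram above, and the actual HTTP POST to localhost:5001 is left as a comment.

```python
import base64
import os
import tempfile
import wave

def build_transcribe_request(pcm_bytes: bytes, sample_rate: int = 16000) -> dict:
    """Reproduce the MLX backend's hop sequence: WAV write, file read-back,
    base64 encode, JSON body construction. Field names are illustrative."""
    # Hop 1: write captured PCM to a WAV file on disk
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    with wave.open(path, "wb") as w:
        w.setnchannels(1)           # mono
        w.setsampwidth(2)           # 16-bit PCM
        w.setframerate(sample_rate)
        w.writeframes(pcm_bytes)
    # Hop 2: read the file straight back into memory
    with open(path, "rb") as f:
        wav_bytes = f.read()
    os.remove(path)
    # Hop 3: base64-encode so the bytes survive JSON serialization
    encoded = base64.b64encode(wav_bytes).decode("ascii")
    # Hop 4: the body for the HTTP POST to the Flask server
    # (requests.post("http://localhost:5001/transcribe", json=body) would follow)
    return {"audio_b64": encoded, "sample_rate": sample_rate}
```

Note that the audio is materialized three times before inference even starts: once on disk, once as raw bytes, once as a base64 string roughly a third larger.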
The Python server itself is a permanent process. It cold-starts in ~7 seconds (loading the model into GPU memory), then stays resident at ~200MB RAM. A supervisor watches it — health checks, watchdog timers, restart logic. If it crashes, the supervisor spawns a new one and waits for it to warm up.
The CoreML Pipeline: Count the Hops
MICROPHONE
→ AVAudioEngine captures at native sample rate
→ Resample to 16kHz mono (in-memory)
→ CoreML inference on Neural Engine
→ JSON line to stdout
→ Electron reads stdout
→ Text delivered to terminal
Audio stays in memory from capture to inference. No file. No HTTP. No serialization. No server process to manage.
The Swift binary is spawned on demand — there is no persistent server. Electron runs parakeet-coreml progressive-record, the binary captures audio and transcribes, emits JSON lines, and exits when done. No watchdog. No health checks. No restart logic. 6.4MB binary, in and out.
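The consuming side of that protocol is just line-delimited JSON on a pipe. Here is a minimal sketch of the spawn-read-exit lifecycle (shown in Python for illustration; in Callipso this role is played by Electron). The stand-in child process mimics the binary's output format; the real invocation would be parakeet-coreml progressive-record.

```python
import json
import subprocess
import sys

def stream_json_lines(argv):
    """Spawn a transcriber process and yield one parsed event per stdout
    line. No server, no HTTP: the process lives only as long as the
    recording, then exits on its own."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, text=True)
    try:
        for line in proc.stdout:
            line = line.strip()
            if line:
                yield json.loads(line)
    finally:
        proc.wait()  # reap the child; nothing to supervise afterwards

# Stand-in child that emits a chunk event and a done event, mimicking
# the real binary's line-delimited JSON output.
fake = [sys.executable, "-c",
        'print(\'{"type":"chunk","index":0,"text":"hello"}\'); '
        'print(\'{"type":"done","totalChunks":1}\')']
events = list(stream_json_lines(fake))
```

The failure model falls out of the lifecycle: there is nothing to restart, because the next recording spawns a fresh process anyway.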
Why the Neural Engine Wins
Every Apple Silicon chip has three compute units: CPU, GPU, and ANE (Apple Neural Engine). They are physically separate silicon blocks on the same die.
The GPU is shared. Your IDE uses it. Your terminal uses it. Electron's compositor uses it. The 3D Space tab visualization uses it. When the MLX backend runs inference on the GPU, it competes with everything else for the same resource.
The Neural Engine is dedicated ML silicon. It sits idle until you speak. When CoreML dispatches the Parakeet model with computeUnits = .all, macOS routes the compute-heavy operations to the ANE. The GPU stays free. The CPU stays free. Power draw drops from ~5-10W (GPU inference) to ~1-2W (ANE inference).
This is not a software optimization. It is hardware arbitrage: using silicon that was otherwise sitting idle.
The Pipeline Comparison
| Dimension | MLX Backend | CoreML Backend |
|---|---|---|
| Language | Python 3 | Swift 5.9 |
| Runtime | MLX framework | CoreML + FluidAudio |
| Compute unit | GPU (shared) | Neural Engine (dedicated) |
| Server model | Persistent process (Flask) | Spawned on demand |
| Audio transport | WAV file → base64 → HTTP | In-memory PCM |
| Disk I/O | 1 WAV per chunk | Zero |
| HTTP round-trips | 5 per recording | Zero |
| Cold start | ~7,000ms (model load) | 0ms (binary spawns instantly) |
| RAM overhead | ~200MB (Python + model) | ~50MB (in-process) |
| Supervision | Watchdog + health checks + restart | None needed |
| Dependencies | Python, pip, MLX, Flask, ffmpeg | Single static binary |
| Binary size | ~50MB (Python env) | 6.4MB |
Real Timing Data
These numbers come from Callipso's built-in pipeline timing waterfall. Every recording generates a timing cascade that tracks each stage from hotkey press to text delivery. Here is one recording from today on an M1 Mac mini, using the CoreML backend:
HOTKEY PRESS ──────── 0ms
WINDOW SHOWN ──────── +0ms
AUDIO READY ────────── +36ms
RECORDER STARTED ──── +0ms
~~~ 9.6s speaking ~~~
STOP RECEIVED ──────── +9.6s
RECORDER STOPPED ──── +310ms
CHUNK 0 STT ────────── 258ms
FAST-PATH DELIVERY ── 25ms
TOTAL END-TO-END ──── 10.0s
The user spoke for 9.6 seconds. The CoreML backend transcribed the full chunk in 258ms. Fast-path delivery (binary stdout → Electron → terminal) added 25ms. Total overhead after the user stopped speaking: 593ms, most of which is the recorder stop latency, not inference.
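The 593ms figure is just the sum of the post-stop stages in the cascade, which makes the breakdown easy to check:

```python
# Post-stop overhead = every stage after STOP RECEIVED in the waterfall.
# Values are the ones from the cascade above.
stages_ms = {
    "recorder_stop": 310,   # RECORDER STOPPED
    "stt_inference": 258,   # CHUNK 0 STT
    "delivery": 25,         # FAST-PATH DELIVERY
}
overhead_ms = sum(stages_ms.values())   # 310 + 258 + 25 = 593

# The recorder stop alone is over half the total; inference is not the
# bottleneck once the model is this fast.
stop_share = stages_ms["recorder_stop"] / overhead_ms
```

Once inference drops under 300ms, the audio capture teardown becomes the thing worth optimizing next.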
Across 10 consecutive recordings on the same session:
| Metric | Value |
|---|---|
| Average STT inference | 265ms |
| Max STT inference | 334ms |
| Min STT inference | 212ms |
| Delivery path | Fast-path (all 10) |
| Delivery overhead | 20-30ms |
Every recording hit the fast-path — no fallback to clipboard, no HTTP upload, no intermediate file. Binary to Electron to terminal.
What About MLX?
The same model on the MLX backend, same hardware, same audio length:
| Metric | MLX | CoreML | Delta |
|---|---|---|---|
| 15s chunk inference | 638ms | ~97ms | 6.5x faster |
| First chunk (cold) | 1,934ms | ~97ms | 20x faster |
| Stop-to-text | 773ms | ~130ms | 6x faster |
The MLX numbers include HTTP round-trip time and file I/O. The CoreML numbers are pure inference plus stdout delivery. Some of the gap is the Neural Engine being faster. Most of it is eliminating the ceremony around the model.
What the Timing Cascade Does Not Show Yet
Honest gap: the timing waterfall does not tag which backend produced each recording. All 20 entries in today's session look structurally identical — same fields, same cascade format. You can infer the backend from the STT latency (sub-300ms is almost certainly CoreML), but there is no explicit label.
This is a gap to fix for proper A/B benchmarking. When we add a backend field to the timing entries, we can automate comparisons across hundreds of recordings instead of eyeballing latencies.
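Once the entries carry that label, the comparison is a one-liner grouping. A sketch of what the analysis could look like, assuming hypothetical backend and stt_ms field names (the current schema has neither):

```python
from collections import defaultdict

def mean_stt_by_backend(entries):
    """Group timing entries by an explicit backend tag and report mean
    STT latency per backend. Field names are a proposed schema, not the
    current timing-entry format."""
    buckets = defaultdict(list)
    for e in entries:
        buckets[e["backend"]].append(e["stt_ms"])
    return {b: sum(v) / len(v) for b, v in buckets.items()}

# Hypothetical tagged entries
entries = [
    {"backend": "coreml", "stt_ms": 258},
    {"backend": "coreml", "stt_ms": 272},
    {"backend": "mlx", "stt_ms": 638},
]
averages = mean_stt_by_backend(entries)
```

With hundreds of tagged recordings, this replaces the "sub-300ms is almost certainly CoreML" heuristic with an actual measurement.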
Same Accuracy, Fewer Moving Parts
Both backends run the same model: NVIDIA Parakeet TDT 0.6B v3. Same architecture. Same weights (converted to CoreML format for the Swift path). Same word error rate. Same 25-language support.
What CoreML eliminates is not accuracy — it is operational surface area:
- No Python runtime. No pip, no virtualenv, no dependency conflicts.
- No HTTP server. No Flask, no request parsing, no connection management.
- No file I/O. No WAV serialization, no /tmp cleanup, no base64 encoding.
- No server supervisor. No health checks, no watchdog timers, no restart logic.
- No GPU contention. The Neural Engine is dedicated silicon.
Less code means fewer failure modes. The MLX backend has a supervisor that monitors the Python server and restarts it if it crashes. The CoreML backend does not need a supervisor because there is no server. The process starts, transcribes, and exits. If it fails, the next recording spawns a fresh one.
The Model Landscape
Three CoreML variants of Parakeet exist today:
| Model | Languages | WER | Parameters | Size |
|---|---|---|---|---|
| Parakeet CoreML v2 | English only | 6.05% | 0.6B | ~2.5GB |
| Parakeet CoreML v3 | 25 languages | 6.34% | 0.6B | ~2.5GB |
| Parakeet CoreML 110M | English only | Higher | 110M | ~400MB |
Callipso ships with v3 by default — same multilingual support as the MLX backend, slightly higher WER than the English-only v2. There is no small multilingual CoreML model yet. If you only need English and want the absolute best accuracy, v2 is marginally better.
All three models compile for your specific Neural Engine on first use (30-60 seconds). After that, every inference is under 100ms. The compiled model is cached permanently in ~/Library/Caches/.
Progressive Chunking Works on Both
This is the feature we care about most. Callipso's progressive chunking — where words appear in your terminal while you are still talking — works identically on both backends.
Every N seconds (configurable, default 15), the audio is cut and transcribed as an independent chunk. Each chunk gets a 2-second overlap buffer to prevent word splitting at boundaries. Results stream to the recording window in real time:
{"type":"chunk","index":0,"text":"fix the login bug","confidence":0.94,"elapsed_ms":52}
{"type":"chunk","index":1,"text":"please terminal two","confidence":0.91,"elapsed_ms":48}
{"type":"done","totalChunks":2,"full_text":"fix the login bug please terminal two"}
The difference is in the plumbing. MLX chunks go through HTTP. CoreML chunks go through stdout. The recording window does not care — it receives the same progressive-audio-chunk IPC event either way.
When to Use Which
Both backends are available in Callipso's Config tab. Switch with a radio button.
Use CoreML when:
- You want the lowest latency (sub-300ms per chunk)
- You are on macOS 14+ (Sonoma or later)
- You want minimal resource usage (no persistent server)
- You are running GPU-heavy workloads alongside Callipso (Space tab, IDE rendering)
Use MLX when:
- You need the battle-tested path (months of production use)
- You are on an older macOS version
- You want the mature progressive pipeline with all edge cases handled
- You are already running the Python environment for other tools
The model is the same. The accuracy is the same. The difference is how many layers sit between your voice and the Neural Engine.
Try It
CoreML Parakeet is available now. Open Config, select CoreML under STT Backend. The model downloads once (~2.5GB), compiles for your chip on first inference, and is cached permanently after that.
First inference: 30-60 seconds. Every inference after: under 100ms. No server to start. No Python to install. Just speak.