On-device vs cloud transcription, picked apart
The transcription stack you pick shapes the product on top of it. Cloud ASR — Deepgram, AssemblyAI, OpenAI's transcription endpoints — has been the default for five years because cloud GPUs run frontier models and the SDKs are tidy. On-device ASR on Apple Silicon was a curiosity for years and is suddenly competitive. This post is the side-by-side, with no sponsor.
The honest scoreboard
Cloud wins on three things: model size (you can run a 1.5B-parameter Whisper variant or a 6B model without trouble), language coverage (100+ in the major commercial models), and the ability to ship a model upgrade without an app update. On-device wins on four things: predictable latency, privacy posture, network independence, and total cost.
Everything below the surface is a trade between those two columns. Which side wins depends entirely on the shape of the product. We picked on-device for MeetPing; we would have picked cloud for a multilingual call-centre product.
Latency, in real numbers
On-device on the Apple Neural Engine (Parakeet TDT v3 via FluidAudio): first partial in ~1.4 s of wall-clock time, consistent across network conditions. Subsequent partials arrive every ~80 ms.
Cloud streaming endpoint: best case ~80 ms round-trip within a region plus 100-300 ms of server-side chunking, so the first partial lands in roughly 200-400 ms. That sounds like on-device loses, but those numbers are conditional on a good connection. Hotel wifi, a mobile network, or a coffee shop adds 200-500 ms unpredictably, and a dropped chunk forces a re-send.
For a post-meeting transcript, neither latency profile matters. For a live keyword alert, the variance is what kills cloud — a tool that fires inside two seconds at home and inside seven on a train is not actually a tool. See why on-device ASR matters for the longer version.
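The arithmetic behind those ranges can be sketched as a small model. This is a back-of-envelope illustration in Python, not a benchmark: the function and parameter names are made up for this example, the millisecond figures are the ones quoted above, and the re-send timeout is a purely hypothetical placeholder because no figure for it is given here.

```python
# Back-of-envelope model of time-to-first-partial, in milliseconds.
# Figures come from the text above; names and the re-send timeout are
# illustrative assumptions, not measurements or real SDK parameters.

ON_DEVICE_FIRST_PARTIAL_MS = 1400  # ~1.4 s, same on any network


def cloud_first_partial_ms(round_trip_ms, chunking_ms,
                           network_penalty_ms=0,
                           resends=0, resend_timeout_ms=0):
    """Round trip + server-side chunking + network penalty + re-sends."""
    return (round_trip_ms + chunking_ms + network_penalty_ms
            + resends * resend_timeout_ms)


# Good connection: 80 ms round trip, 100-300 ms of chunking.
good = (cloud_first_partial_ms(80, 100), cloud_first_partial_ms(80, 300))

# Poor connection: the same, plus 200-500 ms of added delay.
poor = (cloud_first_partial_ms(80, 100, 200),
        cloud_first_partial_ms(80, 300, 500))

# Hypothetical: two dropped chunks, each costing an assumed 1000 ms.
with_resends = cloud_first_partial_ms(80, 300, 500,
                                      resends=2, resend_timeout_ms=1000)

print("on-device:", ON_DEVICE_FIRST_PARTIAL_MS)
print("cloud, good network:", good)
print("cloud, poor network:", poor)
print("cloud, poor network + 2 re-sends:", with_resends)
```

The on-device figure is a single number, while the cloud figure is a range whose upper end depends on inputs you do not control. Under these assumptions the added network delay alone keeps cloud under a second; it is the re-sends that push the tail past the on-device figure.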
Cloud is faster on a good network and slower on a bad one. On-device is the same speed everywhere. Which one wins depends on whether your product can tolerate the variance.
Privacy and the IT-team layer
Cloud ASR providers will tell you they do not train on your data and they encrypt at rest. Both are true. Neither one is what the security review is actually asking. The question that determines whether the tool ships inside a company is: "where does the audio go?" If the answer is "to a third party," there is a DPA, a SOC 2 question, a regional-residency clause, and three weeks of back-and-forth with legal. If the answer is "nowhere: it stays on the laptop and ages out of RAM," there is no question to answer.
This is not theoretical. MeetPing has been through three IT reviews and cleared all three on the first round because the on-device posture removes the surface area the review was about. See the on-device privacy feature page for what the audit posture looks like in practice.
Cost, end-to-end
Cloud streaming ASR pricing in 2026 sits around $0.30 per audio hour for high-quality streaming endpoints, more for enterprise terms. For a single user in 20 hours of meetings per week, that is $6 a week: about $24/month at four weeks to the month, or roughly $290/year. For a 50-person team, north of $14k/year.
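The figures above follow from simple multiplication. A minimal sketch in Python, assuming four billable weeks per month (the assumption that reproduces the $24 figure) and using integer cents to avoid floating-point rounding; the function names are illustrative.

```python
# Cloud streaming ASR cost, in integer cents to avoid float rounding.
PRICE_CENTS_PER_AUDIO_HOUR = 30   # $0.30 per audio hour, from the text
HOURS_PER_WEEK = 20
WEEKS_PER_MONTH = 4               # assumption: reproduces the $24/month figure
MONTHS_PER_YEAR = 12


def annual_cost_cents(users=1):
    """Yearly cloud ASR spend for a number of users, in cents."""
    monthly = PRICE_CENTS_PER_AUDIO_HOUR * HOURS_PER_WEEK * WEEKS_PER_MONTH
    return monthly * MONTHS_PER_YEAR * users


def dollars(cents):
    """Format integer cents as a dollar string."""
    return f"${cents / 100:,.2f}"


print(dollars(annual_cost_cents() // MONTHS_PER_YEAR))  # one user, per month
print(dollars(annual_cost_cents()))                      # one user, per year
print(dollars(annual_cost_cents(users=50)))              # 50-person team
```

One user comes to $288 a year, which the text rounds to $290, and 50 users come to $14,400.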
On-device has a fixed cost: the ANE cycles you would have spent anyway. The marginal cost of one more meeting is zero. For a tool sold at $24.90 lifetime, on-device is the only path that makes the unit economics work without a monthly recurring component.
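One way to see why a lifetime price and a cloud backend do not mix is to ask how many audio hours of cloud ASR the purchase price would buy. A rough sketch, assuming the $0.30 per audio hour rate quoted above and ignoring every other cost (payment fees, taxes, support); the variable names are illustrative.

```python
# How much cloud ASR does a one-time purchase price pay for?
# Integer cents throughout; the rate is the figure quoted in the text.
LIFETIME_PRICE_CENTS = 2490        # $24.90 one-time purchase
PRICE_CENTS_PER_AUDIO_HOUR = 30    # $0.30 per audio hour
HOURS_PER_WEEK = 20                # same heavy-meeting user as above

# Audio hours the purchase price covers, then weeks at that usage level.
breakeven_hours = LIFETIME_PRICE_CENTS / PRICE_CENTS_PER_AUDIO_HOUR
breakeven_weeks = breakeven_hours / HOURS_PER_WEEK

print(f"{breakeven_hours:.0f} audio hours")
print(f"{breakeven_weeks} weeks at {HOURS_PER_WEEK} h/week")
```

Under these assumptions the whole purchase price is consumed by 83 audio hours, a little over four weeks for a user in 20 hours of meetings a week. After that point every additional hour would be a loss.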
What cloud is genuinely better at
On-device on Apple Silicon today is constrained to what fits on the ANE. That excludes the largest multilingual models and the very newest research models. If your product needs to transcribe Swahili, or you need diarisation on a 12-speaker call, or you want to swap to whatever the frontier ASR model is each month, cloud is the right call.
Cloud also wins on model-update agility. A cloud provider can roll out a new model on Tuesday and every customer gets it on Wednesday. On-device requires shipping an app update, which is straightforward with Sparkle but still requires the user to install something.
What on-device is genuinely better at
Five things that are hard to do over a network:
- Working with no internet. Cellular dead zones, airplanes, sketchy hotel wifi — the on-device pipeline does not care.
- Sub-two-second alerts. The latency variance on cloud streaming pushes the worst case above the usable threshold.
- Predictable battery. The ANE is power-efficient and constant. A long-lived audio upload connection on bad wifi is not.
- Passing IT review. No data leaves the laptop. No DPA. No vendor question.
- Fixed-price product economics. A one-time purchase is only viable if the marginal cost per user-hour is zero.
Which to pick
Decision rule we use when consulting on this: if the product needs to fire alerts in real time, run inside large companies, or sell as a one-time purchase, go on-device. If the product is a post-meeting record of truth, needs heavy multilingual coverage, or runs on a server you control anyway (call-centre ops, podcast pipelines), go cloud. The middle ground — a productivity tool for individuals — usually breaks toward on-device because the privacy posture sells the install. See MeetPing vs Granola for a worked example of the shape difference in the same category.
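The rule of thumb above can be written down as a small function. This is a sketch of the rule as stated, not a definitive procedure: the class, field, and function names are invented for illustration, and the "mixed" result for products that trip both sets of criteria is an assumption of this example, since the rule does not rank them.

```python
from dataclasses import dataclass


@dataclass
class ProductShape:
    # Criteria that point toward on-device
    realtime_alerts: bool = False
    runs_inside_large_companies: bool = False
    one_time_purchase: bool = False
    # Criteria that point toward cloud
    post_meeting_record: bool = False
    heavy_multilingual: bool = False
    runs_on_own_server: bool = False


def recommend(shape: ProductShape) -> str:
    on_device = (shape.realtime_alerts
                 or shape.runs_inside_large_companies
                 or shape.one_time_purchase)
    cloud = (shape.post_meeting_record
             or shape.heavy_multilingual
             or shape.runs_on_own_server)
    if on_device and cloud:
        return "mixed"        # assumption: the rule does not rank the two
    if cloud:
        return "cloud"
    # Either an on-device criterion matched, or nothing did; the text says
    # the middle ground usually breaks toward on-device.
    return "on-device"


print(recommend(ProductShape(realtime_alerts=True, one_time_purchase=True)))
print(recommend(ProductShape(heavy_multilingual=True, runs_on_own_server=True)))
```

The first call describes a live-alert tool sold once; the second describes a multilingual pipeline on servers you already run. Treat a "mixed" result as a prompt to weigh the criteria by hand.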
The take
The cloud-vs-on-device choice in 2026 is not a quality choice — both are very good. It is a product-shape choice. Pick the side whose strengths match the thing you are building. For MeetPing the answer was on-device and it would still be on-device if we restarted today.