
Why on-device ASR matters for meeting tools

6 min read · by the MeetPing team

The default architecture for an "AI meeting" tool in 2026 is to pipe audio to a cloud transcription API and let the server do the work. It's the path of least resistance — Whisper-as-a-service endpoints are cheap, the SDKs are clean, and you don't have to ship a 95 MB model bundle. We considered it for MeetPing for about an afternoon, then ruled it out. This post is the long version of why.

The latency case

The whole product is a live keyword alert. The latency budget is the gap between someone saying your name and the popover opening on your screen — anything more than a couple seconds is uncanny ("you knew about that twelve seconds ago?") and anything more than five is broken (the speaker has moved on, you're answering the wrong question).

Cloud ASR adds two real costs to that budget. First is the actual round-trip: ~80 ms within a region, ~200 ms cross-region, ~250 ms when the user is on coffee-shop wifi or a stadium mobile network. Second is the chunking strategy — most streaming endpoints buffer 100-300 ms of audio before sending, so the model is always a beat behind. On Parakeet TDT v3 running through FluidAudio on the Apple Neural Engine, the first partial lands in ~1.4 s of wall-clock time — and that number is constant. It does not depend on someone's hotel wifi.
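To make the budget concrete, here's a back-of-envelope comparison using the figures above. The ~1 s of model-side time on the cloud path is an illustrative assumption (added so both paths include inference), not a measured number for any particular provider:

```python
BUDGET_S = 2.0  # "a couple seconds, tops"

def cloud_first_partial(rtt_s: float, chunk_buffer_s: float,
                        inference_s: float = 1.0) -> float:
    """Time from speech to first partial via a cloud endpoint:
    client-side chunk buffering + network round-trip + model time."""
    return chunk_buffer_s + rtt_s + inference_s

# On-device: first partial is constant regardless of network.
ON_DEVICE_FIRST_PARTIAL_S = 1.4

good_network = cloud_first_partial(rtt_s=0.080, chunk_buffer_s=0.1)
bad_network = cloud_first_partial(rtt_s=0.250, chunk_buffer_s=0.3)

print(f"cloud, good network: {good_network:.2f}s of {BUDGET_S}s budget")
print(f"cloud, bad network:  {bad_network:.2f}s of {BUDGET_S}s budget")
print(f"on-device (fixed):   {ON_DEVICE_FIRST_PARTIAL_S:.2f}s")
```

The point isn't that the cloud path always blows the budget — on a good network it doesn't — it's that the on-device number has no network term at all, so the tail of the distribution disappears.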

↳ pull quote

The latency budget for a live keyword alert is two seconds, tops. On a bad network, cloud ASR can spend a quarter of that on round-trip and buffering alone.

The privacy case

A meeting transcript is a strange artifact. It's not yours alone — half of it is what the other people in the call said. Most companies have policies (often unspoken, sometimes written) about where that audio is allowed to land. Cloud ASR providers will tell you they don't train on your data and they encrypt at rest. That is true, and it does not satisfy a security review.

Running Parakeet on the Apple Neural Engine means the audio buffers never leave the MeetPing process. They sit in RAM for thirty seconds, age out, and get overwritten. There is no DPA to sign. There is no SOC 2 question. v0.1 has no backend at all — pull the ethernet, the listener still works. We've had three IT teams approve MeetPing in one round of review because the answer to "where does the audio go" is "it doesn't."
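The "sit in RAM for thirty seconds, age out, get overwritten" behavior is just a bounded rolling window. A minimal sketch — in Python rather than the app's code, with an assumed 16 kHz mono capture; the names are illustrative, not MeetPing's actual implementation:

```python
from collections import deque

SAMPLE_RATE = 16_000   # assumed mono 16 kHz capture (illustrative)
WINDOW_S = 30          # samples older than this age out automatically

class RollingAudioWindow:
    """Holds at most the last 30 s of audio samples in memory.
    Older samples are evicted and overwritten, never persisted."""

    def __init__(self):
        # deque with maxlen drops the oldest entries on overflow.
        self._buf = deque(maxlen=SAMPLE_RATE * WINDOW_S)

    def push(self, samples):
        # Extending past maxlen silently evicts the oldest samples.
        self._buf.extend(samples)

    def snapshot(self):
        # Chronological copy of whatever is still inside the window.
        return list(self._buf)

    def seconds_buffered(self) -> float:
        return len(self._buf) / SAMPLE_RATE
```

The property that matters for the security review is structural: there is no code path that writes `_buf` to disk or a socket, so "where does the audio go" is answered by the type, not by a policy document.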

The IT-team approval case (which is really the privacy case in a suit)

For a product priced at $24.90 lifetime, individual users are the buyers. But individual users on work laptops have IT teams who own the install policy. The cheapest way to make a tool un-buyable for that audience is to fail a security review, and the cheapest way to fail a security review is to send customer audio to a third party. We spend less time arguing with IT than any cloud-ASR competitor does. This is mostly an accident of architecture.

The battery case

The Apple Neural Engine is fast and almost weirdly power-efficient. On an M1 Pro running a 90-minute Zoom call, MeetPing adds about 3-5% additional CPU on the main cores (mostly the audio plumbing, not the model itself) and a couple of percent on the ANE. Cloud ASR avoids the ANE cost but has to keep an audio upload connection live for the whole call, which, on macOS over sketchy wifi, looks more like 5-8% battery drain. Same ballpark, but the on-device version does not get worse when your wifi is bad.

What we lose

On-device is not free. We pay for it in three places. The .app is bigger (~95 MB instead of ~20 MB). We can't ship a model upgrade without an app update — though Sparkle handles those updates smoothly. And we can't pick from the bleeding edge of frontier models; we're constrained to what fits on the ANE today, which means Parakeet TDT v3 (excellent for this) instead of, say, a hypothetical large multilingual model running on a cloud GPU.

For our shape of product the trade is obviously correct. Read "Running Parakeet TDT v3 on Mac with FluidAudio" for the engineering details, or the on-device privacy feature page for what this means in practice.

The take

On-device ASR on Apple Silicon is no longer a compromise. Five years ago "local Whisper" meant CPU-pinned, slow, and single-language. In 2026 it means Parakeet TDT v3 on the ANE, twenty-five languages, sub-twenty-millisecond chunks, and a build that clears IT review on the first pass. If you're shipping a meeting tool that doesn't strictly need the cloud, don't put one between you and your users.

See the architecture in production.

MeetPing runs the whole pipeline on your Mac. v0.1 has zero outbound network calls. $24.90 lifetime.