
Running Parakeet TDT v3 on Mac with FluidAudio

9 min read · by the MeetPing team

MeetPing's listener runs Parakeet TDT v3 through FluidAudio, a Swift package that wraps the CoreML build of NVIDIA's Parakeet model. This is the writeup of what we measured, what bit us, and what we'd do differently.

The shape of the model

Parakeet TDT (Token-and-Duration Transducer) is a streaming ASR architecture: you feed it audio chunks, it emits tokens plus a per-token duration, and you get a partial transcript that gets confirmed (or revised) as more audio arrives. v3 covers 25 European languages with a single model — no per-language packs to download. The CoreML conversion in FluidAudio targets the Apple Neural Engine, which on M-series silicon is dramatically faster than CPU and does not steal cycles from the audio pipeline.
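To make the output shape concrete, here's a minimal sketch of what a token-and-duration stream looks like from the consumer's side. The type names are ours for illustration, not FluidAudio's API:

```swift
import Foundation

// Hypothetical types for illustration, not FluidAudio's actual API.
// A TDT model emits each token together with a duration, so the decoder
// can skip ahead in the audio instead of scoring every frame.
struct EmittedToken {
    let text: String            // subword piece, e.g. "Octo"
    let duration: TimeInterval  // audio time this token accounts for
    let isConfirmed: Bool       // false while still inside the hypothesis window
}

// A partial transcript is just the running token sequence; unconfirmed
// tokens at the tail may be revised when the next chunk arrives.
func render(_ tokens: [EmittedToken]) -> String {
    tokens.map(\.text).joined()
}
```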

For a keyword-alert tool the streaming property is the whole game. We don't care about offline batch transcription; we care about confirmed tokens landing as fast as possible after they're spoken.

Latency profile

The streaming config we ship in v0.1 — taken straight from our `StreamingAsrConfig` and the FluidAudio defaults — is:

chunk window               3.0 s
hypothesis chunk           1.0 s
left context               1.0 s
right context              0.5 s
min context for confirm    2.0 s
confirmation threshold     0.6
expected first-partial     ~1.4 s on M1+ (chunk + plumbing)

The ~1.4-second first-partial budget is the gap between "speaker started talking" and "first text appears in the rolling buffer." For a live alert tool that's the spec the rest of the pipeline has to live inside — tight enough that the popover opens before the speaker has moved on to the next thought. Concrete WER and per-chunk inference numbers will get a post-launch benchmark addendum once we have a real corpus.
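In code the config is a handful of durations plus a threshold. Here's a sketch of how we hold those values; the field names follow our own `StreamingAsrConfig`, but treat the exact shape as illustration and check FluidAudio's current API before copying:

```swift
import Foundation

// Sketch of our v0.1 streaming parameters as a value type. Field names
// mirror our StreamingAsrConfig; FluidAudio's actual config may differ.
struct StreamingAsrConfig {
    var chunkWindow: TimeInterval = 3.0           // audio fed per inference
    var hypothesisChunk: TimeInterval = 1.0       // re-evaluation cadence
    var leftContext: TimeInterval = 1.0           // audio kept before the window
    var rightContext: TimeInterval = 0.5          // lookahead after the window
    var minContextForConfirm: TimeInterval = 2.0  // audio needed before confirming
    var confirmationThreshold: Double = 0.6       // score needed to confirm a token
}
```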

↳ pull quote

~1.4 s first-partial budget. The Apple Neural Engine does what it says.

Chunking

Streaming Parakeet wants overlapping chunks: a 3-second window with 1-second hypothesis re-evaluation works well. Smaller windows (1 s with a 0.3 s hypothesis) trade accuracy for latency and produce a lot of revision noise: the model rewrites the last word it emitted on the next chunk, so you have to gate your downstream consumers (the keyword watcher, in our case) on confirmed tokens.

We learned the hard way to never run the keyword scan on partials. A keyword like "October" produced a partial of "octob..." that lit up a fuzzy match a quarter-second before the model decided the word was actually "October" — we pinged on a phantom. Confirmed-only is the rule.
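The fix is a one-line gate in front of the watcher. A sketch, with hypothetical stand-in types for our transcript stream:

```swift
// Sketch of the confirmed-only gate. TranscriptUpdate and KeywordWatcher
// are hypothetical stand-ins for our actual pipeline types.
struct TranscriptUpdate {
    let confirmedText: String  // stable; will not be revised
    let partialText: String    // volatile tail; may be rewritten on the next chunk
}

final class KeywordWatcher {
    private let keywords: [String]
    init(keywords: [String]) { self.keywords = keywords }

    func consume(_ update: TranscriptUpdate) {
        // Scan confirmed text only. Partials like "octob..." fire phantom
        // fuzzy matches a beat before the model settles on the word.
        let haystack = update.confirmedText.lowercased()
        for keyword in keywords where haystack.contains(keyword.lowercased()) {
            ping(keyword)
        }
    }

    private func ping(_ keyword: String) {
        print("keyword hit:", keyword) // real code opens the popover
    }
}
```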

The BNNS dylib trap

FluidAudio loads a BNNS dynamic library at runtime. Inside the App Sandbox, dyld can't resolve it because the lib path goes through DYLD_LIBRARY_PATH, which the sandbox blocks. Two options: ship with sandbox off (we did, in v0.1), or vendor the BNNS calls into a static library and rebuild without the dylib path. We might do the latter for v0.3.

If you're trying to ship a sandboxed app with FluidAudio, enable the Hardened Runtime entitlement com.apple.security.cs.allow-dyld-environment-variables ("Allow DYLD Environment Variables" in Xcode); that's the minimum that gets it to load. Even then we hit edge cases with the model file inside the .app bundle being read-only in a way the loader didn't like. The unsandboxed path was cheaper for v0.1.
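For reference, the relevant pair in the .entitlements plist looks like this; sandbox is shown on, which is exactly the combination we couldn't make reliable:

```xml
<!-- App.entitlements: the minimum to let dyld honor DYLD_LIBRARY_PATH
     under Hardened Runtime. Sandbox shown on for completeness. -->
<key>com.apple.security.app-sandbox</key>
<true/>
<key>com.apple.security.cs.allow-dyld-environment-variables</key>
<true/>
```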

Model size and load time

Parakeet TDT v3 in CoreML is ~95 MB. Cold load on first launch is ~1.8 s on M1 Pro. We kick the load off on a background queue before the first arm, so the model is warm by the time the user starts a meeting. Subsequent launches use the OS-level CoreML cache and are sub-300 ms.
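A sketch of that eager warm-up, with hypothetical names for the model wrapper; the point is just to get the load off the main thread at app start:

```swift
import Foundation

// Hypothetical stand-in for the FluidAudio model handle; names are ours.
struct AsrEngine {
    static func load() throws -> AsrEngine { AsrEngine() }
}

final class ModelWarmer {
    private let queue = DispatchQueue(label: "meetping.model-warmup", qos: .utility)
    private var engine: AsrEngine?

    // Called at app launch, well before the user arms a meeting.
    func warmUp() {
        queue.async { [weak self] in
            // Cold load is ~1.8 s on M1 Pro; cached relaunches are sub-300 ms.
            self?.engine = try? AsrEngine.load()
        }
    }
}
```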

What we'd change

Three things, in order:

  1. Vendor BNNS to drop the dylib environment variable and re-enable the sandbox. Mostly a build-system project, no model changes.
  2. Add a per-language warm-up so the first chunk in non-English meetings doesn't cost an extra ~80 ms. The model handles all 25 languages, but the encoder warm-up is language-specific.
  3. Look at TDT v4 when it ships. Early word points to better latency on uncommon proper nouns, which is exactly the case our phonetic-match layer currently papers over.

The take

Streaming Parakeet on the ANE is the right answer for live on-device meeting tools in 2026. The numbers are good enough to ship a product on, the architecture is clean enough to debug, and FluidAudio handles the worst parts of the CoreML conversion. The dylib loader is the one nasty surprise. Otherwise: highly recommended.

The user-side writeup of this same pipeline is on the on-device privacy feature page.

Try the listener that runs all this.

MeetPing ships Parakeet TDT v3 on the ANE, sub-2-second pings, zero outbound calls. $24.90 lifetime.