I had a working prototype. A quantized Llama 3.2 3B model running through PyTorch's ExecuTorch runtime inside an iOS app I was building. On my MacBook it was beautiful. On a real iPhone it was a memorial service.
This is the story of why on-device AI is harder than the demos make it look, and why the app I'm working on ships with a different model than the one I started with.
Why On-Device At All
Cloud inference is a non-starter for a lot of apps. Anything that touches private data ("private" in the way users actually mean it, not in the way a privacy policy means it) shouldn't be shipping that data to a third party for "model improvement," logging, or a future breach. If the AI feature is going to exist at all, it has to run locally.
So I started where any reasonable person would start: PyTorch and an open-weight Llama.
PyTorch on a Laptop Is a Lie
Llama 3.2 3B is genuinely good. Three billion parameters is the sweet spot where a model can summarize unstructured text and handle light agent-like jobs without falling on its face. Through ExecuTorch with XNNPACK and SpinQuant or QAT+LoRA quantization, the weights drop to roughly 2GB on disk and the runtime footprint lands somewhere in the 2 to 3GB range depending on context length. The PyTorch team's own iOS demo proves it works.
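For a rough sense of where those figures come from, here is a back-of-envelope sketch. The parameter count, the 4-bit average weight width, the fp16 KV cache, and the layer and head shapes are my assumptions based on the public Llama 3.2 3B config, not measurements. Real exports typically keep some tensors at higher precision, which likely explains why the file on disk lands above the naive estimate.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed to store n_params weights at an average bit width."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes for keys and values across all layers at a given context length."""
    # The leading 2 covers one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


GB = 1e9

# Assumed values for illustration; check them against the checkpoint you ship.
N_PARAMS = 3.2e9
weights = weight_bytes(N_PARAMS, bits_per_weight=4)
kv = kv_cache_bytes(n_layers=28, n_kv_heads=8, head_dim=128, context_tokens=2048)

print(f"weights:  {weights / GB:.2f} GB")
print(f"kv cache: {kv / GB:.2f} GB at 2048 tokens")
print(f"total:    {(weights + kv) / GB:.2f} GB before runtime overhead")
```

The KV cache term grows linearly with context length, which is why the runtime footprint is a range rather than a single number.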
On paper, that fits on an iPhone 15 Pro with its 8GB of RAM. On paper.
Meet Jetsam
iOS does not let your app use whatever memory it wants. A kernel mechanism called jetsam tracks every process's memory footprint and, the moment one crosses its limit, kills it instantly to keep the system responsive. There is no warning dialog, no graceful shutdown, no second chance. Your user taps a button, the model loads, the OOM hammer comes down, and the app vanishes. As far as the user is concerned, your app crashed because it is broken.
The default ceiling sits around half of the device's physical RAM, and that's the optimistic read. On an 8GB iPhone 15 Pro you can realistically count on something like 3GB before jetsam shows up. A quantized 3B Llama eats nearly all of that just to load weights and a small KV cache, and that's before you account for the rest of the app: SwiftUI views, Core Data, image caches, the OS overhead that lives inside your process. The math works in isolation. It does not work in the context of a real app that also has to do other things.
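The budget problem can be sketched as simple subtraction. The 3GB ceiling and the 2 to 3GB model footprint come from the numbers above; the 0.5GB figure for the rest of the app is a hypothetical placeholder, and the function name is mine.

```python
def jetsam_headroom_gb(ceiling_gb: float, model_gb: float, app_gb: float) -> float:
    """Memory left under the ceiling; zero or negative means the app gets killed."""
    return ceiling_gb - (model_gb + app_gb)


CEILING_GB = 3.0  # realistic default budget on an 8GB phone, per the text
APP_GB = 0.5      # hypothetical: views, persistence, image caches

for model_gb in (2.0, 2.5, 3.0):  # runtime footprint range quoted earlier
    headroom = jetsam_headroom_gb(CEILING_GB, model_gb, APP_GB)
    verdict = "fits, barely" if headroom > 0 else "killed"
    print(f"model {model_gb:.1f} GB -> headroom {headroom:+.1f} GB ({verdict})")
```

Only the smallest footprint leaves any margin, and that margin disappears as soon as the context grows or the app caches a few more images.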
Apple does offer two escape hatches, both of them entitlements you have to formally request. com.apple.developer.kernel.increased-memory-limit raises the jetsam ceiling on supported devices, and com.apple.developer.kernel.extended-virtual-addressing gives the process a larger virtual address space ("jumbo mode") so you can map files bigger than the default limit allows. With both flipped on, an iPhone 15 Pro app can climb to roughly 6GB of usable memory before getting killed.
That gets the model and the rest of the app into the same process without immediately tripping jetsam. It does not get you to comfortable. Memory pressure on an 8GB phone is real, especially once another app is open in the background, and you are still gambling on every inference call. And the entitlements require Apple's approval, which they hand out for "core features" rather than "I'd really like to." A nice-to-have AI feature is a much harder sell than a video editor that fundamentally cannot work without it.
The Apple-Shaped Hole
While I was wrestling with entitlements, Apple was solving the same problem with the home-field advantage of designing both the chip and the OS. The Foundation Models framework, introduced at WWDC 2025, exposes Apple Intelligence's roughly 3-billion-parameter on-device model directly to third-party apps through a Swift API. No download, no entitlement paperwork, no hand-rolled tokenizer.
The interesting part is what Apple did to the model to make it actually fit. According to Apple's own 2025 tech report, the on-device model uses 2-bit quantization-aware training and a KV-cache sharing trick that splits the network into two blocks where the second reuses key-value caches from the first, cutting KV-cache memory use by about 37.5%. The result is a 3B model that lives inside the OS, gets paged in and out by the system, and never counts against your app's jetsam budget the way a self-shipped Llama would.
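The 37.5% figure falls out of simple layer counting, assuming the two blocks split the network's depth 5:3 and the second block stores no key-value cache of its own. That ratio is my reading of Apple's report, so treat it as an assumption. The 2-bit weight estimate is likewise a floor, since it pretends every one of the roughly 3 billion weights is stored at 2 bits.

```python
from fractions import Fraction


def kv_cache_saving(block1_layers: int, block2_layers: int) -> Fraction:
    """Fraction of KV-cache memory saved when block 2 reuses block 1's cache."""
    # Only block 1 layers keep their own cache, so block 2's share is the saving.
    return Fraction(block2_layers, block1_layers + block2_layers)


# Assumed 5:3 depth ratio between the two blocks.
saving = kv_cache_saving(block1_layers=5, block2_layers=3)
print(f"KV-cache saving: {float(saving):.1%}")

# Lower bound on weight storage if all ~3e9 weights were held at 2 bits.
weights_gb = 3e9 * 2 / 8 / 1e9
print(f"2-bit weights: about {weights_gb:.2f} GB")
```

Note that the saving applies to the KV cache only; the weights are shrunk separately by the quantization.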
The framework itself is the kind of thing you only get when the model and the runtime are designed together. A @Generable Swift macro lets you describe your output as a struct and the framework constrains the model's decoding to match it. Sessions keep the conversation transcript for you, though the context window is finite: when a session overflows it throws an error, and it is on you to start a fresh one, optionally seeded with a summary of the old. For a summarization-and-light-agent workload, it covers the use case in a few dozen lines of Swift instead of a few thousand lines of C++ glue and prayer.
The tradeoffs are real. You're locked to Apple platforms. You get the model Apple gives you, on the schedule Apple decides, with the safety guardrails Apple chose. If your product needs a specific fine-tune or a different architecture, none of this helps. But for an iOS app whose AI job description is "read text, write shorter text, occasionally pull a structured field out of it," the calculus is not close.
Choose Your Llama
The PyTorch route is not wrong. If Slipstream taught me anything, it's that there is no substitute for getting your hands on the actual stack and watching it fail in concrete ways. ExecuTorch is the right answer if you need to ship the same model on Android and iOS, if you care about a specific open-weight checkpoint, or if you have the entitlement story and the device floor to back it up. For cross-platform builders and anyone who wants to avoid betting their roadmap on a single vendor, it is still where I'd start.
For an iOS-only app in 2026, Apple Intelligence is the obvious choice. The model is good enough, the framework does the unglamorous parts for you, and most importantly, jetsam leaves you alone. I spent a month learning that the hard way so the app I'm working on could ship without crashing on the phones people actually own.
The prototype was beautiful on my laptop. The shipping app is beautiful on a phone. Those are not the same problem.