What changes on mobile
Web copilots can hide latency under streaming UX and predictable Wi-Fi. Mobile users are on intermittent cellular, with screen-on time measured in seconds and gestures that compete with the copilot's own UI. Every assumption you carry from web - token budgets, retry behavior, network optimism - needs revisiting.
We've shipped mobile copilots for field-services teams, financial-services apps, and consumer products. The patterns below come from that work - and from the failure modes we saw in the first year of trying to retrofit web copilots onto mobile shells.
- First-token budget ≤ 800ms
Below that threshold, responses feel instant. Above 1.5s, users tap away. Cellular adds 200–600ms per round-trip; cache, prefetch, and route low-stakes intents on-device.
- Offline fallback for top 20% of intents
Push the most common 20% of intents to an on-device classifier with templated answers. It works on a plane, in a basement, or on a degraded network, and hands off to the cloud when connectivity returns.
- On-device inference for routing + low-stakes generation
Apple Intelligence, Gemini Nano, and small open-weight models (Phi, Llama 3.2 1B) cover routing and short-form generation. Cloud is for the long tail.
- Native gesture coexistence
The copilot UI must not steal swipe-back, scroll-to-refresh, or keyboard return. Build with native primitives - not webviews - so the gestures compose.
- Streaming-aware battery
Long streams keep the radio on. Cap streaming responses; bias toward concise outputs on cellular; finish-and-disconnect rather than maintain idle connections. A capped-streaming sketch follows this list.
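A minimal sketch of that finish-and-disconnect pattern, assuming a line-oriented streaming HTTP endpoint and approximating tokens by whitespace (the cap and endpoint are illustrative):

```swift
import Foundation

func streamCapped(from url: URL, maxTokens: Int = 200) async throws -> String {
    // Stream the response incrementally instead of buffering the whole body.
    let (bytes, _) = try await URLSession.shared.bytes(from: url)
    var output = ""
    var tokens = 0
    for try await line in bytes.lines {
        output += line + "\n"
        tokens += line.split(separator: " ").count  // rough token proxy
        if tokens >= maxTokens {
            bytes.task.cancel()  // tear the connection down so the radio can sleep
            break
        }
    }
    return output
}
```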
First-token latency by network path. Source: Techimax mobile rollout telemetry, 6 customer apps, 2024–2026
| Network path | First-token latency (ms) |
|---|---|
| Wi-Fi (50+ Mbps) | 420 |
| 5G | 580 |
| LTE (good) | 740 |
| LTE (degraded) | 1240 |
| On-device (Phi-3 mini) | 110 |
Design-system parity isn't optional
On mobile, copilot surfaces share the screen with native components. Spacing, motion, type, and tap-target sizes need to match - otherwise the copilot reads as a third-party widget and trust drops.
Concretely: build copilot UI with the same SwiftUI / Compose primitives the rest of the app uses; use your color tokens; respect Dynamic Type; honor reduce-motion. Do this and the copilot reads as part of the product. Skip this and users uninstall.
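A sketch of what that looks like in SwiftUI - the padding value and tint here stand in for your design tokens, and nothing about the bubble is copilot-specific styling:

```swift
import SwiftUI

struct CopilotMessageView: View {
    let text: String
    // Respect the system reduce-motion setting instead of always animating.
    @Environment(\.accessibilityReduceMotion) private var reduceMotion

    var body: some View {
        Text(text)
            .font(.body)   // Dynamic Type scales this automatically
            .padding(12)   // swap in your spacing token
            .background(Color.accentColor.opacity(0.12),
                        in: RoundedRectangle(cornerRadius: 12))
            .frame(maxWidth: .infinity, minHeight: 44, alignment: .leading)  // 44pt tap target
            .animation(reduceMotion ? nil : .easeOut(duration: 0.2), value: text)
    }
}
```

The quality bar we hold these surfaces to: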
| Metric | Target | Why |
|---|---|---|
| First-token latency p50 | < 800ms | Below perceived-instant threshold |
| First-token latency p95 | < 2.5s | Long-tail tolerable on cellular |
| Stream complete p50 | < 4s | Average response < 200 tokens |
| Battery cost per session | < 0.4% / 60s session | Comparable to a video call segment |
| Crash-free sessions | > 99.9% | Native quality bar |
| Cold-start to first interaction | < 1.4s | Below app launch threshold; users abandon above 2s |
| Cellular data per session | < 200KB | Fair to users on metered plans |
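To keep the latency rows above honest in production, instrument them where the stream is consumed. A sketch using signposts - the subsystem and category names are illustrative:

```swift
import os

let signposter = OSSignposter(subsystem: "com.example.copilot", category: "latency")

func consume(_ stream: AsyncStream<String>) async -> String {
    var output = ""
    let interval = signposter.beginInterval("first-token")
    var sawFirstToken = false
    for await chunk in stream {
        if !sawFirstToken {
            // End the interval at the first chunk; p50/p95 roll up from these.
            signposter.endInterval("first-token", interval)
            sawFirstToken = true
        }
        output += chunk
    }
    return output
}
```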
On-device architecture: when local inference beats cloud
Apple Intelligence's Foundation Model (~3B parameters), Gemini Nano on Pixel and Galaxy devices, and small open-weight models (Phi-3 mini [6], Llama 3.2 1B/3B) handle a meaningful slice of mobile copilot workloads with sub-100ms first-token latency, no network round-trip, and zero per-call cost [1][2]. The trade-off: bounded reasoning, no real-time knowledge, no tool calling.
The pragmatic split we ship: route low-stakes intents (classification, short summaries, formatting, named-entity extraction, simple Q&A) on-device. Route long-tail and tool-using intents to cloud. The router itself can be a tiny on-device classifier - adding ~8ms of decision latency to save 600ms+ of cloud round-trip when the cloud isn't needed. A sketch of that router follows the table below.
| Intent class | On-device | Cloud | Reasoning |
|---|---|---|---|
| Classify / route user input | Yes | - | Low-stakes; latency-critical |
| Short-form rewrite (< 100 tokens) | Yes | - | Battery + offline win |
| Multi-step research | - | Yes | Needs tool calls + larger context |
| Document drafting | Hybrid | Yes | On-device draft; cloud refine |
| Translation | Yes | - | Apple/Gemini Nano handle major languages |
| Tool-calling action (refund, send) | - | Yes | Needs auth + audit + reliability |
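A minimal sketch of that router. In production the classifier would be a small Core ML or MediaPipe model; keyword heuristics keep the sketch self-contained, and the intent names mirror the table above:

```swift
import Foundation

enum Route { case onDevice, cloud }
enum IntentClass { case classifyInput, shortRewrite, research, drafting, translation, toolAction }

struct IntentRouter {
    // Stand-in for an on-device classifier; the heuristics are illustrative.
    func classify(_ utterance: String) -> IntentClass {
        let text = utterance.lowercased()
        if text.contains("refund") || text.contains("send") { return .toolAction }
        if text.contains("translate") { return .translation }
        if text.contains("research") || text.contains("compare") { return .research }
        if text.contains("draft") { return .drafting }
        return text.count < 80 ? .shortRewrite : .classifyInput
    }

    // The split from the table: low-stakes stays local; tool-using and
    // long-tail intents (and the cloud half of drafting) go remote.
    func route(_ utterance: String) -> Route {
        switch classify(utterance) {
        case .classifyInput, .shortRewrite, .translation: return .onDevice
        case .research, .drafting, .toolAction:           return .cloud
        }
    }
}

let router = IntentRouter()
router.route("translate this to Spanish")  // .onDevice
router.route("refund my last order")       // .cloud
```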
Designing for the offline-by-default user
Most mobile copilot research assumes connectivity. Our field-services and consumer-product engagements ship to users on the New York subway, in rural clinics, in basement parking garages. Offline isn't an edge case - it's a primary user state for the top 20% of intents.
What works: cache the user's last 30 days of activity for context, ship a 10–50MB on-device intent classifier, queue cloud-bound requests with idempotency keys when offline, and surface a clear "working offline" affordance so users know what they can and can't do. The pattern is borrowed from offline-first PWA work but applies cleanly to native [3].
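A sketch of the offline queue, assuming the backend dedupes on an Idempotency-Key header (the header name and request shape are assumptions, not a fixed API):

```swift
import Foundation

// Each queued request carries an idempotency key minted at enqueue time,
// so replays after reconnect are safe even if an earlier send landed.
struct QueuedRequest: Codable {
    let idempotencyKey: String
    let endpoint: URL
    let body: Data
    let enqueuedAt: Date
}

final class OfflineQueue {
    private var pending: [QueuedRequest] = []
    private let store: URL  // persisted so the queue survives app restarts

    init(storeURL: URL) {
        store = storeURL
        if let data = try? Data(contentsOf: store) {
            pending = (try? JSONDecoder().decode([QueuedRequest].self, from: data)) ?? []
        }
    }

    func enqueue(endpoint: URL, body: Data) {
        pending.append(QueuedRequest(idempotencyKey: UUID().uuidString,
                                     endpoint: endpoint, body: body, enqueuedAt: Date()))
        persist()
    }

    // Drain on reconnect; the server dedupes on the key, so a request that
    // was delivered but never acknowledged can be replayed safely.
    func drain(using session: URLSession = .shared) async {
        while let req = pending.first {
            var urlReq = URLRequest(url: req.endpoint)
            urlReq.httpMethod = "POST"
            urlReq.httpBody = req.body
            urlReq.setValue(req.idempotencyKey, forHTTPHeaderField: "Idempotency-Key")
            guard let (_, resp) = try? await session.data(for: urlReq),
                  (resp as? HTTPURLResponse)?.statusCode == 200
            else { return }  // treat any failure as still-offline; retry later
            pending.removeFirst()
            persist()
        }
    }

    private func persist() {
        try? JSONEncoder().encode(pending).write(to: store)
    }
}
```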
How intents resolve in production. Source: Techimax mobile rollout telemetry, 2025
| Handling path | Share of intents (%) |
|---|---|
| On-device sufficient | 41 |
| Cloud (cached context OK) | 32 |
| Cloud + live data needed | 19 |
| Tool-calling action | 8 |
Native vs cross-platform: where the breakage shows up
We ship in SwiftUI/Compose, React Native, and Flutter depending on the customer's existing stack. The honest answer: for primary copilot surfaces, native is meaningfully better; for secondary surfaces, cross-platform is fine. The breakage points in cross-platform are streaming text rendering (gesture conflicts), keyboard accessory bars, haptics, and on-device model integration.
Concrete patterns that survive cross-platform: chat list with markdown rendering, simple cancellation, basic streaming. Patterns that break: cursor-aware inline suggestions in native text fields, voice mode with low-latency interrupt, deep on-device model integration. If your copilot needs the latter, build native.
On mobile, the 95th-percentile cellular round-trip is the user experience. On-device handling for the top 20% of intents flattens that tail and saves the copilot from being uninstalled.
Voice and multimodal: the next mobile-first surface
As of 2026, voice-first copilot interactions are increasingly the default for hands-busy workflows (driving, field services, hospital floors). The engineering bar is higher than for text: low-latency interrupt handling, on-device wake-word, streaming audio in and out, sub-300ms perceived response. OpenAI's Realtime API and Google's bidirectional streaming both enable this; Anthropic's voice integration is following [4].
What ships: native AVAudioEngine / AudioRecord pipelines, server-side streaming over WebSockets or WebRTC, eval cases that include audio (transcription accuracy, refusal calibration on adversarial audio, latency budget). Don't bolt voice onto a chat UI - voice is its own surface with its own user expectations.
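A sketch of the capture half of such a pipeline on iOS - a mic tap via AVAudioEngine, raw PCM frames over a WebSocket. The server URL and wire format are assumptions, and a real pipeline adds echo cancellation, interrupt handling, and downlink playback:

```swift
import AVFoundation
import Foundation

final class VoiceUplink {
    private let engine = AVAudioEngine()
    private let socket: URLSessionWebSocketTask

    init(serverURL: URL) {
        socket = URLSession.shared.webSocketTask(with: serverURL)
    }

    func start() throws {
        socket.resume()
        let input = engine.inputNode
        let format = input.outputFormat(forBus: 0)
        // ~100ms buffers keep uplink latency low without flooding the radio.
        input.installTap(onBus: 0,
                         bufferSize: AVAudioFrameCount(format.sampleRate / 10),
                         format: format) { [weak self] buffer, _ in
            guard let self, let channel = buffer.floatChannelData else { return }
            let bytes = Data(bytes: channel[0],
                             count: Int(buffer.frameLength) * MemoryLayout<Float>.size)
            self.socket.send(.data(bytes)) { _ in }  // drop on error; next buffer follows
        }
        try engine.start()
    }

    // Barge-in: stop capture immediately when the user interrupts playback.
    func stop() {
        engine.inputNode.removeTap(onBus: 0)
        engine.stop()
        socket.cancel(with: .normalClosure, reason: nil)
    }
}
```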
References
- [1] Apple Intelligence developer docs - Apple (2025)
- [2] Gemini Nano on Android - Google (2025)
- [3] Offline-first design patterns - Google web.dev (2024)
- [4] OpenAI Realtime API documentation - OpenAI (2025)
- [5] HIPAA Security Rule guidance for mobile devices - HHS Office for Civil Rights (2024)
- [6] Phi-3 mini technical report - Microsoft Research (2024)
Frequently asked questions
Should we build native or React Native / Flutter?
Native (SwiftUI, Compose) for any app where the copilot is a primary surface - the gesture and design-system issues compound across cross-platform shells. RN / Flutter work for secondary surfaces; we ship in all three depending on the customer's existing stack.
How does Apple Intelligence factor in?
Use it for what it's good at - system-integrated intents (lookups, drafting, summarization) - and complement with a cloud agent for long-tail tasks. Don't replace your agent with Apple Intelligence; it doesn't know your domain.
What about Android's Gemini Nano?
Same answer. On-device for routing + short generation; cloud for everything else. Both Apple and Google are moving toward hybrid by default.
How do we handle model updates without forcing app updates?
Ship the on-device classifier as a downloadable bundle, signed and version-pinned, refreshed on a separate cadence from the app binary. Apple's Core ML model packages (.mlpackage) and Android's MediaPipe both support runtime-loaded models. Decouple the model lifecycle from the app lifecycle.
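A sketch of the version-pinning check, assuming the app ships a manifest with the expected digest (the manifest shape is illustrative; production should also verify a publisher signature):

```swift
import CryptoKit
import Foundation

struct ModelManifest: Codable {
    let version: String
    let sha256: String  // hex digest of the bundle the app expects
}

func loadPinnedModel(bundleURL: URL, manifest: ModelManifest) throws -> Data {
    let data = try Data(contentsOf: bundleURL)
    let digest = SHA256.hash(data: data).map { String(format: "%02x", $0) }.joined()
    guard digest == manifest.sha256 else {
        throw CocoaError(.fileReadCorruptFile)  // refuse to load a mismatched bundle
    }
    return data
}
```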
What's the right battery budget?
Below 0.4% per 60-second active session for cloud calls; below 0.15% for on-device-only sessions. Above that, users notice and disable. Long-running streams keep the radio active and cost more - bias toward concise responses on cellular.
How do we test on cellular conditions?
Use Apple's Network Link Conditioner and Android's network-shaping tools in CI. Profile the copilot under three conditions: 5G, good LTE, and degraded LTE. The p95 measurement on degraded LTE is the experience your support team will hear about.
Are there special HIPAA considerations for on-device inference?
On-device inference doesn't transmit PHI off-device, which simplifies the BAA scope. Still log access locally; rotate logs; encrypt at rest. The standard mobile security baseline applies; we review HHS guidance before shipping [5].
What about accessibility on mobile copilots?
VoiceOver / TalkBack support, Dynamic Type honoring, reduce-motion respect. We test against WCAG 2.2 AA and Apple's Accessibility Inspector / Android's Accessibility Scanner on every release. Streaming text rendering is the trickiest a11y case - announce the final text, not every token.
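A sketch of the announce-once pattern on iOS (the stream type is illustrative): render chunks visually as they arrive, but post a single VoiceOver announcement at the end.

```swift
import UIKit

func announceWhenComplete(_ stream: AsyncStream<String>) async {
    var finalText = ""
    for await chunk in stream { finalText += chunk }  // UI renders chunks as usual
    // One announcement for the finished message, not one per token.
    UIAccessibility.post(notification: .announcement, argument: finalText)
}
```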