Most AI features in mobile apps share a quiet dependency that nobody talks about until it becomes a problem: they need the internet to work.
The typical setup goes like this: the app sends audio or text to a cloud service, waits for a response, and displays the result. That works fine in an office, on a city street, or in a coffee shop with decent Wi-Fi. But put that same app in the hands of a nurse doing home visits in a rural area, a logistics driver moving through dead zones, or a field engineer working inside a steel structure, and the feature that looked great in the demo quietly stops working.
This is the problem on-device AI was built to solve. Over the past several months, we've been researching, building, and testing AI features that run entirely on a mobile device, with no internet connection required, no cloud API, and no data leaving the phone. Here's what it is, who actually needs it, how we built it, and what doesn't work yet.
On-device AI, often called Edge AI, means the intelligence runs on the phone itself rather than on a remote server. When it operates with no internet connection required at all, you'll also see it referred to as Offline AI. Instead of your app sending data to a cloud service and waiting for a response, the AI model is embedded directly in the app. It processes audio, text, images, or commands using the device's own processor and returns a result without ever touching the internet.
This is different from the AI features most people are familiar with. When you use ChatGPT, your message travels to a server, gets processed, and a response comes back. That round trip requires connectivity, costs money per use, and means your data has passed through someone else's infrastructure. With on-device AI, none of that happens, because all the processing stays on the device from start to finish.
One thing worth clarifying before going further: on-device AI today is not a general-purpose assistant, and it won't replace a cloud-based model that can answer anything about anything. What it can do is handle specific tasks well, including transcribing speech, summarizing text, detecting objects, recognizing faces, and extracting keywords. A well-chosen on-device model does its one job reliably, offline, and at no recurring cost.
Reliability without a network connection is the most important reason to consider on-device AI, and also the most underestimated one during product planning. Apps get designed and tested by people sitting in offices with fast Wi-Fi, but they get used by people in warehouses, on highways, in hospitals, on boats, and on construction sites. Connectivity is simply not guaranteed in those environments, and when an AI feature fails silently because there's no signal, users don't think "the network is down." They think the app doesn't work.
On-device AI removes that failure mode entirely, which means your feature works the same whether the user has five bars of signal or none at all.
When your app sends audio recordings or conversation transcripts to a cloud service, you've created a data processing relationship with a third party. For industries handling sensitive information, like patient conversations, client meetings, or insurance assessments, this creates compliance and legal implications that most businesses underestimate until something goes wrong.
With on-device AI, the audio is recorded, processed, and summarized entirely on the user's device. There's no API call, no data transmission, and nothing to audit or secure on a third-party server.
Cloud AI services charge per minute, per token, or per API call. In the early stages of a product that cost seems manageable, but as the user base grows, it becomes a line item that scales with every single interaction.
On-device AI is a development cost, not an operational one. Once the model is built into the app, there's no recurring charge. A user can run the feature a hundred times a day and it costs the same as running it once.
This was actually the original motivation behind our own research. A client was using a third-party transcription service that worked only when connected, and both the recurring cost and the connectivity dependency were problems they wanted to solve. That conversation started the investigation that led to everything described in this post.
Not every business or every app is the right fit. If your users are always connected, your data isn't sensitive, and your volumes are low, a cloud API may well serve you better. There is, though, a clear set of situations where on-device AI isn't a nice-to-have upgrade but the right answer.
Logistics and field operations. Drivers, delivery coordinators, and site workers move through environments where connectivity is unpredictable. Voice-to-notes, trip summaries, document scanning, and inspection reports need to work on a rural route or in a parking lot outside a warehouse, not just when the driver happens to have signal.
Healthcare field workers. Nurses doing home visits, community health workers, and paramedics operate in environments where both connectivity and privacy matter at the same time. Transcribing a patient conversation on-device, generating a structured note from it, and uploading that note when back on the network solves two problems at once; a sketch of that deferred-upload step follows this list.
Maritime and offshore workers. A sailor may have internet access once a week. Offshore oil platforms, fishing vessels, and cargo ships operate in connectivity conditions that most app developers never consider when designing features. Any capability that depends on a live API connection is effectively unusable for these users the majority of the time.
Energy, construction, and engineering teams. Inspection crews working underground, inside steel structures, or in remote infrastructure sites regularly operate in signal-dead environments. For these workers, AI-assisted checklists, automatic transcription of voice memos, and on-device document processing aren't premium features but basic requirements.
Conservation, national parks, and field research. A field researcher trying to identify a species, or a park visitor asking about the trail they're standing on, needs an answer right now, not after finding Wi-Fi.
Regulated industries handling sensitive conversations. Insurance assessors, legal professionals, and clinical staff all have situations where recording and processing a conversation creates compliance obligations. On-device AI removes the question of third-party data access entirely, because there simply is no third party involved.
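The healthcare pattern above, process on-device and sync when the network returns, maps cleanly onto Android's WorkManager. Here's a minimal sketch of that deferred-upload step; the worker class and the upload call itself are our own placeholders, not something prescribed by the pipeline described later in this post.

```kotlin
import android.content.Context
import androidx.work.Constraints
import androidx.work.CoroutineWorker
import androidx.work.NetworkType
import androidx.work.OneTimeWorkRequestBuilder
import androidx.work.WorkManager
import androidx.work.WorkerParameters
import androidx.work.workDataOf

// Hypothetical worker: reads a locally stored note and uploads it. The upload
// call itself is a placeholder for whatever backend client the app uses.
class UploadNoteWorker(context: Context, params: WorkerParameters) :
    CoroutineWorker(context, params) {
    override suspend fun doWork(): Result {
        val noteId = inputData.getString("note_id") ?: return Result.failure()
        // uploadNote(noteId)  // placeholder for the real network call
        return Result.success()
    }
}

// Enqueue the upload so it only runs once the device is back online.
fun scheduleNoteUpload(context: Context, noteId: String) {
    val request = OneTimeWorkRequestBuilder<UploadNoteWorker>()
        .setConstraints(
            Constraints.Builder()
                .setRequiredNetworkType(NetworkType.CONNECTED)
                .build()
        )
        .setInputData(workDataOf("note_id" to noteId))
        .build()
    WorkManager.getInstance(context).enqueue(request)
}
```

The same pattern applies to any of the field scenarios above: the AI work happens offline, and only the finished artifact waits for connectivity.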
The feature we built and tested is one of the most common AI requests we hear from clients: record a conversation, transcribe it, and generate a structured summary, fully offline, on an Android or iOS device.
The system has two stages, speech-to-text and summarization, and they're deliberately separate. They require different models, run at different points in the workflow, and need to be optimized independently from each other.
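To make that separation concrete, here's a minimal sketch of the boundary between the two stages. The interface names are ours and purely illustrative; the point is that either side can be swapped without touching the other.

```kotlin
// Illustrative interfaces only -- the names are hypothetical, not a library API.
interface Transcriber {
    // Converts recorded audio (e.g. 16 kHz mono PCM) into raw text.
    fun transcribe(audio: ShortArray): String
}

interface Summarizer {
    // Turns a raw transcript into a structured summary.
    fun summarize(transcript: String): String
}

// The two stages stay independent, so either side can be swapped
// (Vosk vs. Whisper.cpp, LLM vs. extractive) without touching the other.
class NotePipeline(
    private val transcriber: Transcriber,
    private val summarizer: Summarizer,
) {
    fun process(audio: ShortArray): Pair<String, String> {
        val transcript = transcriber.transcribe(audio)
        return transcript to summarizer.summarize(transcript)
    }
}
```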
On Android, we evaluated four options.
The Android system's built-in speech recognizer is the obvious starting point since it's native, free, and requires no integration work. It has one fundamental limitation, though: it terminates automatically when the user pauses speaking. It was designed for short voice commands, not for recording a meeting or an interview, so for any long-form transcription it's simply the wrong tool.
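For reference, this is roughly where that limitation shows up in code, in a minimal sketch using the standard SpeechRecognizer API: the session simply ends once onEndOfSpeech fires after a pause.

```kotlin
import android.content.Context
import android.content.Intent
import android.os.Bundle
import android.speech.RecognitionListener
import android.speech.RecognizerIntent
import android.speech.SpeechRecognizer

fun startSystemRecognizer(context: Context) {
    val recognizer = SpeechRecognizer.createSpeechRecognizer(context)
    recognizer.setRecognitionListener(object : RecognitionListener {
        override fun onResults(results: Bundle?) {
            // One final result per session -- fine for a voice command,
            // useless for a 30-minute meeting.
            val text = results
                ?.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
                ?.firstOrNull()
        }
        // Fires as soon as the user pauses: the session is over after this,
        // which is the dealbreaker for long-form recording.
        override fun onEndOfSpeech() {}
        override fun onError(error: Int) {}
        override fun onReadyForSpeech(params: Bundle?) {}
        override fun onBeginningOfSpeech() {}
        override fun onRmsChanged(rmsdB: Float) {}
        override fun onBufferReceived(buffer: ByteArray?) {}
        override fun onPartialResults(partialResults: Bundle?) {}
        override fun onEvent(eventType: Int, params: Bundle?) {}
    })
    recognizer.startListening(
        Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH)
            .putExtra(
                RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                RecognizerIntent.LANGUAGE_MODEL_FREE_FORM
            )
    )
}
```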
Vosk is an open-source engine that handles live audio streaming with very low latency; partial results appear as the user speaks, typically in under 200 milliseconds. It runs on low-end devices, uses minimal battery, and works completely offline. The tradeoff is output quality: Vosk produces plain text with no punctuation and no sentence structure, which means the raw output needs post-processing before you can build anything useful on top of it.
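Wired up through Vosk's Android bindings, the streaming setup looks roughly like this. It's a sketch that assumes the model files are already unpacked to local storage and leaves error handling out.

```kotlin
import org.vosk.Model
import org.vosk.Recognizer
import org.vosk.android.RecognitionListener
import org.vosk.android.SpeechService

fun startVoskStreaming(modelDir: String): SpeechService {
    val model = Model(modelDir)                    // unpacked Vosk model directory
    val recognizer = Recognizer(model, 16000.0f)   // 16 kHz mono audio
    val speechService = SpeechService(recognizer, 16000.0f)

    speechService.startListening(object : RecognitionListener {
        override fun onPartialResult(hypothesis: String?) {
            // JSON like {"partial": "..."} arrives while the user is still speaking.
        }
        override fun onResult(hypothesis: String?) {
            // JSON like {"text": "..."} for each finalized utterance -- no punctuation,
            // so this is what the post-processing step has to clean up.
        }
        override fun onFinalResult(hypothesis: String?) {}
        override fun onError(exception: Exception?) {}
        override fun onTimeout() {}
    })
    return speechService
}
```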
Whisper.cpp is a C/C++ port of OpenAI's open-source Whisper transcription model, adapted for on-device use. It doesn't support live streaming natively, so we had to split audio into 3-second chunks and process them sequentially to approximate real-time behavior. For batch processing of recorded audio, however, it's the clear quality winner. We tested two model sizes: the 42MB tiny-q8_0 transcribed a 5-minute file in 35.8 seconds, while the 78MB base-q8_0 took 77.8 seconds for the same file but produced noticeably better sentence structure and punctuation. For most use cases, the smaller model is the right default.
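Whisper.cpp is C++ and reaches Kotlin through a JNI bridge, so the exact call depends on your wrapper. The chunking logic around it looked roughly like the sketch below, where WhisperBridge.transcribe is a hypothetical native method, not part of the whisper.cpp distribution.

```kotlin
// Hypothetical JNI wrapper around whisper.cpp -- the native method name,
// signature, and library name are illustrative.
object WhisperBridge {
    init { System.loadLibrary("whisper_jni") }
    external fun transcribe(pcm: FloatArray): String
}

// Approximate "live" transcription by slicing the incoming 16 kHz mono stream
// into 3-second chunks and running them through the model sequentially.
fun transcribeInChunks(pcm: FloatArray, sampleRate: Int = 16_000): String {
    val chunkSamples = sampleRate * 3
    val pieces = mutableListOf<String>()
    var offset = 0
    while (offset < pcm.size) {
        val end = minOf(offset + chunkSamples, pcm.size)
        pieces += WhisperBridge.transcribe(pcm.copyOfRange(offset, end))
        offset = end
    }
    return pieces.joinToString(" ")
}
```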
Picovoice is a commercial option with small models and strong performance. It works well technically, but it requires a paid license for production use: 250 minutes per month are free, and beyond that it costs $500 a month for 25,000 minutes. For some use cases that's reasonable, but for others it reintroduces the exact per-usage cost structure that on-device AI was supposed to eliminate in the first place.
Our recommendation is Vosk for live streaming scenarios and Whisper.cpp for processing recorded files. They're different tools for different situations, and both feed into the same summarization pipeline downstream.
On iOS, the picture is considerably simpler. Apple's SpeechTranscriber API, available from iOS 26, delivers real-time, high-accuracy transcription that runs entirely through the Neural Engine. For modern devices it's the best option available.
We tested four on-device language models on Android: Gemma3-270M, Gemma3-1B, TinyLlama-1.1B, and Qwen3-0.6B.
Gemma3-270M at 276MB is the smallest and fastest of the group. It runs on mid-range devices without issue, but it struggles with anything complex or multi-topic. For a simple, linear conversation it does a reasonable job; for a meeting that covers several subjects, the output tends to become inconsistent.
Gemma3-1B at 555MB improves coherence over the smaller model but starts pushing against memory and thermal limits on mid-range hardware. The improvement in output quality doesn't justify the additional resource cost for most practical scenarios.
TinyLlama-1.1B at 1.12GB produces better language quality in theory, but in practice it's unstable on most Android devices: the model is too large and inference is too slow for a production environment.
Qwen3-0.6B at 614MB won on every metric that matters, producing structured bullet-point summaries, following instructions reliably, and handling long transcripts consistently well. It's the model we recommend for production on-device summarization on Android.
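This post doesn't depend on a particular runtime, but one common way to run a model of this size on Android is Google's MediaPipe LLM Inference API. The sketch below assumes the Qwen3-0.6B weights have been converted into a MediaPipe-compatible bundle and shipped with the app; the model path and prompt are illustrative.

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Assumes the Qwen3-0.6B model has been converted to a MediaPipe-compatible
// bundle and placed at this path -- the file name is illustrative.
fun summarizeOnDevice(context: Context, transcript: String): String {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/qwen3-0.6b.task")
        .setMaxTokens(1024)
        .build()
    val llm = LlmInference.createFromOptions(context, options)

    val prompt = """
        Summarize the following conversation as short bullet points,
        grouped by topic. Keep names, dates, and action items.

        $transcript
    """.trimIndent()

    return llm.generateResponse(prompt)
}
```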
On iOS, Apple's Foundation Models API, available on iPhone 15 Pro and later, provides the simplest summarization path currently available. It requires a single API call with no model management needed. For older devices, Reductio, a pure-Swift extractive summarization library, provides a solid and reliable fallback.
Going native on iOS opens up more than AI. If this has you thinking about what else Apple's platform can do, we put together a practical guide on App Intents — covering Siri-driven workflows, voice commands, widget integration, and HomeKit control, all built natively without leaving the iOS ecosystem.
Download the guide: Taking Advantage of Native iOS Platform Features
For the kinds of recordings that come up in everyday professional use (a field visit, a client meeting, a shift briefing), on-device transcription and summarization is a practical and usable workflow. A 5-minute recording produces a transcript in under a minute and a clean summary in around 2 to 3 minutes on a modern mid-to-high-end device.
We ran the full pipeline on a one-hour recording to understand where the limits are, and the results were instructive. Transcription alone took 18 minutes, and the resulting transcript came out at roughly 96,600 characters. Summarization, which required splitting the transcript into 12 chunks of 2,000 tokens each and processing them sequentially, took 1 hour and 40 minutes, bringing the total processing time for a one-hour file to approximately 2 hours.
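For context, the chunked pass works roughly like the sketch below. The character-based token estimate is a crude approximation, and summarizeOnDevice is the hypothetical helper from the earlier sketch, so treat this as an outline rather than the exact implementation.

```kotlin
import android.content.Context

// Crude approximation: ~4 characters per token, so a 2,000-token chunk is
// roughly 8,000 characters. A real tokenizer would be more precise.
fun chunkByApproxTokens(text: String, maxTokens: Int = 2_000): List<String> =
    text.chunked(maxTokens * 4)

// Summarize each chunk sequentially with the on-device model, then join the
// partial summaries. An optional final pass could condense them further.
fun summarizeLongTranscript(context: Context, transcript: String): String {
    val partialSummaries = chunkByApproxTokens(transcript)
        .map { chunk -> summarizeOnDevice(context, chunk) }
    return partialSummaries.joinToString("\n\n")
}
```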
The conclusion is clear: LLM-based summarization of recordings longer than around 30 minutes isn't practical on current mobile hardware. For longer files, the right approach is extractive summarization, a statistical method that selects and ranks the most important sentences from the transcript rather than generating new text. It runs in milliseconds, produces no hallucinations, and works on every device. The output quality is lower than a generative model, but it's fast, reliable, and doesn't push the device into thermal stress.
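Extractive summarization needs no model at all. A minimal frequency-based version, our own illustration rather than a specific library, looks like this; production libraries typically use smarter ranking such as TextRank, but the principle is the same.

```kotlin
// Score each sentence by the summed frequency of its words, then return the
// top-scoring sentences in their original order.
fun extractiveSummary(transcript: String, sentenceCount: Int = 5): String {
    val sentences = transcript
        .split(Regex("(?<=[.!?])\\s+"))
        .filter { it.isNotBlank() }
    if (sentences.size <= sentenceCount) return transcript

    // Word frequencies across the whole transcript, skipping very short tokens.
    val wordFreq = sentences
        .flatMap { it.lowercase().split(Regex("\\W+")) }
        .filter { it.length > 3 }
        .groupingBy { it }
        .eachCount()

    val scored = sentences.mapIndexed { index, sentence ->
        val score = sentence.lowercase().split(Regex("\\W+"))
            .sumOf { wordFreq[it] ?: 0 }
        Triple(index, sentence, score)
    }

    return scored.sortedByDescending { it.third }
        .take(sentenceCount)
        .sortedBy { it.first }          // keep original order
        .joinToString(" ") { it.second }
}
```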
The expectation that needs the most adjustment before a project starts is the idea that one model can do everything, because it shapes the entire scope conversation.
You can't build a model that transcribes audio, answers general questions, identifies objects, translates languages, and detects anomalies at a size that fits on a phone with acceptable performance. The right approach is to identify one or two tasks that matter most for your specific users and build those well. A logistics app might need voice-to-notes and route documentation. A healthcare app might need conversation transcription and structured note generation. Both are achievable today, and both work reliably offline.
The results we saw on a Pixel 6 Pro with 12GB of RAM are not the results you'll see on a mid-range Android device with 4GB. For apps targeting a wide range of hardware, a tiered approach is necessary: a higher-quality pipeline for modern flagship devices and a lighter-weight fallback for everything else.
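A simple way to implement that tiering is to gate the pipeline on total device memory. The threshold below is a placeholder rather than a tested cut-off; in practice you'd tune it against the devices you actually target, and ideally also consider thermal headroom and available storage.

```kotlin
import android.app.ActivityManager
import android.content.Context

enum class PipelineTier { FULL_LLM, EXTRACTIVE_ONLY }

// Pick a pipeline based on total RAM. The 6 GB threshold is a placeholder,
// not a measured cut-off.
fun selectPipelineTier(context: Context): PipelineTier {
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val memInfo = ActivityManager.MemoryInfo()
    am.getMemoryInfo(memInfo)
    val totalGb = memInfo.totalMem / (1024.0 * 1024.0 * 1024.0)
    return if (totalGb >= 6.0) PipelineTier.FULL_LLM else PipelineTier.EXTRACTIVE_ONLY
}
```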
Building on-device AI is more involved than integrating a cloud API. On Android, the best transcription approach integrates through the NDK, which is a lower-level development path that requires careful memory management and threading. That takes real engineering time and expertise to do well. But it's a one-time investment, and once the pipeline is built, it runs at no ongoing cost for every user on every session for as long as the app is live.
Apple's latest APIs make on-device summarization a single function call on modern iPhones. Google is actively building native AI capabilities into Android. The models themselves are getting smaller and more capable with every release cycle.
The gap between what on-device AI can do and what cloud AI can do is closing steadily. The gap between what on-device AI costs to run and what cloud AI costs to run is not.
If your users work in places where the internet doesn't reliably reach, handle conversations that shouldn't leave the room, or rely on features that currently break every time signal drops, this is worth a conversation.
We've done the research, and we know what works.