Running an LLM on an Android Phone

In my last blog post, I showed you how to work with JFR files using DuckDB, which started a blog series that I surely will continue. Just not this week. Instead, I want to showcase a tiny app to run AI models using the MediaPipe API directly on your phone. I created the app for another purpose (perhaps described in a future blog post) earlier this year, but never wrote anything about it. So here we are.

TL;DR: I built an Android app that offers AI models via a server

The app is open-source and available on GitHub; it’s experimental, but maybe it can help you build your own apps. You can download it from the releases page of the repo and install it.

The LLM API endpoint, writing a poem on a backyard scene

The Android App

As already described, you can just download the app, but to fully use it, you need to install some AI models. For models like Google’s Gemma, which require authentication for download, you must click “…” to access the download link and then download the files from HuggingFace after agreeing to the license terms. After downloading, load the model file into the app using the “Load” button. The app can download other models directly. Please note that you may need to refresh the page manually. After installation, you can test the model directly with a basic prompt:

The app opens a port (typically 8005) and allows you to test its web endpoints directly. You can use it to capture images using the rear and front camera and do some object detection, using the EfficientDet Lite 2 model (not the best, but it’s small):
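Since it’s a plain HTTP server, you can first check from Termux (or any other app) that it’s reachable before trying the individual endpoints; a minimal check looks like this (use the port shown in the app if it differs from 8005):

# check that the app's server is reachable; adjust the port if the app shows a different one
curl -s -o /dev/null -w "%{http_code}\n" localhost:8005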

As you saw in the TL;DR section, you can also prompt the installed LLMs, using them, for example, for better on-device object detection:

Which leads to “slightly” better results than the EfficientDet Lite 2 model:

[
  {
    "object": "chair",
    "details": "woven wicker chair with a curved back and a metal frame. Covered in fallen leaves."
  },
  {
    "object": "table",
    "details": "wooden table, partially visible."
  },
  {
    "object": "leaves",
    "details": "Numerous fallen leaves, primarily yellow and brown, scattered on the ground."
  },
  {
    "object": "plants/vines",
    "details": "Green plants and vines growing on a wall or fence behind the chair and table."
  },
  {
    "object": "ground",
    "details": "Paved ground with a brick or stone pattern."
  }
]
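By the way, if you want to work with results like the one above in a script, a single jq call is enough; as a small sketch, assuming the detection result is saved in objects.json (a file name I made up for illustration), this prints just the object names:

# print only the object names from a saved detection result (objects.json is a made-up file name)
jq -r '.[].object' objects.json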

However, in defense of the smaller model, the LLM took 40 times longer (46 seconds vs. 1.2 seconds).
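If you want to measure this yourself, prefixing the curl call with time gives a rough number:

# rough timing of a single LLM request from the shell
time curl -s -d '{ "text": "Write a short, nerdy poem", "model": "GEMMA_3_1B_IT" }' localhost:8005/ai/text > /dev/null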

Please note that, for privacy reasons, the app must be open and visible to capture images.

You can also capture the current orientation of the phone, but that works much like the other APIs.

Server Functionality

As I mentioned earlier, this app starts a server on port 8005, allowing you to easily access its AI capabilities from other apps and from terminal apps such as Termux or the Linux Terminal App.

You can find all the available APIs and their request and response formats in the project’s README, but curl localhost:8005 also gives you an overview:

Please be aware that the /location API is currently not working, but all the other APIs are, as you’ve seen above. Querying the local LLM is simple via curl:

curl -d '{
  "text": "Write a short, nerdy poem",
  "model": "GEMMA_3_1B_IT"
}' localhost:8005/ai/text
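The server answers with JSON, with the generated text in a "response" field (this is also what the fortune script below parses), so a small sketch for extracting just the generated text on the shell could look like this:

# extract only the generated text from the JSON answer
curl -s -d '{ "text": "Write a short, nerdy poem", "model": "GEMMA_3_1B_IT" }' localhost:8005/ai/text \
| sed -n 's/.*"response": *"\(.*\)".*/\1/p'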

The same for the orientation API:
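On the shell, this is again a single curl call; note that the endpoint path below is my assumption for illustration, the real route is listed in the README:

# hypothetical endpoint path, check the project's README for the real route
curl localhost:8005/orientation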

Background

The Google AI Edge Gallery allows you to run the Gemma models directly on your phone with an interactive chat:

It’s a great app for exploring three different Gemma models and one Qwen model on the CPU and GPU of your smartphone, with linked API samples for the MediaPipe library.

The only problem: I wanted to use these models, and more, in an emulated Linux running on my Android phone. However, these emulated OS instances can’t access the camera or other sensors, and they also run applications significantly slower than Android does natively. So I created the app showcased in this blog post to expose all of this functionality via a server.

In the following, we use the app to create a few command-line apps for Termux.

A Tiny Fortune Clone

A tiny sample use case would be a fortune clone. Fortune is a small UNIX utility that “prints a random, hopefully interesting, adage” (from its man-page):

fortune is a program that displays a pseudorandom message from a database of quotations. Early versions of the program appeared in Version 7 Unix in 1979.[1] The most common version on modern systems is the BSD fortune, originally written by Ken Arnold.[2] Distributions of fortune are usually bundled with a collection of themed files, containing sayings like those found on fortune cookies (hence the name), quotations from famous people, jokes, or poetry.

Wikipedia on the fortune utility
> fortune
You could get a new lease on life -- if only you didn't need the first
and last month in advance.
> fortune
Help me, I'm a prisoner in a Fortune cookie file!

Let’s create our own fortune in a tiny shell script using the AI server:

#!/bin/sh
# minimal fortune clone using local AI
# usage: ./fortune.sh

curl -s -d '{ "text": "You are a clone of the unix fortune tool, print a random, hopefully interesting, adage. Only print the single line adage directly.", "model": "GEMMA_3_1B_IT" }' localhost:8005/ai/text \
| sed -n 's/.*"response": *"\(.*\)".*/\1/p' \
| sed 's/\\n//g' \
| sed 's/^```//; s/```$//' \
| sed 's/^\.\.\.//' \
| sed 's/\\u0027/'"'"'/g'
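To use it like the real fortune, make the script executable; in Termux, you can also copy it into $PREFIX/bin to have it on your PATH:

chmod +x fortune.sh
./fortune.sh
# optional: install it as a command in Termux (pick any name you like)
cp fortune.sh $PREFIX/bin/fortune-ai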

With a runtime of usually two to three seconds, it’s not the fastest fortune clone, but it gives interesting results (the right picture shows a version instructed to be a funny clone):

Sometimes the AI server experiences issues; it’s still a prototype…

Conclusion

It’s all a big experiment, demonstrating the power of tiny AI models on modern smartphones. I hope you can use it to develop your own fun little apps or shell scripts in Termux, just as I did for this blog post.

Thank you for coming this far. I look forward to seeing you in the next few weeks for a blog post on instrumenting native agents.

Author

  • Johannes Bechberger

Johannes Bechberger is a JVM developer working on profilers and their underlying technology in the SapMachine team at SAP. This includes improvements to async-profiler and its ecosystem, a website to view the different JFR event types, and improvements to the FirefoxProfiler, making it usable in the Java world. He started at SAP in 2022 after two years of research studies at the KIT in the field of Java security analyses. His work today comprises many open-source contributions, his blog, where he writes regularly on in-depth profiling and debugging topics, and his work on JEP Candidate 435 to add a new profiling API to the OpenJDK.


New posts like these come out at least every two weeks. To get notified about new posts, follow me on BlueSky, Twitter, Mastodon, or LinkedIn, or join the newsletter.
