tapflow

The DX we wanted for tapflow setup: a host-ready Mac in one command

duchan jo — Wed, 17 Jun 2026 14:00:05 GMT

tapflow streams iOS simulators and Android emulators into a browser, so a whole team can test an app without installing anything. The simulators run on a Mac that hosts the tapflow agent, and that Mac needs the mobile toolchain: Xcode, a simulator runtime, the Android SDK, an emulator, AVDs.

tapflow already gives the people who use it a zero-install experience. We wanted the person who hosts it to get the same one. So tapflow setup brings the whole host environment up in as close to one command as the toolchain allows.

https://youtu.be/RTLBJIrHf9M

Here are the DX decisions behind it.

`doctor` diagnoses, `setup` installs and configures

We split diagnosis from the steps that install and configure.

tapflow doctor is read-only: it checks the prerequisites — Xcode, simctl, a simulator runtime; the SDK, adb, an AVD — and reports. It never changes your machine, so it's safe to run anywhere, anytime. A clean run looks like this:

tapflow doctor

  ✓  Node v20.11.0

  iOS
  ✓  Xcode 16.2
  ✓  xcrun simctl
  ✓  Simulator available (8)

  Android
  ✓  Android SDK: ~/Library/Android/sdk
  ✓  adb found: ~/Library/Android/sdk/platform-tools/adb
  ✓  AVD available: tapflow-phone

  All checks passed.

When something's missing, each failing line carries the exact fix — ⚠ AVD → No AVD found. Run: tapflow setup android — so doctor always points straight at setup.

tapflow setup is the one command allowed to install and configure. The two mirror each other, so the mutating verb lives in exactly one place. Run setup with no argument and it reads the environment:

if (process.platform === 'darwin') platforms.push('ios')
if (resolveAdb() !== null) platforms.push('android')

macOS implies iOS; an existing adb implies you care about Android. If neither signal is there, it asks instead of guessing.
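The detection rule above can be sketched as a pure function. This is a minimal illustration, not tapflow's actual code: `Platform`, `Detection`, `detectPlatforms`, and the `'ask'` outcome are hypothetical names, and the host platform and adb path are passed in so the logic can be exercised without touching the machine.

```typescript
// Hypothetical names throughout; only the two signals come from the post.
type Platform = 'ios' | 'android'

type Detection =
  | { kind: 'detected'; platforms: Platform[] }
  | { kind: 'ask' } // no signal at all: prompt instead of guessing

function detectPlatforms(hostPlatform: string, adbPath: string | null): Detection {
  const platforms: Platform[] = []
  if (hostPlatform === 'darwin') platforms.push('ios') // macOS implies iOS
  if (adbPath !== null) platforms.push('android') // an existing adb implies Android
  return platforms.length > 0 ? { kind: 'detected', platforms } : { kind: 'ask' }
}
```

Keeping the "ask" case as an explicit result, rather than an empty list, makes it hard for a caller to treat "no signal" as "nothing to do".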

iOS: installed ≠ usable

Xcode can only come from the App Store, so setup doesn't fake it — it opens the right page and waits.

The part we kept getting wrong was that a freshly installed Xcode isn't a working one. Three steps stand between "the app exists" and "xcodebuild runs": point the active developer directory at Xcode (xcode-select -s), accept the license, and finish first launch. setup runs them for you, after asking, since they need sudo. The check that matters isn't "does Xcode.app exist" — it's whether xcodebuild -version actually runs.
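A rough sketch of that "does it actually run" check, under stated assumptions: `Run` and `isXcodeUsable` are hypothetical names, and the command runner is injected so the example stays self-contained. The only fact relied on is that `xcodebuild -version` prints a line starting with `Xcode` followed by the version when the toolchain works.

```typescript
// `run` executes a command and returns its stdout, or throws if the command fails.
type Run = (command: string, args: string[]) => string

function isXcodeUsable(run: Run): boolean {
  try {
    // Fails until the developer directory is selected, the license accepted,
    // and first launch finished, even if Xcode.app is already on disk.
    const out = run('xcodebuild', ['-version'])
    return /^Xcode \d+/.test(out.trim())
  } catch {
    return false // installed is not the same as usable
  }
}
```

In Node, `run` could be a thin wrapper around `execFileSync` from `node:child_process` with `encoding: 'utf8'`.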

Android: a self-contained SDK

This is the decision we're happiest with.

The obvious path is "install Android Studio." We didn't. The host doesn't need a GUI IDE, and depending on one means fighting whatever SDK location, ANDROID_HOME, and AVDs the user already has. Instead setup builds a self-contained SDK under one path we own:

sdkmanager --sdk_root="$HOME/Library/Android/sdk" \
  "cmdline-tools;latest" "platform-tools" "emulator" \
  "system-images;android-35;google_apis;arm64-v8a"

After that, every Android binary tapflow touches comes from inside that directory. A couple of details that make it reliable:

sdkmanager needs a JDK or it won't run, so a Temurin check happens first.
The system image is google_apis, not the Play Store one, which is unstable the way we drive it.

The one thing that has to outlive the process is ANDROID_HOME and the PATH entries built from it. setup writes them into your shell rc inside a marker block, and only if the block isn't already there, so re-running never duplicates it:

# >>> tapflow android sdk >>>
export ANDROID_HOME="$HOME/Library/Android/sdk"
export PATH="$ANDROID_HOME/platform-tools:$ANDROID_HOME/emulator:$PATH"
# <<< tapflow android sdk <<<

The catch it can't remove: the variable isn't in your current shell, so setup tells you to open a new terminal before doctor.
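The idempotent write can be sketched as a function over the rc file's contents. This is an illustration under assumptions: `ensureMarkerBlock` is a hypothetical name, and only the two marker lines are taken from the block shown above.

```typescript
const BEGIN = '# >>> tapflow android sdk >>>'
const END = '# <<< tapflow android sdk <<<'

// Returns the rc contents with the block appended, or unchanged if the marker is already there.
function ensureMarkerBlock(rc: string, body: string[]): string {
  if (rc.includes(BEGIN)) return rc // already present: leave the file untouched
  const block = [BEGIN, ...body, END].join('\n')
  const separator = rc.length === 0 || rc.endsWith('\n') ? '' : '\n'
  return rc + separator + block + '\n'
}
```

Checking for the marker rather than for the `export` lines themselves means a user's own, differently written `ANDROID_HOME` line never counts as "already done" by accident.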

setup prepares; the relay boots

For both platforms, setup stops at "a bootable device exists." It never boots a simulator or emulator. That's the relay's job — it boots the right device on demand when a teammate joins a QA session. Two components owning device lifecycle would just race each other.

One rail under all of it

Every step that changes the machine asks first, and only auto-runs in an interactive terminal. Run setup in CI and instead of curling an install script as root, it prints guidance and exits clean. Nothing in tapflow deletes your data — the only teardown command is reset, which shuts down running simulators.
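One way to express that rail is a single decision function that every mutating step consults. The sketch below is an assumption about shape, not the real implementation: `Mode`, `mutationMode`, and the three input flags are hypothetical.

```typescript
type Mode = 'prompt-then-run' | 'print-guidance'

// Decide how a machine-changing step may proceed in the current environment.
function mutationMode(env: { stdinIsTTY: boolean; stdoutIsTTY: boolean; ci: boolean }): Mode {
  const interactive = env.stdinIsTTY && env.stdoutIsTTY && !env.ci
  // Interactive: ask first, then run. Anything else: explain what to do and exit cleanly.
  return interactive ? 'prompt-then-run' : 'print-guidance'
}
```

In Node the inputs could come from `process.stdin.isTTY`, `process.stdout.isTTY`, and a `CI` environment variable.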

Honest limitations

Xcode is still a manual App Store download. There's no API for it; setup automates everything around it.
The first run usually needs a new shell for the Android PATH to take effect.
The agent host is a Mac, since that's the only place iOS simulators run. (The relay itself runs on Linux.)
Still v0.x, so the steps will keep moving as the toolchain shifts.

Takeaway

The downloads were never the hard part. The DX lives in the glue around them — Xcode activation, the PATH that needs a fresh shell, knowing to stop at "bootable" and let the relay boot the rest. Automating a dev environment means automating everything that isn't the install.

Try it

tapflow is MIT licensed.

npm install -g tapflow
tapflow doctor     # what's missing?
tapflow setup      # set it up
tapflow start

🔗 GitHub: https://github.com/jo-duchan/tapflow
📖 Docs: https://www.tapflow.dev

We switched simulator streaming to H.264 and it felt worse. Here's how we fixed the latency.

duchan jo — Wed, 10 Jun 2026 07:13:34 GMT

In an earlier post I described how tapflow streams iOS simulators to the browser: pull frames off the simulator's IOSurface, JPEG-encode them on the Mac, push them over WebSocket at ~30fps.

JPEG has one great property for interactive streaming: every frame is independent and decodes instantly. There's no buffer, no inter-frame dependency. On localhost it feels like you're touching the simulator directly.

It also has one terrible property: size. A full-frame JPEG of a scrolling screen is ~590KB. On a LAN that's 12–16 MB/s, and our relay started dropping 16–27 frames a second under backpressure — visible tearing.

So we did the obvious thing and moved to H.264. Bandwidth dropped roughly 140× on a still screen and 5× while scrolling. Drops nearly vanished.

And the stream felt worse.

This post is about why, and the two fixes that got H.264 back to "feels like direct touch."

The bar: localhost JPEG

Before touching anything I needed a number, not a vibe. So I instrumented the pipeline end to end — a per-stage panel that reports decode→present and glass→glass (capture timestamp to on-screen) latencies live.

One caveat I'll repeat throughout: glass→glass absolute values are only valid on localhost, where capture and display share one clock. decode→present is a same-machine delta and valid anywhere, so I'll lean on it for the cross-environment claims.
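For reference, here is one way the p50/p95 figures could be computed from per-frame decode→present samples. The post doesn't say which percentile method the panel uses, so this sketch assumes nearest-rank; `percentile` and `summarize` are hypothetical names.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples')
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100))
  return sorted[rank - 1]
}

function summarize(decodeToPresentMs: number[]): { p50: number; p95: number } {
  return { p50: percentile(decodeToPresentMs, 50), p95: percentile(decodeToPresentMs, 95) }
}
```

Because each sample is a delta between two timestamps taken on the same machine, this summary stays meaningful across environments in a way the glass→glass absolutes do not.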

Here's the baseline that mattered, measured on localhost:

Path	decode→present p50/p95 (ms)
JPEG still	12.4 / 15.4
JPEG scroll	9.4 / 11.6
H.264 (WebCodecs) still	267 / 274

H.264 decode was ~20× slower than JPEG. On a hardware decoder. That made no sense — until I looked at what the decoder was actually doing.

Fix 1: the decoder was buffering 8 frames for no reason

The transport was clean (~1ms), the input queue was empty. The latency was entirely inside the decoder: it was holding ~8 frames before emitting the first one.

That's a DPB (decoded picture buffer). A decoder reorders frames when B-frames are present — it has to wait for future frames to arrive before it can output the current one in display order. So it buffers up to the level's maximum.

But our encoder is baseline H.264, B-frames off. There is no reordering. The actual reorder depth is zero. The decoder was buffering anyway because the bitstream never told it the reorder depth was zero.

The signal lives in the SPS (sequence parameter set), in the bitstream_restriction flags inside VUI. Our VideoToolbox encoder wasn't setting them, so the decoder fell back to the worst case for the level — max_dec_frame_buffering of ~8 frames at Level 5.0.

The fix is to rewrite the SPS and inject the missing declaration:

max_num_reorder_frames = 0
max_dec_frame_buffering = num_ref_frames

We do this in the agent, on the keyframe SPS, before the frame ever leaves the Mac — so every decoder downstream benefits, not just one browser path:

// agent-core/utils/sps.ts — rewrite the SPS to declare zero reordering
function rewriteLowLatencySps(sps: Uint8Array): Uint8Array {
  const bits = new BitstreamWriter(parseSps(sps))
  bits.vui.bitstreamRestriction = true
  bits.vui.maxNumReorderFrames = 0
  bits.vui.maxDecFrameBuffering = bits.numRefFrames
  return serialize(bits)
}
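One detail that explains why this is a parse-and-reserialize job rather than a byte patch: in the H.264 syntax, max_num_reorder_frames and max_dec_frame_buffering are coded as unsigned Exp-Golomb values, ue(v), so adding or changing them shifts every bit that follows. A minimal sketch of that encoding, with `encodeUe` as a hypothetical helper that returns the bits as a string for readability:

```typescript
// Encode a non-negative integer as unsigned Exp-Golomb, ue(v), as a string of '0' and '1'.
function encodeUe(value: number): string {
  if (!Number.isInteger(value) || value < 0) {
    throw new RangeError('ue(v) needs a non-negative integer')
  }
  const binary = (value + 1).toString(2) // value + 1 written in binary
  return '0'.repeat(binary.length - 1) + binary // leading-zero prefix, then the binary digits
}
```

So reorder depth 0 costs a single `1` bit. After the rewritten payload is serialized, emulation-prevention bytes also have to be re-applied before the SPS goes back into the stream.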

Result on localhost:

Path	decode→present p50/p95 (ms)
H.264 WebCodecs still (before)	267 / 274
H.264 WebCodecs still (after)	2.5 / 4
H.264 WebCodecs scroll (after)	2.1 / 3.9

267 → 2.5ms, roughly 100×. The encoder was lying to the decoder by omission, and the decoder defended itself by buffering. One declaration fixed it.

The browser confirms it's receiving the rewrite — the SPS now reports bitstreamRestriction: true, maxNumReorderFrames: 0.

Fix 2: MSE is a buffer you can't turn off

Fix 1 only helps the WebCodecs path. And WebCodecs has a hard constraint: it only runs in a secure context — HTTPS or localhost.

A team using tapflow over their LAN hits it at plain http://:4000. That's a non-secure context, so the browser can't use WebCodecs. The fallback at the time was MSE (Media Source Extensions): feed the H.264 into a <video> element through a muxer.

The problem is that <video> is a buffer. It's designed for media playback, where a jitter buffer is a feature. For interactive streaming it's structural latency you can't remove. I measured it on localhost by forcing the MSE tier:

Path	decode→present p50/p95 (ms)
H.264 MSE still	239 / 254
H.264 MSE scroll	229 / 244

~235ms, on the same reorder=0 stream that WebCodecs decoded in 2.5ms. The SPS fix can't reach this — it's the media-element buffer, not the decoder's DPB. I'd already set the muxer's flushingTime to 0. There was nothing left to shave.

So I stopped trying to make MSE fast and removed it.

The decoder layer is now two tiers, picked automatically per environment:

// pickDecoder — secure → WebCodecs, otherwise WASM
export function pickDecoder(): Decoder | null {
  if (isSecureContext && 'VideoDecoder' in window) {
    return new WebCodecsDecoder()      // HW, lowest latency
  }
  if (webgl2Available && wasmSupported) {
    return new WASMDecoder()           // tinyh264, zero-buffer
  }
  return null                          // → fall back to JPEG
}

On non-secure LAN-HTTP, we decode H.264 in WASM (tinyh264). It's a software decoder, so it costs CPU — but it has no media-element buffer at all. That's the whole point: it gives you JPEG's immediacy with H.264's bandwidth, on plain HTTP.

Measured on localhost (the worst case — encoder and decoder share one Mac):

Path	decode→present p50/p95 (ms)
H.264 WASM still	8.7 / 30.4
H.264 WASM scroll	14.3 / 37.9

At p50 that's on par with the localhost-JPEG baseline (12.4 ms still, 9.4 ms scroll), the bar we set at the start, though the p95 tail is longer (30.4 and 37.9 ms versus 15.4 and 11.6 ms). Removing MSE also let us drop the muxer dependency entirely.

One constraint this introduces: tinyh264 only decodes baseline H.264. iOS already encodes baseline. For Android we pin scrcpy to baseline (profile:int=1) so both platforms share the exact same HTTP→WASM path. High profile is still available on the WebCodecs (secure) tier.

One more thing: dropping H.264 isn't like dropping JPEG

There's a subtlety the switch exposed. With JPEG, every frame is a keyframe, so dropping a frame under backpressure is harmless — the next one stands alone. With H.264, if you drop a P-frame, every following P-frame references something the decoder never received. A zero-buffer decoder like WASM tinyh264 shears until the next IDR arrives.

So the relay had to become keyframe-aware: once it starts dropping under backpressure, it drops the whole GOP until the next keyframe, rather than handing the decoder a broken reference chain. The keyframe flag rides in our frame envelope, so this needs zero NAL parsing on the relay.

// relay — once dropping, drop until the next keyframe
if (backpressured) {
  dropping = true                     // this frame is lost, so its GOP is now broken
  return
}
if (dropping) {
  if (!frame.isKeyframe) return       // skip P-frames in a broken GOP
  dropping = false                    // keyframe resets the chain
}
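To make the state machine easier to reason about, here is a self-contained sketch with a short trace. `Frame`, `GopDropper`, and `shouldForward` are hypothetical names; the only input assumed from the real system is the keyframe flag carried in the frame envelope.

```typescript
interface Frame { isKeyframe: boolean }

class GopDropper {
  private dropping = false

  // Returns true when the frame should be forwarded to the client.
  shouldForward(frame: Frame, backpressured: boolean): boolean {
    if (backpressured) {
      this.dropping = true // this frame is lost, so the rest of its GOP cannot be decoded
      return false
    }
    if (this.dropping) {
      if (!frame.isKeyframe) return false // still inside the broken GOP
      this.dropping = false // a keyframe starts a clean reference chain
    }
    return true
  }
}

// Trace: K P P(backpressured) P K P
const dropper = new GopDropper()
const trace = [
  dropper.shouldForward({ isKeyframe: true }, false),
  dropper.shouldForward({ isKeyframe: false }, false),
  dropper.shouldForward({ isKeyframe: false }, true),
  dropper.shouldForward({ isKeyframe: false }, false),
  dropper.shouldForward({ isKeyframe: true }, false),
  dropper.shouldForward({ isKeyframe: false }, false),
]
```

The fourth frame is the interesting one: backpressure has already cleared, but the frame is still dropped because its reference was never delivered.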

Honest limitations

WASM decode is CPU-bound. At high resolution × fps it hits a CPU ceiling. We mitigate by downscaling the encode resolution — the display is small, so it's a triple win on bandwidth, CPU, and latency.
The localhost numbers are best-case for latency and worst-case for CPU. On a real LAN the decoder runs on a separate machine. In our cross-machine measurements, scroll p95 climbs to ~50ms on both decoders — at that point the bottleneck is load/transport, not the codec. The decode→present deltas above hold; the glass→glass absolutes do not transfer across two clocks.
Still v0.x. The decoder tiers and SPS rewrite are in agent-core; expect them to keep moving.

Takeaway

Two bugs, same symptom ("H.264 feels laggy"), completely different causes:

The decoder's DPB buffered 8 frames because the SPS didn't declare reorder=0. Fix: rewrite the SPS at the encoder.
The media-element buffer in MSE added ~235ms that no encoder flag can reach. Fix: remove MSE, decode in WASM on non-secure contexts.

The lesson I keep relearning: when streaming feels slow, measure each stage before you change the codec. The codec usually isn't the problem — the buffer you didn't know you had is.

Try it

tapflow is MIT licensed.

npm install -g tapflow
tapflow start

🔗 GitHub: https://github.com/jo-duchan/tapflow
📖 Docs: https://www.tapflow.dev

Giving an LLM Eyes and Hands on a Mobile Simulator

duchan jo — Sun, 31 May 2026 05:39:53 GMT

Mobile QA has a scaling problem.

Unit tests and API tests run in CI automatically. But the thing that actually matters to most users — does tapping this button do the right thing, does this screen look right after this flow, does the deeplink open the correct state — none of that runs automatically. Someone has to open the simulator, walk through the steps, and verify. Every time.

The usual answer is Appium or XCUITest. But those require engineers to write and maintain test code that mirrors the UI, breaks whenever the screen changes, and only runs against builds developers already have locally.

We had a different idea. tapflow already lets humans control a simulator through a browser. What if we gave an LLM the same interface?

The interface a human uses

When a person does QA in tapflow, the loop is:

Look at the simulator screen
Decide what to do (tap, swipe, type)
Do it
Look again

This is exactly the perception-action loop that vision-capable LLMs are built for. The model sees a screenshot, reasons about what it shows, decides what action to take, and calls a tool to execute it.

We didn't need to build a new automation layer. We just needed to expose tapflow's existing WebSocket and REST APIs as MCP tools.

What the MCP server does

@tapflowio/mcp-server connects to a running tapflow relay and registers 13 tools that any MCP-compatible client can call:

list_devices       — see all simulators registered on the relay
connect_device     — join a device session
boot_device        — boot a simulator (waits up to 30s for ready state)
screenshot         — capture the current screen
tap                — tap at a pixel coordinate
swipe              — swipe between two coordinates
type_text          — type into the focused field
press_key          — press a keyboard key (Return, Delete, Escape...)
press_button       — press a hardware button (home, lock)
install_app        — install a build from App Center
launch_app         — launch an installed app
list_builds        — list available builds on the relay
disconnect_device  — end the session

Setup is two environment variables:

TAPFLOW_RELAY_URL=wss://your-relay-url
TAPFLOW_TOKEN=your-pat-token
npx @tapflowio/mcp-server

Add it as an MCP server in your client config, and those tools appear in the model's tool list.

How the tools are implemented

Screenshot — the model's eyes

The screenshot tool calls the REST endpoint we added in v0.3.0 (GET /api/v1/sessions/:id/screenshot), gets back a PNG or JPEG buffer, base64-encodes it, and returns it as MCP image content alongside the pixel dimensions:

return {
  content: [
    { type: 'image', data: buf.toString('base64'), mimeType },
    { type: 'text', text: `Screenshot saved: ${filePath} (${width}×${height}px)` },
  ],
}

The model receives the actual image. It can read text on screen, identify UI elements, notice error states — the same things a human would.

Tap and swipe — normalized coordinates

Here's the part that took a few iterations to get right. The simulator's logical coordinate space is different from screenshot pixel coordinates, and it changes with screen resolution, device type, and scale factor.

Rather than exposing logical coordinates (which the model can't reason about without device-specific knowledge), we have the model work entirely in screenshot pixel space. The tap tool takes pixel coordinates plus the screenshot dimensions, then normalizes internally:

// tools.ts
client.tap(sessionId, x / screenshotWidth, y / screenshotHeight)

The model calls screenshot first, reads the dimensions from the response, then uses those same dimensions when calling tap. This means the model can identify "the button is at roughly pixel 200, 450" from the image and tap it directly — no coordinate system translation required.
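The normalization step is small enough to show in full. This sketch adds two things the post doesn't mention and that are assumptions on my part: clamping to the 0..1 range, and rejecting non-positive dimensions. `normalizeTap` and `NormalizedPoint` are hypothetical names.

```typescript
interface NormalizedPoint { x: number; y: number }

// Convert a point in screenshot pixel space to normalized 0..1 coordinates.
function normalizeTap(
  px: number,
  py: number,
  screenshotWidth: number,
  screenshotHeight: number,
): NormalizedPoint {
  if (screenshotWidth <= 0 || screenshotHeight <= 0) {
    throw new RangeError('screenshot dimensions must be positive')
  }
  const clamp01 = (v: number) => Math.min(1, Math.max(0, v))
  return { x: clamp01(px / screenshotWidth), y: clamp01(py / screenshotHeight) }
}
```

Clamping is a guard against a model that reports a point slightly outside the image. Whether to clamp or to reject such input is a design choice, and rejecting may surface model mistakes earlier.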

Swipe works the same way, with 8 interpolated touch:move events across the duration to simulate a natural gesture:

// client.ts — swipe interpolation
const STEPS = 8
const interval = durationMs / STEPS

this.send({ type: 'input:touch:start', sessionId, payload: { x: startX, y: startY } })
for (let i = 1; i <= STEPS; i++) {  // 8 move events; the last one lands on the end point
  await delay(interval)
  const t = i / STEPS
  this.send({
    type: 'input:touch:move',
    sessionId,
    payload: {
      x: Math.round(startX + (endX - startX) * t),
      y: Math.round(startY + (endY - startY) * t),
    },
  })
}

Async operations over WebSocket

Several tools involve async operations — booting a device, installing an app — where the relay sends a confirmation back over WebSocket after the operation completes.

The client uses a waitFor pattern: register a predicate against incoming messages, return a promise that resolves when a matching message arrives, and reject if a timeout fires first.

// client.ts — waitFor
private waitFor(predicate: (msg: unknown) => boolean, timeoutMs: number): Promise<unknown> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      // Drop this waiter on timeout; guard the index so splice(-1, 1) never removes another entry
      const index = this.waiters.findIndex(w => w.resolve === resolve)
      if (index !== -1) this.waiters.splice(index, 1)
      reject(new Error('Request timed out'))
    }, timeoutMs)
    this.waiters.push({ predicate, resolve, reject, timer })
  })
}

boot_device waits up to 30 seconds. install_app waits 60 seconds. Each resolves on the confirmation message or rejects with the error payload.
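The other half of the pattern is the dispatch side: what happens when a message arrives. The post doesn't show it, so the following is a sketch under assumptions. `WaiterRegistry`, `dispatch`, the `Message` shape, and the `boot:done` message type in the usage note are all hypothetical.

```typescript
type Message = { type: string; [key: string]: unknown }

interface Waiter {
  predicate: (msg: Message) => boolean
  resolve: (msg: Message) => void
  reject: (err: Error) => void
  timer: ReturnType<typeof setTimeout>
}

class WaiterRegistry {
  private waiters: Waiter[] = []

  get pending(): number {
    return this.waiters.length
  }

  waitFor(predicate: (msg: Message) => boolean, timeoutMs: number): Promise<Message> {
    return new Promise<Message>((resolve, reject) => {
      const waiter: Waiter = {
        predicate,
        resolve,
        reject,
        timer: setTimeout(() => {
          this.remove(waiter)
          reject(new Error('Request timed out'))
        }, timeoutMs),
      }
      this.waiters.push(waiter)
    })
  }

  // Call for every incoming message; settles the first waiter whose predicate matches.
  dispatch(msg: Message): boolean {
    const waiter = this.waiters.find(w => w.predicate(msg))
    if (!waiter) return false
    clearTimeout(waiter.timer) // stop the timeout so it cannot reject later
    this.remove(waiter)
    waiter.resolve(msg)
    return true
  }

  private remove(waiter: Waiter): void {
    const index = this.waiters.indexOf(waiter)
    if (index !== -1) this.waiters.splice(index, 1)
  }
}
```

A boot call would then look like `await registry.waitFor(m => m.type === 'boot:done', 30_000)`, with the socket's message handler calling `registry.dispatch(parsed)` for each frame it receives.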

What a session looks like

A model running a login flow might do this:

1. list_devices → pick a device
2. connect_device
3. list_builds → find the build to test
4. boot_device
5. install_app
6. launch_app
7. screenshot → see the login screen
8. tap(email field coordinates) → focus the input
9. type_text("test@example.com")
10. tap(password field coordinates)
11. type_text("password")
12. tap(login button coordinates)
13. screenshot → verify the home screen loaded
14. disconnect_device

Each screenshot gives the model a chance to verify state before proceeding. If step 13 shows an error message instead of the home screen, the model knows something went wrong.

Where we are: experimental

The version says 0.3.1-experimental.1 for a reason. The tools work, but the layer needs more hardening before we'd call it reliable.

The core issue is consistency. The same sequence of tool calls should produce predictable behavior every time. Right now it doesn't always — there are timing edge cases where an action fires before the UI has fully settled, device state can drift between steps without the model noticing, and error recovery when something unexpected happens mid-flow is rough.

These are solvable problems, but we want to solve them before presenting this as something teams should build pipelines on.

Where we're going: CI/CD without a QA script

The direction we're aiming at is using the MCP server as the foundation for LLM-driven smoke tests in CI.

The scenario: a new build passes unit tests and gets uploaded to App Center. A CI step spins up the MCP server, points it at the relay, and gives a model a natural-language test spec:

"Install the latest build. Log in with test credentials. Navigate to the cart, add an item, and confirm the checkout screen shows the correct total. Take a screenshot at each step."

The model does the steps, captures evidence, and reports what it saw. No automation code to write. No selectors to maintain when the UI changes. The spec is just a description of what a human would do.

This isn't production-ready yet. The stability work comes first. But the pieces — browser-controllable simulators, screenshot REST endpoint, MCP tool layer — are in place. The question is whether the model can run a flow reliably enough to be trusted in CI without a human verifying each run.

We think it can. That's what we're building toward.

Try the MCP server (experimental)

npm install -g @tapflowio/mcp-server@experimental

You'll need a running tapflow relay and a PAT with viewer scope. Configure it in your MCP client:

{
  "mcpServers": {
    "tapflow": {
      "command": "npx",
      "args": ["@tapflowio/mcp-server"],
      "env": {
        "TAPFLOW_RELAY_URL": "wss://your-relay-url",
        "TAPFLOW_TOKEN": "your-pat-token"
      }
    }
  }
}

If you try it and hit rough edges, open an issue — that feedback is exactly what's shaping the stability work.

🔗 GitHub: https://github.com/jo-duchan/tapflow
📖 Docs: https://www.tapflow.dev/guide/mcp-server

tapflow v0.3.x: Deeplinks, Keyboard Shortcuts, Screenshot API, and an Experimental MCP Server

duchan jo — Fri, 29 May 2026 07:56:44 GMT

tapflow started as a simple idea: stream iOS simulators and Android emulators to the browser so anyone on the team can do mobile QA without touching Xcode or Android Studio. v0.2.x got the core working — streaming, touch input, App Center, session recording.

v0.3.x is about filling in the gaps that matter during actual QA sessions. This post covers what shipped and ends with something we're still figuring out: an experimental MCP server that lets LLM agents control simulators directly.

Deeplink execution from the browser

https://youtu.be/MQaikcQd37w

The one that came up most in real usage: testers frequently need to trigger deeplinks to verify specific app states — product detail pages, notification payloads, OAuth redirects. The old workflow always involved a mobile developer — either having them trigger it on their machine or building a debug menu inside the app specifically for this purpose.

In v0.3.0 you can now fire a deeplink directly from the QA session toolbar. Click the link icon (or ⌘K), enter the URL, and it executes on the active device.

Under the hood it's a new open-url WebSocket message type that routes browser → relay → agent:

Browser ──open-url──► Relay ──open-url──► Mac Agent
                                              │
                           iOS: xcrun simctl openurl booted <url>
                           Android: adb shell am start -a android.intent.action.VIEW -d <url>
Browser ◄──open-url:done/error── Relay ◄──────┘

The DeviceAgent interface got a new openUrl(url) method, so both iOS and Android agents implement it symmetrically. The relay routes it and returns either open-url:done or open-url:error with the failure reason. The dashboard shows a toast either way.
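On the agent side, the two platform commands can be built as argument vectors. This is a sketch, not the real `openUrl` implementation: `openUrlArgv` is a hypothetical helper, the scheme check is my own addition, and the quoting reflects the fact that `adb shell` hands its arguments to the device shell as one command line. The full intent action name is `android.intent.action.VIEW`.

```typescript
function openUrlArgv(platform: 'ios' | 'android', url: string): string[] {
  // Assumption: reject anything that does not start with a URL scheme.
  if (!/^[a-z][a-z0-9+.-]*:/i.test(url)) throw new Error('not an absolute URL: ' + url)

  if (platform === 'ios') {
    return ['xcrun', 'simctl', 'openurl', 'booted', url]
  }
  // The device shell re-parses the command, so single-quote the URL to protect '&' and '?'.
  const quoted = "'" + url.replace(/'/g, "'\\''") + "'"
  return ['adb', 'shell', 'am', 'start', '-a', 'android.intent.action.VIEW', '-d', quoted]
}
```

Returning an argv array rather than a command string keeps the host-side spawn free of shell parsing, which leaves the device shell as the only place quoting matters.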

Keyboard shortcuts for simulator controls

QA sessions are repetitive. Reaching for the toolbar icons on every screenshot or rotation adds up. v0.3.0 adds keyboard shortcuts for all the common actions:

Shortcut	Action
`⌘K`	Open deeplink dialog
`⌘S`	Take screenshot
`⌘⇧Y`	Start / stop recording
`⌘⇧O`	Rotate simulator
`⌘⇧U`	iOS: press Home
`⌘⇧K`	iOS: toggle software keyboard

Tooltips now show the shortcut hint inline, so they're discoverable without reading docs. One implementation detail worth noting: key detection uses e.code instead of e.key. This matters for IME input — Korean, Japanese, and Chinese users composing text would otherwise trigger shortcuts mid-composition.
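A sketch of matching on `code` rather than `key`, using the shortcut table above. `KeyLike`, `Action`, `SHORTCUTS`, and `matchShortcut` are hypothetical names; `KeyLike` mirrors the few `KeyboardEvent` fields the matcher needs, so the example runs outside a browser.

```typescript
interface KeyLike { code: string; metaKey: boolean; shiftKey: boolean }

type Action = 'deeplink' | 'screenshot' | 'record' | 'rotate' | 'home' | 'keyboard'

// Keys are physical key codes, so the table is independent of layout and IME state.
const SHORTCUTS: Record<string, Action> = {
  'Meta+KeyK': 'deeplink',
  'Meta+KeyS': 'screenshot',
  'Meta+Shift+KeyY': 'record',
  'Meta+Shift+KeyO': 'rotate',
  'Meta+Shift+KeyU': 'home',
  'Meta+Shift+KeyK': 'keyboard',
}

function matchShortcut(e: KeyLike): Action | null {
  if (!e.metaKey) return null
  const combo = 'Meta+' + (e.shiftKey ? 'Shift+' : '') + e.code
  return SHORTCUTS[combo] ?? null
}
```

`code` identifies the physical key, so `KeyK` is reported the same way whether the active input source would produce "k", a Hangul jamo, or a composition in progress.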

Screenshot REST endpoint

This one unlocks a new class of CI usage.

GET /api/v1/sessions/:sessionId/screenshot returns a PNG or JPEG of the current simulator screen. You can call it with a PAT from any CI step: before asserting a visual state, during an automated flow, or after a build install.

The tricky part was the request/response pattern. The relay communicates with agents over WebSocket (long-lived, multiplexed), but HTTP is request/response. Screenshots are taken on the Mac, not the relay.

We introduced a requestId-based pending map: the relay generates a unique ID, sends a take-screenshot message to the agent over WebSocket, registers a promise keyed by requestId, and resolves it when screenshot:result comes back. The HTTP handler awaits that promise and sends the binary payload:

GET /api/v1/sessions/:id/screenshot
    │
    ▼
Relay: generate requestId, push to pending map
    │
    ├──screenshot-request──► Mac Agent
    │                            │ simctl io screenshot (iOS)
    │                            │ ADB screencap (Android)
    ◄──screenshot:result─────────┘
    │
    ▼
HTTP 200 (binary image)
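The pending map at the center of that flow can be sketched in a few lines. This is an illustration under assumptions: `PendingRequests`, `create`, and `settle` are hypothetical names, and a counter stands in for whatever unique ID scheme the relay actually uses.

```typescript
class PendingRequests<T> {
  private nextId = 0
  private pending = new Map<string, (value: T) => void>()

  get size(): number {
    return this.pending.size
  }

  // Register a request: send `requestId` to the agent, await `promise` in the HTTP handler.
  create(): { requestId: string; promise: Promise<T> } {
    const requestId = 'req-' + String(++this.nextId)
    const promise = new Promise<T>(resolve => {
      this.pending.set(requestId, resolve) // the executor runs synchronously, so the entry exists on return
    })
    return { requestId, promise }
  }

  // Called when a result message arrives; returns false for unknown or already-settled IDs.
  settle(requestId: string, value: T): boolean {
    const resolve = this.pending.get(requestId)
    if (!resolve) return false
    this.pending.delete(requestId)
    resolve(value)
    return true
  }
}
```

A production version would presumably also need a timeout that removes the entry and fails the HTTP request if the agent never answers. That part is left out here.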

iOS supports both PNG and JPEG via --type. Android returns PNG regardless — ADB doesn't offer format selection at this layer.

PAT scope enforcement

Personal Access Tokens existed before v0.3.0, but the scope field wasn't actually enforced on API routes. A developer-scoped token could call any endpoint.

v0.3.0 adds proper scope checks to all builds endpoints. Scopes are now enforced at the middleware layer: a token issued for builds access can upload and manage builds, but can't touch team settings or session data. This makes it safe to issue narrow tokens for CI pipelines without giving them broader access than they need.

Frame performance instrumentation

For anyone debugging streaming latency: v0.3.x adds per-frame hop timestamps via a binary header (TFFE — tapflow frame envelope). Each frame now carries the capture time, relay-received time, and client-received time in an 8-byte prefix before the JPEG/H.264 payload.

The dashboard can surface a live performance overlay showing frame latency broken down by segment (agent → relay, relay → browser). Useful when diagnosing whether a slowdown is in the network leg or the browser decode path.

Experimental: an MCP server

v0.3.x also ships @tapflowio/mcp-server (0.3.1-experimental.1) — it exposes tapflow's WebSocket/REST APIs as MCP tools so an LLM agent can drive a simulator the same way a human does in the browser: screenshot → reason → tap/type → screenshot again.

It's early (the experimental suffix is literal — consistency and error-recovery still need work), and it's a big enough topic to have its own write-up: Giving an LLM Eyes and Hands on a Mobile Simulator covers the full tool list, the normalized-coordinate tap/swipe, and where this is headed (LLM-driven smoke tests in CI).

npm install -g @tapflowio/mcp-server@experimental

Try it

npm install -g tapflow
tapflow start
# http://localhost:4000

🔗 GitHub: https://github.com/jo-duchan/tapflow
📖 Docs: https://www.tapflow.dev

Your whole team can now run mobile QA from the browser. Here's how we built it.

duchan jo — Thu, 28 May 2026 15:01:11 GMT

If you work on a mobile product, you've probably seen this.

Physical devices are never enough. Covering every OS version is even harder — iOS doesn't support downgrading, so maintaining a range of versions means managing a pool of locked devices, which is overhead nobody wants.

But the bigger friction is access. Simulators only run on a developer's Mac, behind complex toolchains. Anyone on the team who isn't a mobile developer has to ask one every single time they need to verify something:

Server / FE developer — "How do I install the sandbox build to check what was deployed?"

Product manager — "I keep having to install and remove different versions just to compare behavior."

Designer — "I need to check the layout across screen sizes, but I don't have the right devices."

Cloud simulator services exist. But uploading internal app builds to an external service — and paying monthly fees for simulators already running on Macs you own — was never something we wanted to do.

So we built tapflow: an open-source, self-hosted tool that streams iOS simulators and Android emulators to the browser. Anyone on your team opens the dashboard, picks a device, and starts interacting — no Xcode, no Android Studio, no setup.

npm install -g tapflow
tapflow start
# → http://localhost:4000

This post is about how we built it — specifically the parts that weren't obvious.

Demo Video

https://youtu.be/BfoS-i5aMcM

Why we didn't just use Appetize or BrowserStack

Both services solve the browser access problem. We evaluated them seriously. Before signing up, we hit two blockers:

Cost. Appetize starts at $59/month and scales with team size.
Data. Both require uploading your app binary to external servers. For anything with sensitive business logic, that's a non-starter.

We already had Macs in the office. So we built tapflow instead.

Architecture

Browser (your team)  ←─ WebSocket ─→  Relay Server  ←─ WebSocket (outbound) ─→  Mac Agent
                                     (Linux / Mac)                           (iOS · Android)

The Mac Agent connects outbound to the relay — no firewall or NAT configuration needed. The relay can run on a small Linux server (a ~$5/month Fly.io instance handles it). App data never leaves your infrastructure.

iOS touch — without WebDriverAgent

WebDriverAgent was the obvious starting point. We didn't use it.

The problems: WDA breaks on Xcode updates, requires provisioning profiles, needs the app to be in the foreground, and adds a layer of process management complexity we didn't want to own.

Instead, we load CoreSimulator.framework dynamically via dlopen in a Swift binary (touch-helper), then inject HID events directly through SimDeviceLegacyHIDClient and IndigoHID:

// touch-helper — HID event injection into the simulator
let client = SimDeviceLegacyHIDClient(device: device)
let event = IndigoHIDEvent.touch(x: x, y: y, phase: .began)
client.send(event)

This bypasses WDA entirely. It works independently of the app lifecycle and doesn't break on Xcode updates.

The tradeoff: these are private APIs. They've been stable across Xcode versions in our testing, but Apple could remove them. We think that's a better bet than WDA's reliability track record.

iOS streaming — IOSurface

xcrun simctl io screenshot works, but the latency is too high for interactive use.

Instead, we access IOSurface directly through SimulatorKit, pulling frames straight from the simulator's GPU surface. ~~Frames are JPEG-encoded on the Mac and streamed over WebSocket at ~30fps.~~

For slow clients, we drop frames rather than buffering — backpressure is handled at the WebSocket layer to prevent memory accumulation on the relay when a client can't keep up.

Update: JPEG was the first version. The default is now H.264 with a buffer-free 2-tier browser decoder (WebCodecs on secure contexts, WASM on plain HTTP). The full teardown — why H.264 first felt worse, and the two fixes that solved it — is a separate post: We switched simulator streaming to H.264 and it felt worse.

Android — scrcpy H.264 → WebGL

Android was cleaner. scrcpy already does the hard work of capturing the emulator display as an H.264 stream.

We receive the H.264 Annex B stream from scrcpy over a local TCP socket, relay it through WebSocket, then decode and render it in the browser. Android now shares the same buffer-free 2-tier decoder as iOS (see the update above).

scrcpy server (emulator)
    → TCP socket
    → Mac Agent
    → WebSocket
    → Browser (WebGL2)

Pinch gestures

scrcpy's INJECT_TOUCH_EVENT supports multiple pointer IDs. Pinch is implemented by sending two simultaneous touch events:

// ScrcpyControl — multi-touch injection
pinchStart(x1: number, y1: number, x2: number, y2: number): void {
  this.touchDown(0, x1, y1)
  this.touchDown(1, x2, y2)
}

What's included

Beyond streaming and input:

App Center — upload .app.zip (iOS) or .apk (Android), manage build status (Backlog / In Progress / Done / Rejected), REST API + Personal Access Tokens for CI/CD integration
Session recording — record and share QA sessions, kept for ~72 hours before automatic cleanup
Team management — invite links, role-based access (Admin / Developer / QA / Viewer)
Mac resource monitoring — CPU and RAM time-series charts per agent

Honest limitations

iOS simulators require macOS — Apple's constraint, not ours
One Mac typically handles 2–4 simultaneous simulators depending on RAM; connect multiple Macs to pool devices
Still v0.x — breaking changes may appear before v1.0

Try it

tapflow is MIT licensed.

npm install -g tapflow
tapflow start
tapflow init  # create the first admin account

For team deployments with a shared relay:

# Relay server (Linux/macOS)
JWT_SECRET=$(openssl rand -hex 32) tapflow relay start

# Each Mac agent
tapflow agent start --relay wss://your-relay-url

🔗 GitHub: https://github.com/jo-duchan/tapflow
📖 Docs: https://www.tapflow.dev
