Testing AI-built apps with QA agents


TL;DR:

  • When AI writes the code, you can’t trust the same AI to prove it works.
  • We built an independent, black-box QA agent that treats every deployed app like a real user would: clicking, navigating, and verifying outcomes.
  • Fast enough and cheap enough to run on every deploy.
  • We call this autonomous end-to-end QA: black-box testing that runs after deployment and returns QA snapshots, visual bug reports, and logs to the chat.

We used to trust code because a human owned it. Someone wrote it, reviewed it, lost sleep over it, and could explain why it works. AI-built apps (often called vibe coding) break that assumption. Features can ship after a handful of prompts and automated diffs, and no one can honestly claim they understand every line that just changed. QA becomes the source of truth: automated, continuous proof that the system still behaves correctly.

In a chat-native deployment platform such as AppDeploy, this is doubly true. Here is what happens every time you deploy an app with AppDeploy: once the build completes, a QA agent automatically runs a test suite to verify the quality of the app. With AppDeploy, this starts as a test-driven development (TDD) approach where the tests are defined by what the app should do before implementation. The coding agent writes the tests first, then implements the app until the suite passes. If the agent finds bugs, AppDeploy returns structured feedback to the coding agent, including a description of each failure, the relevant screenshot, and browser console errors. The coding agent uses that feedback to fix the issues and calls AppDeploy to redeploy. QA reruns automatically, and the cycle repeats until the tests pass.

[Flow diagram: you prompt → coding agent builds → deploy on AppDeploy → AppDeploy QA agent runs → if tests fail, repeat; if tests pass, working app]

A simple way to think about the loop:

  • You prompt
  • A coding agent builds
  • AppDeploy deploys
  • A QA agent runs end-to-end tests on the deployed app
  • Results go back to chat
  • The agent fixes and redeploys
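The loop above can be sketched in a few lines. This is a minimal simulation with stubbed stand-ins (`deploy`, `run_qa`, `apply_fixes` are hypothetical, not the AppDeploy API); the point is the control flow: deploy, verify, feed failures back, repeat until green or the budget runs out.

```python
from dataclasses import dataclass, field

@dataclass
class QAReport:
    passed: bool
    failures: list = field(default_factory=list)

# Stubbed platform calls: illustrative stand-ins, not real AppDeploy endpoints.
def deploy(app):
    return f"https://example.test/v{app['version']}"

def run_qa(url, broken):
    if broken:
        return QAReport(False, [{"flow": "login", "error": "button not found"}])
    return QAReport(True)

def apply_fixes(app, failures):
    # The coding agent patches the app using structured failure feedback.
    return dict(app, version=app["version"] + 1, bugs=app["bugs"] - len(failures))

def build_and_verify(app, max_iterations=5):
    for _ in range(max_iterations):
        url = deploy(app)
        report = run_qa(url, broken=app["bugs"] > 0)
        if report.passed:
            return app, True          # working app
        app = apply_fixes(app, report.failures)
    return app, False                 # loop budget exhausted

app, ok = build_and_verify({"version": 1, "bugs": 2})
print(ok, app["version"])  # True 3
```

Two simulated bugs take two fix-and-redeploy cycles, so the third deploy passes.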

What a QA agent does for AI-built apps

So how do you QA your AI-built apps?

You cannot vibe your way into confidence by asking the same agent to run a Playwright script once and call it done. When the builder and the checker share the same assumptions, you get agreement, not assurance. Real QA means independent evidence that the system works in the environment it will actually run in, across the flows that matter, with the dependencies it really uses, and under the failure modes it will eventually hit. The whole point is to replace “it looks right” with “it is proven.”

Why the coding agent can’t fully test itself (unit and end-to-end testing)

Writing code and proving code are different jobs

A coding agent is great at producing plausible implementations quickly. Verification is different: it needs an independent notion of truth and a willingness to fail the change. When the same agent is both builder and judge, the incentives and the failure modes line up in the wrong direction.

The same brain problem

If the agent writes the implementation and the tests, both tend to reflect the same mental model. That creates correlated failure: the tests pass because they validate what the agent thought the system should do, not necessarily what the system must do. You get internal consistency, not external correctness.

"Green" is easy to optimize for

Even without malicious intent, an agent optimizing for completion will drift toward low-resistance tests: happy paths, shallow assertions like “returns 200,” snapshots of whatever output it produced, or mocks that bypass the risky parts. What does a passing suite even mean, if the suite wasn’t designed to fail?

What this means for AI-built app platforms

Let the agent help generate tests and suggest scenarios, but don’t let it be the final authority. The platform needs independent, repeatable verification in realistic environments, with contracts and gates that can’t be waved away by a single “looks good” run.

Why real QA is difficult, especially with a backend

Given the limits from the last section, you end up doing something close to black-box QA. You treat the system like a user would. You click, you call APIs, you run flows, you verify outcomes. That gets you away from “same brain” testing and away from tests that are optimized to go green. It is also where things get hard.

State is a problem

Tests are only reliable if you can set up, control, and reset state, and that’s hard once real data and history enter the picture.

Backend state is harder than frontend state

Frontend state usually lives in one session; backend state lives across databases, caches, queues, and services, and it persists between runs.

Background jobs make tests flaky

If the system has async elements (queues, jobs, webhooks), the test has to guess when to check, so it sometimes checks too early and fails.
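The standard fix is to poll instead of sleeping for a fixed interval: re-check the expected outcome until it appears or a hard timeout expires. A minimal sketch (the helper and the fake job are illustrative):

```python
import time

def wait_until(check, timeout=10.0, interval=0.2):
    """Poll `check` until it returns truthy or the timeout expires.
    Async work (queues, jobs, webhooks) finishes at unpredictable times,
    so we keep re-checking instead of guessing once with a fixed sleep."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = check()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("condition not met within timeout")

# Usage: wait for a background job's side effect instead of sleeping.
state = {"emails_sent": 0}

def fake_job_progress():
    # Stands in for observing an async worker gradually completing.
    state["emails_sent"] += 1
    return state["emails_sent"] >= 3

assert wait_until(fake_job_progress, timeout=2.0, interval=0.01)
```

The test still has a deadline, so a genuinely broken job fails loudly rather than hanging, but a merely slow job no longer fails spuriously.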

Failures, retries, and timeouts are a big part of QA

A lot of backend bugs only show up when something is slow or fails, then you get retries, partial success, or the same event processed twice.
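A classic example is duplicate delivery: a retry redelivers the same webhook event, and a naive handler applies the side effect twice. One common guard is an idempotency key. This is an illustrative sketch of the pattern, not AppDeploy code:

```python
# In-memory stand-ins for a processed-events table and an account balance.
processed = set()
balance = {"credits": 0}

def handle_event(event):
    key = event["id"]            # idempotency key supplied by the sender
    if key in processed:
        return "duplicate"       # safe to acknowledge again, no second side effect
    processed.add(key)
    balance["credits"] += event["amount"]
    return "applied"

# A retry redelivers the same event; the handler must not double-apply it.
evt = {"id": "evt_1", "amount": 10}
print(handle_event(evt), handle_event(evt), balance["credits"])  # applied duplicate 10
```

Black-box QA catches this class of bug only if it injects the failure (redeliver the event) and then verifies the outcome (the balance) instead of just the HTTP response.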

How we built our QA agent

The goal is simple: treat the app like a user would. Click, type, navigate, and verify outcomes. Do it fast. Do it cheap. Do it reliably. Do it without “cheating” in ways a real user cannot.

Test plan: where it comes from

A coding agent is quite good at writing a thorough test plan. It can read the spec, routes, and UI structure and produce coverage that a human QA would recognize.

In AppDeploy, this follows a test-driven development (TDD) approach: the tests are defined from what the app should do before implementation, and the coding agent writes them first, then implements the feature. The catch is execution: the same agent that builds the feature will cut corners when it runs the plan. It will prefer the shortest path to "green." So you split responsibilities: generation can be shared, verification must be independent.

Sanity vs full tests

We want two modes.

Sanity is what runs every time. It is short, high signal, and focuses on “is the app basically alive.” Login, one core flow, one write, one read, one permission check. The most well-trodden happy path.

Full is what runs when risk is higher. First deploy or big diffs. Auth changes. Payment flows. Migrations. New dependencies. Full is slower, but it is targeted. It should not be “run everything always.”
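The sanity-vs-full decision can be expressed as a small risk gate. The signal names and threshold below are illustrative, not AppDeploy's real schema:

```python
# Deploy signals that escalate a run from sanity to full coverage.
FULL_TRIGGERS = {"first_deploy", "auth_change", "payment_flow",
                 "migration", "new_dependency"}

def pick_mode(signals: set, diff_lines: int, big_diff_threshold: int = 500) -> str:
    """Choose the QA mode for a deploy based on risk signals."""
    if signals & FULL_TRIGGERS or diff_lines >= big_diff_threshold:
        return "full"     # slower, targeted at the risky areas
    return "sanity"       # login + one core flow, runs on every deploy

print(pick_mode({"auth_change"}, diff_lines=40))   # full
print(pick_mode(set(), diff_lines=40))             # sanity
print(pick_mode(set(), diff_lines=900))            # full
```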

Speed vs accuracy

We need QA to finish in under a minute to keep the iteration loop tight. The hard part is that accuracy is expensive.

If we use a top-tier model for every tiny decision (every hover, every scroll, every “what button is this”), we might get slightly better decisions, but we blow the budget and the latency. If we send the model the entire DOM, a full screenshot, and the full interaction history on every step, we might get more context, but we slow everything down and drown the model in noise.

Building a QA agent means balancing these tradeoffs: when to spend intelligence and when to run on rails; which context is essential and which is just bulk.

Handling backend state

Backend state is the main thing that breaks black-box QA. We solve it by isolating runs so they can’t contaminate each other. That means a dedicated backend per run, or hard isolation at the database and queue level.

We then make state management explicit. Either we reset to a known baseline between tests, or we keep a shared baseline and only run tests that are designed to be state-safe. Once isolation and resets are real, we can safely run tests in parallel without one flow poisoning another.

The QA agent loop

At this point, the hard part isn’t the model. It’s the system around it: how we split work, what we send to the model, and how we keep runs fast, cheap, and reliable.

We build it to be cheap and fast, without letting it cheat. We use a multi-agent setup: a strong manager model plans, routes, and decides when to escalate, while smaller worker models execute steps quickly and in parallel. We parallelize whenever flows don’t share state, and we keep deterministic intervention points: hard timeouts, bounded retries, and clear escalation rules.
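The manager/worker split plus bounded retries can be sketched as an escalation policy: cheap workers handle each step, and only steps that keep failing escalate to the stronger model. The worker and manager below are trivial stubs for illustration:

```python
def run_step(step, worker, manager, max_retries=2):
    """Try the cheap worker a bounded number of times, then escalate."""
    for _ in range(max_retries + 1):
        result = worker(step)
        if result["ok"]:
            return result
    return manager(step)  # deterministic escalation point, no infinite loops

# Stubs standing in for small and strong model calls.
def cheap_worker(step):
    return {"ok": step != "ambiguous-modal", "by": "worker"}

def strong_manager(step):
    return {"ok": True, "by": "manager"}

print(run_step("click-login", cheap_worker, strong_manager)["by"])      # worker
print(run_step("ambiguous-modal", cheap_worker, strong_manager)["by"])  # manager
```

Most steps never touch the expensive model, which is what keeps the run fast and cheap, while the escalation rule keeps hard cases from silently failing.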

For perception, we use both VDOM and pixels. The VDOM is great when it's clean and accessible. Pixels are how we survive overlays, canvases, and cases where the DOM lies. Compressing the DOM into a compact VDOM is critical; otherwise we drown in noise and latency.
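The spirit of DOM compression is to keep only what the agent can act on. A toy sketch using the standard-library HTML parser, keeping just interactive elements and their labels (the element whitelist is illustrative):

```python
from html.parser import HTMLParser

# Elements a user can actually interact with (illustrative whitelist).
INTERACTIVE = {"a", "button", "input", "select", "textarea"}

class VDOMCompressor(HTMLParser):
    """Collapse a page into a list of interactive nodes with their labels."""
    def __init__(self):
        super().__init__()
        self.nodes = []
        self._open = None

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE:
            a = dict(attrs)
            self._open = {"tag": tag, "id": a.get("id"), "text": ""}
            self.nodes.append(self._open)

    def handle_data(self, data):
        if self._open is not None:
            self._open["text"] += data.strip()

    def handle_endtag(self, tag):
        if tag in INTERACTIVE:
            self._open = None

html = '<div><p>Welcome back</p><button id="login">Log in</button></div>'
c = VDOMCompressor()
c.feed(html)
print(c.nodes)  # [{'tag': 'button', 'id': 'login', 'text': 'Log in'}]
```

The decorative paragraph disappears; only the clickable button survives, which is the kind of reduction that keeps the model's context small and on-task.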

We keep actions as close as possible to real user actions. Click, type, select, scroll. We avoid injecting big scripts that do things a user can’t do, because that produces fake greens. And we handle the boring realities that decide whether black-box QA works: scroll and below-the-fold elements, modals and toasts, focus traps, mobile vs web viewports, and interaction differences across devices.

The feedback loop back into the coding agent

The point of black-box QA in an AI-built app platform is not a report. It’s a loop. QA produces a hard signal, and the coding agent uses that signal to make a focused fix, then QA reruns to confirm the fix actually worked.

We feed the coding agent more than just “pass or fail.” We expose what the app actually did during the run: runtime errors, browser console logs, and the error logs produced by the QA cycles themselves. That extra telemetry is what turns a vague failure into something the agent can debug. The loop is simple but essential: run black-box QA, capture the smallest set of high-signal errors, patch, rerun the same flow, repeat until it’s clean.
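The shape of that feedback matters: each failure carries the evidence the coding agent needs to debug. A sketch of such a payload, with illustrative field names (not AppDeploy's real schema):

```python
from dataclasses import dataclass, field

@dataclass
class QAFailure:
    flow: str                  # which user flow broke ("checkout", "login")
    description: str           # what the agent observed vs what was expected
    screenshot_path: str       # visual evidence captured at the failing step
    console_errors: list = field(default_factory=list)  # browser console logs

@dataclass
class QAFeedback:
    passed: bool
    failures: list = field(default_factory=list)

fb = QAFeedback(passed=False, failures=[
    QAFailure("login",
              "submit button stayed disabled after typing valid credentials",
              "shots/login-03.png",
              ["TypeError: cannot read property 'disabled' of null"]),
])
print(fb.passed, len(fb.failures))  # False 1
```

A bare "login test failed" forces the agent to guess; a description, screenshot, and console trace turn the same failure into a focused patch.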

What this unlocks

AI-built apps make building cheap. QA agents make shipping cheap.

Once black-box QA is automated and wired into the loop, a few things become possible at the product level. We can let users iterate aggressively without turning every prompt into a production gamble. We can catch breakage the moment it’s introduced, not when a user stumbles on it. And we can keep raising quality over time, because every failure becomes a new guardrail for the next run.

Key takeaway

If code can appear without a human author, confidence has to come from somewhere else. A QA agent is that somewhere else: continuous, external verification that turns AI-built apps from demos into software you can actually run. On AppDeploy, this runs automatically on every deploy, closing the loop between the coding agent and the live app.

FAQ

Can an AI agent test its own code?

It can help generate test plans and suggest scenarios, but it should not be the final authority. When the same agent writes the code and the tests, both reflect the same assumptions, which creates correlated failures. Independent verification in the real runtime environment is what produces trustworthy results.

What is black-box QA for AI-built apps?

Black-box QA means testing the deployed app the way a real user would: clicking, navigating, calling APIs, and verifying outcomes, without access to the source code or internal state. This avoids the same brain problem where the builder’s assumptions leak into the tests.

How fast does QA need to be for AI-built apps?

Fast enough to run on every deploy without breaking the iteration loop. On AppDeploy, the QA agent completes a sanity pass in under a minute. A full test suite runs when risk is higher (first deploy, auth changes, new dependencies) and takes longer, but is targeted rather than exhaustive.

How does AppDeploy handle QA for apps with a backend?

Each QA run gets an isolated backend environment so tests cannot contaminate each other. State is either reset to a known baseline between tests or tests are designed to be state-safe. This isolation makes it possible to run flows in parallel without one poisoning another.