Tautik Agrahari

computer use agent story

a cua (computer use agent) is just an llm that controls a browser. you give it a task like "log into some saas dashboard and screenshot the encryption settings", it takes a screenshot, sends the screenshot to claude with a list of tools (click, type, scroll), claude returns a tool call like click(x=420, y=180), you apply that to the browser, take another screenshot, and loop. that's the whole thing. it's a while not done: screenshot → llm → tool call loop with a real browser on the other end.

the part that's not obvious is where the browser actually runs. it can't run on the user's laptop (you'd need them online and giving you access). it can't run on your api server (you'd be sharing one chrome between every request). so it has to run on a remote linux box somewhere, with chrome inside it, and your backend talks to that chrome over the network. that "remote linux box with chrome" is the whole infra problem. everything else — the prompts, the planner, the auth, the recording — is built around that one primitive.

the screenshot loop, slightly less hand-wavy

claude (and openai's models, and gemini) ship "computer use" tools — basically a function spec that says "i can call click(x, y), type(text), scroll(direction) etc." you wire that up like any other tool call. on each turn:

  1. take a screenshot of the browser
  2. send it to claude with the user's task and the tool spec
  3. claude returns one tool call
  4. you execute it on the browser
  5. repeat until claude says <<TASK_COMPLETE>> or hits a max-turn limit

that's it. the model is doing all the perception (where's the button? what does the screen say?) and all the planning (what do i click next?). you're just the hands.
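the loop above, as a sketch. `ToolCall`, `call_model`, and `computer` are hypothetical stand-ins, not a real provider api — `computer` is anything with screenshot/click/type, and `call_model` is whatever wraps your provider's computer-use tool endpoint:

```python
# hypothetical sketch of the screenshot -> llm -> tool-call loop.
# all names here are stand-ins; the loop itself is the whole agent.
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str                              # "click", "type", "done", ...
    args: dict = field(default_factory=dict)

async def run_task(task: str, computer, call_model, max_turns: int = 50) -> bool:
    history: list[ToolCall] = []
    for _ in range(max_turns):
        png = await computer.screenshot()            # 1. look
        call = await call_model(task, png, history)  # 2-3. model picks one tool
        history.append(call)
        if call.name == "done":                      # model said <<TASK_COMPLETE>>
            return True
        if call.name == "click":                     # 4. act
            await computer.click(call.args["x"], call.args["y"])
        elif call.name == "type":
            await computer.type(call.args["text"])
        # ... scroll, keypress, drag, etc.
    return False  # hit the turn limit without finishing
```

the point is how little lives here: perception and planning are entirely on the model side, and this function is just dispatch.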

the screenshot loop

you might ask, tautik — what drives the browser? playwright. python library that speaks cdp (chrome devtools protocol — same thing your devtools panel uses when you press f12). when chrome launches with --remote-debugging-port=9222, anyone with the websocket url can connect and drive it — click, type, screenshot, run js, intercept network. so your backend does await playwright.chromium.connect_over_cdp(url) and from then on it's as if the chrome is on your laptop, except the actual browser is on a vm somewhere.
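a minimal sketch of attaching over cdp, assuming playwright is installed and `cdp_url` is whatever your vendor or vm hands back. the import is deferred so the module still loads without the package:

```python
# sketch, assuming playwright is installed and a remote chrome is reachable.
async def attach(cdp_url: str):
    from playwright.async_api import async_playwright  # deferred on purpose

    pw = await async_playwright().start()
    browser = await pw.chromium.connect_over_cdp(cdp_url)
    # connect_over_cdp attaches to an *existing* browser, so reuse its
    # default context instead of opening a fresh blank one
    context = browser.contexts[0] if browser.contexts else await browser.new_context()
    page = context.pages[0] if context.pages else await context.new_page()
    return pw, browser, page
```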

where chrome runs — the actual hard problem

aight so two paths.

option 1: rented browser-as-a-service. vendors like browserbase hand you back a cdp_url and a stream_url and call it a day. where chrome runs and how it stays alive is their problem. easy on day one. the cost: they decide which features get exposed (recordings? proxies? auth storage?) and how their pricing scales. you get whatever they ship.


option 2: raw linux vm + install chrome yourself. services like e2b give you firecracker-as-a-service — AsyncSandbox.create() returns a root vm with commands.run(), files.read/write, port exposure. no browser, no display, no vnc, no recording. you trade up on control, you trade down on convenience.

most teams should start with option 1. you'll know when it's time to move down a layer — usually when you keep hitting the vendor's ceiling on something (recording, proxies, custom flags, snapshot timing). don't migrate prematurely; you're signing up for a stack you now own end to end.

the remote display stack, demystified

if you go option 2, here's what you actually need on the vm. linux servers don't have monitors, but chrome (when not running headless) wants to draw to a screen. so:

  1. xvfb — a virtual framebuffer, a fake monitor in memory for chrome to draw on
  2. xfce — a lightweight desktop environment, so pages render on a normal-looking desktop
  3. x11vnc — serves that x display over vnc, so a human can watch or intervene
  4. novnc — bridges vnc to a websocket, so the stream works in a plain browser tab
  5. chrome — non-headless, drawing to the fake display, launched with --remote-debugging-port=9222

so the chain looks like:

the remote display stack

none of this is novel — vnc has been around forever and people have been remote-debugging headless linux for two decades. the only thing you have to do is orchestrate the startup order (xvfb → xfce → x11vnc → novnc → chrome) and wait for cdp to respond on port 9222 before the agent connects.
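that orchestration, sketched in python rather than a shell script. the commands and ports are the usual defaults and will differ per template; the part that actually matters is polling 9222 before letting the agent connect:

```python
# assumed commands/ports for the usual stack; adjust per template.
import socket
import subprocess
import time

def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """poll until something accepts tcp connections on host:port."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1):
                return True
        except OSError:
            time.sleep(0.25)
    return False

def boot_display_stack() -> None:
    env = {"DISPLAY": ":1", "PATH": "/usr/bin:/bin"}
    # order matters: each process needs the one before it
    for cmd in (
        ["Xvfb", ":1", "-screen", "0", "1280x800x24"],
        ["startxfce4"],
        ["x11vnc", "-display", ":1", "-forever", "-shared"],
        ["websockify", "--web", "/usr/share/novnc", "6080", "localhost:5900"],
        ["chromium-browser", "--remote-debugging-port=9222", "--no-first-run"],
    ):
        subprocess.Popen(cmd, env=env)
        time.sleep(1)  # crude; a per-service readiness probe is better
    if not wait_for_port("127.0.0.1", 9222):
        raise RuntimeError("cdp never came up on 9222")
```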

apt install xvfb xfce4 x11vnc novnc chromium-browser and you're ~80% there. there are even pre-built ubuntu templates floating around with all of this preinstalled, which saves a lot of fiddling.

the Computer protocol — the small trick that buys you everything

this is the smartest 30 lines of code you'll write. without it, your agent loop has provider-specific calls baked in — vendor.start_browser(), vendor.click() — and swapping providers later means a rewrite.

so define a typing.Protocol called Computer with the methods the loop actually uses — screenshot, click, type, scroll, keypress, drag, wait, move, double_click, get_environment, get_dimensions. the loop calls those. anything that implements those methods is a valid Computer.

from typing import Protocol

class Computer(Protocol):
    async def screenshot(self) -> bytes: ...
    async def click(self, x: int, y: int) -> None: ...
    async def type(self, text: str) -> None: ...
    # ... ~8 more (scroll, keypress, drag, wait, move, double_click, ...)

VendorAClient is a Computer. RawVMChromeWrapper is a Computer. tomorrow if you want a third provider, that's another Computer. the loop doesn't know or care. and you can mock a Computer with 30 lines for a unit test.

the Computer protocol seam

this is just the strategy pattern with python's structural typing. nothing fancy. but it's the difference between "ship a new provider, swap a factory" and "rewrite the agent."
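to make the "mock a Computer in 30 lines" claim concrete — no inheritance anywhere, FakeComputer satisfies the protocol purely by having the right methods:

```python
# structural typing in action: FakeComputer never mentions Computer,
# yet type-checks (and works) anywhere a Computer is expected.
from typing import Protocol

class Computer(Protocol):
    async def screenshot(self) -> bytes: ...
    async def click(self, x: int, y: int) -> None: ...
    async def type(self, text: str) -> None: ...

class FakeComputer:
    """records every action so tests can assert on agent behavior."""
    def __init__(self) -> None:
        self.actions: list[tuple] = []
    async def screenshot(self) -> bytes:
        self.actions.append(("screenshot",))
        return b"\x89PNG fake"
    async def click(self, x: int, y: int) -> None:
        self.actions.append(("click", x, y))
    async def type(self, text: str) -> None:
        self.actions.append(("type", text))

def run_with(c: Computer) -> Computer:
    return c  # a type checker accepts FakeComputer here, structurally

run_with(FakeComputer())
```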

the stuff you actually have to write

ok so the vm is up, chrome is running, the loop is talking to it. what's still left?

auth

big one. when a user says "log into my snowflake instance and pull the audit logs" — how does the agent log in?

you don't do oauth. trying to script through "sign in with google → enter password → 2fa" with an llm is a nightmare and most enterprise saas blocks it as suspicious activity anyway.

what you do instead is capture once, replay forever. once, during auth setup, a real user logs in interactively while watching a vnc stream of your vm. when they confirm "yes i'm logged in", you call playwright's context.storage_state(), which returns a json blob of cookies + localstorage (sessionstorage isn't captured, but login sessions almost never live there). you store that json (encrypted, scoped to the org). every future task that needs that platform pulls the json and injects cookies + storage into a fresh playwright context. the user appears already logged in. no oauth, no passwords stored, no refresh token handling.

if the customer's site uses oauth (sign in with google), the user does the dance once during capture and you keep the resulting session cookies. you never see refresh tokens.
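the two halves of capture/replay, sketched. context.storage_state() and new_context(storage_state=...) are real playwright apis; the encryption and blob storage are left out as stand-ins:

```python
# capture runs once during interactive auth setup; restore runs at the
# start of every task. encrypt/store are intentionally omitted here.
import json

async def capture_session(context) -> bytes:
    # call after the user confirms "yes i'm logged in" over vnc
    state = await context.storage_state()  # cookies + localstorage snapshot
    return json.dumps(state).encode()      # -> encrypt, store scoped to org

async def restore_session(browser, blob: bytes):
    # inject into a fresh context; the site sees an already-logged-in user
    return await browser.new_context(storage_state=json.loads(blob))
```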

why not use a vendor's "auth as a service"? because vendor auth blobs are opaque, in their proprietary format, and lock you in. storage_state is playwright's native primitive — works on any vm, anywhere playwright runs. when you swap providers, the auth blobs come with you untouched. when a vendor offers a fancy feature wrapping a primitive you already know, take the primitive.

proxies

datacenter ips get flagged by anti-bot wafs instantly. cloudflare, akamai, datadome — they all maintain ip reputation lists and your aws/gcp/e2b egress ip is on every one of them. the fix is routing egress through cleaner ips: residential or isp proxies for anything behind serious anti-bot, plain datacenter egress for everything else.

most teams end up with a tri-state mode: off (default datacenter ip), on (rotate through your pool), forced (always use a specific clean ip for sensitive flows). pre-install tinyproxy or similar in your template and pass --proxy-server=... when chrome launches.

recording

your customers will ask "show me what the agent did" and you'll need video + network logs. neither comes free.

screen capture: ffmpeg reading the x11 framebuffer → mp4. one ffmpeg process per task, kicked off when the loop starts.

network capture: tcpdump with a bpf filter that excludes port 9222 (so you don't record your own cdp traffic) → har conversion → zstd compression.

both upload to s3 during cleanup. make them non-fatal — if ffmpeg crashes mid-task, log a warning, don't kill the agent. observability shouldn't be on the critical path.

custom vm template

stock templates are fine but cold start sucks (~30s to boot xvfb + xfce + x11vnc + novnc + chrome). the trick is to pre-start the entire stack at build time and snapshot the running state. when a sandbox boots from the snapshot, everything is already running. cold start drops to ~7s.

bonus optimization: pre-warm chrome's profile during build. run a headless chrome at build time loading google.com, sleep 8s so all the sqlite databases get created (cookies, history, web data, preferences), then kill it. saves 15–20s per cold start because chrome doesn't need to bootstrap profile dbs at runtime.
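the build-time pre-warm, sketched — binary name, profile path, and the 8s sleep are assumptions that vary per image, and this runs during template build only, never at task time:

```python
# assumed binary/flags; runs once at template build time.
import subprocess
import time

def prewarm_cmd(profile_dir: str) -> list[str]:
    return [
        "chromium-browser", "--headless=new", "--no-first-run",
        f"--user-data-dir={profile_dir}",
        "https://google.com",
    ]

def prewarm_profile(profile_dir: str = "/home/user/.config/chromium") -> None:
    p = subprocess.Popen(prewarm_cmd(profile_dir))
    time.sleep(8)      # let chrome create its sqlite dbs (cookies, history, ...)
    p.terminate()
    p.wait(timeout=10)
    # the warmed profile dir now ships inside the snapshot,
    # so tasks skip the bootstrap entirely
```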

bot detection — the day-1 problem you won't solve

honestly, layered mitigations are the best most teams do: real non-headless chrome on a real (virtual) display, a pre-warmed profile, clean egress ips, and a graceful bail-out when a captcha shows up anyway.

you're not going to beat cloudflare's bot detection. nobody does, including the bot-detection vendors who claim to. plan around it — surface the captcha bail to the customer, let them decide whether to retry from a residential ip or hand off to a human.

how do you even test this

honest answer: not as well as a textbook would tell you to. unit tests for an agent are theatre. mock the screenshot, mock claude's response, mock the click — congrats, you've tested that python can call mock.assert_called. you haven't tested that the agent can actually log in anywhere.

the failure modes that matter are at the screenshot + llm + browser interaction layer, and you can't unit-test those — you need a real browser, a real model, a real website. so skip the theatre.

what actually catches regressions is running the real thing: a small set of smoke tasks against real sites, replayed whenever something meaningful changes, with the recordings reviewed when a run fails.

is this rigorous? not by traditional standards. but agents fail in ways traditional tests can't catch — flaky vendor pages, css that changed yesterday, a captcha the model hadn't seen before, a button label that became an icon. the only honest test is running the thing. so focus your infra on running the thing well and watching it carefully.

once you have customers, build a small internal dataset of tasks shaped like real customer flows (anonymized) and run them as a webvoyager-style nightly job with traced scoring. that's the obvious next move past the prototype phase.
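the shape of that nightly job, reduced to a sketch — `run_agent` is a stand-in for your real loop, and scoring here is just pass/fail per task:

```python
# minimal nightly eval runner: task list in, scored report out.
import asyncio
from dataclasses import dataclass

@dataclass
class EvalTask:
    name: str
    prompt: str

async def nightly(tasks: list[EvalTask], run_agent) -> dict[str, bool]:
    results: dict[str, bool] = {}
    for t in tasks:
        try:
            results[t.name] = await run_agent(t.prompt)
        except Exception:
            results[t.name] = False  # a crash is a failure, not an abort
    return results

def completion_rate(results: dict[str, bool]) -> float:
    return sum(results.values()) / max(len(results), 1)
```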

scaling — what actually breaks at volume

a cua task isn't a 50ms api call. it's a 60-second-to-30-minute browser session holding an sse stream open and a vm warm on the other end. the python web-server defaults are tuned for the opposite shape — short requests, lots of them — so you have to retune.

the things that bite you are mostly those defaults: request timeouts that kill a 20-minute session mid-task, worker counts sized for short calls, and reverse proxies that buffer or drop an idle sse stream.

cleanup is where things get spicy at scale. when a task ends — successfully or via cancellation — you have an ffmpeg process, a tcpdump process, a chrome instance, a playwright context, and a vm sandbox to tear down. in that order. if you kill() the sandbox first, the filesystem is gone and your recordings never upload.

a few rules that will save you: tear down in that order every time, get the recordings uploaded before anything touches the sandbox, make every step non-fatal so one failure can't skip the rest, and run the whole teardown in a finally block so cancellation paths hit it too.

if cold start matters — and at any real volume, it does — pre-warm the entire desktop stack at template build time. xvfb, xfce, x11vnc, novnc, chrome all running before the snapshot is taken. boot from snapshot drops cold start from ~30s to ~7s. you're paying once at build time so every task pays nothing at runtime. one of the highest-leverage optimizations in the whole stack.

the three things to take away

first, the Computer protocol. ~30 lines of structural typing. the seam that lets you swap providers and run multiple orchestrators on top of the same vm. structural typing > inheritance for plug-in surfaces.

second, storage_state over oauth. you don't do oauth; you capture sessions. portable across providers, opaque-vendor-blob-free, plays nicely with playwright. when a vendor offers a fancy feature wrapping a primitive you already know, take the primitive.

third, build-vs-rent. for a layer this load-bearing — the layer between your agent and the world — owning it pays off. the cost is the bugs you inherit (templates carry their own quirks). the benefit is total control over what gets exposed and when. on a primitive like this, control wins.

aight that's the cua story. nothing here is magic. it's a screenshot loop, a remote desktop stack from 1998, a protocol seam, and a lot of patience for bot detection. you're paid to solve a problem, not to ship the fanciest architecture.