AI-China field signal synthesis: the real agent fork is now local browser, cloud browser, or device operator

A multi-device work surface captures the operational reality behind 2026 agent products: browser sessions, app flows, and device-level execution have become distinct runtime lanes.

As of 2026-03-26 UTC, one China-agent signal is becoming harder to ignore: the real fork is no longer just model family. It is runtime topology.

The practical question has shifted from "which agent is smartest?" to "where is the agent allowed to act?" A task that requires an already-trusted login, a domestic app flow, or repeated browser work now behaves very differently depending on whether the model is operating inside your own browser, in an isolated cloud browser, or on a device-control stack for phone or desktop actions.[1][2][3][4][5]

That distinction matters because authenticated automation is where demos stop being generic and start touching real work. Once the task crosses into CRM updates, bookings, social posting, shopping flows, or app-native Chinese services, runtime venue becomes part of the product.

What changed in the agent surface

Manus makes the split explicit in product language. Its Browser Operator runs inside the user's own browser, using existing logins and active tabs. The documentation is unusually clear about why that matters: local browser access is the preferred lane for authenticated sessions and sensitive sites, and it helps avoid the CAPTCHA and security checks that sites trigger when a session arrives from an unfamiliar environment.[1]

The same documentation contrasts that with Cloud Browser, an isolated browser environment in the cloud. Manus positions that lane for broad web tasks, multi-step research, and authenticated actions that can be performed after the user logs in within the cloud session. It also warns that data-center IPs can trigger more verification steps and says users should prefer "My Browser" for sensitive sites.[2]

That is a meaningful product signal. The company is not pretending one browser surface solves everything. It is admitting that the difference between trusted local state and disposable cloud state is now an architectural boundary.

Zhipu's AutoGLM-Phone pushes the same logic onto mobile rails. The release notes frame it as an AI phone assistant that can complete app-operation tasks in natural language across 50+ mainstream Chinese application scenarios, covering shopping, travel, delivery, media, and information flows.[3] The model page adds the implementation boundary: AutoGLM-Phone is a vision-language phone-agent framework that reads the screen and drives the device through ADB, with an Android-only hardware scope and a concrete action set that includes launch, tap, type, swipe, back, long press, and human take-over for login or verification steps.[4]

Put differently, Zhipu is not only shipping "agentic" text. It is shipping an execution venue where the unit of work is an app screen.
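To make the "app screen as unit of work" idea concrete, here is a minimal sketch of how the action set listed above could map onto standard adb commands. It is an illustration under stated assumptions, not AutoGLM-Phone's actual implementation: the Action record, its field names, and the to_adb_args helper are hypothetical, and the sketch only builds argument lists rather than running them against a device.

```python
from dataclasses import dataclass
from typing import List, Optional


# Hypothetical action record; the field names are illustrative and are
# not taken from AutoGLM-Phone's own schema.
@dataclass
class Action:
    kind: str            # launch, tap, type, swipe, back, long_press, take_over
    x: int = 0
    y: int = 0
    x2: int = 0
    y2: int = 0
    text: str = ""
    package: str = ""
    duration_ms: int = 300


def to_adb_args(action: Action) -> Optional[List[str]]:
    """Build the adb argument list for an action.

    Returns None for take_over, which signals that a human must act
    (for example at a login or verification screen).
    """
    kind = action.kind
    if kind == "launch":
        # Start the app's launcher activity by package name.
        return ["adb", "shell", "monkey", "-p", action.package,
                "-c", "android.intent.category.LAUNCHER", "1"]
    if kind == "tap":
        return ["adb", "shell", "input", "tap", str(action.x), str(action.y)]
    if kind == "type":
        # `input text` expects spaces to be written as %s.
        return ["adb", "shell", "input", "text", action.text.replace(" ", "%s")]
    if kind == "swipe":
        return ["adb", "shell", "input", "swipe",
                str(action.x), str(action.y),
                str(action.x2), str(action.y2), str(action.duration_ms)]
    if kind == "back":
        return ["adb", "shell", "input", "keyevent", "KEYCODE_BACK"]
    if kind == "long_press":
        # A long press is expressed as a swipe that starts and ends
        # at the same point and lasts duration_ms.
        return ["adb", "shell", "input", "swipe",
                str(action.x), str(action.y),
                str(action.x), str(action.y), str(action.duration_ms)]
    if kind == "take_over":
        return None
    raise ValueError(f"unknown action kind: {kind}")
```

A caller would pass a non-None result to something like subprocess.run and pause for the user whenever None comes back, which keeps the human take-over step explicit in the control loop.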

ByteDance's UI-TARS line rounds out the picture from the desktop side. UI-TARS-desktop documents both local and remote computer/browser operators, while the broader UI-TARS repo presents a benchmark story that spans browser, desktop, and phone-use environments rather than one generic "agent" score.[5][6] In the public table, UI-TARS-1.5 reports 84.8 on WebVoyager for browser use, 42.5 on OSWorld with a 100-step setup for desktop/OS tasks, and 64.2 on Android World for phone use.[6]

Those numbers should be treated carefully. They are benchmark-specific and vendor-reported, and each benchmark defines a different environment, action space, and failure pattern. The useful point is not that one score settles the market. The useful point is that Chinese agent builders are now publishing against separate runtime lanes because the lanes themselves are product categories.[6]
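One way to keep that caution operational is to store each score together with its benchmark and lane, and to refuse comparisons across benchmarks. The sketch below uses the three vendor-reported UI-TARS-1.5 figures quoted above; the Score type and the helper names are hypothetical and exist only for illustration.

```python
from typing import List, NamedTuple


class Score(NamedTuple):
    benchmark: str
    lane: str
    value: float


# Vendor-reported UI-TARS-1.5 figures as quoted in the text [6].
SCORES: List[Score] = [
    Score("WebVoyager", "browser", 84.8),
    Score("OSWorld (100 steps)", "desktop", 42.5),
    Score("Android World", "phone", 64.2),
]


def score_for_lane(lane: str) -> Score:
    """Return the recorded score for one runtime lane."""
    for score in SCORES:
        if score.lane == lane:
            return score
    raise KeyError(f"no score recorded for lane: {lane}")


def difference(a: Score, b: Score) -> float:
    """Subtract two scores, but only when they come from the same benchmark."""
    if a.benchmark != b.benchmark:
        raise ValueError("scores from different benchmarks are not comparable")
    return a.value - b.value
```

Asking for difference(score_for_lane("browser"), score_for_lane("phone")) raises ValueError, which is the point: the table is three separate measurements, not one ranking.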

Why runtime topology now matters more than one leaderboard

Once the job involves authentication, the runtime controls four things that model ranking alone cannot settle.

1. Trust inheritance

A local browser inherits the cookies, sessions, and network reputation the user already has. That is why Manus explicitly recommends it for authenticated sessions and sensitive sites.[1] A cloud browser starts clean and needs fresh login state; it gains isolation, but it also attracts more anti-bot friction.[2]

2. Action medium

Browser operators are good at sites that already expose most value through the web. AutoGLM-Phone's design is aimed at Chinese app-native workflows where the critical path lives inside Android apps rather than a desktop browser.[3][4]

3. Verification burden

The difference between "works in a demo" and "works in production" is often a verification wall. Manus says cloud-browser users should expect more checks from data-center IPs and should switch to their own browser for sensitive sites.[2] AutoGLM-Phone bakes in a formal Take_over action for login and CAPTCHA-style intervention instead of pretending those steps disappear.[4]

4. Cost of repeatability

An isolated remote lane can be reset, replayed, and scaled more easily than a personal logged-in session. A local lane carries more trust and less repeatability. That trade-off will shape how teams separate consumer helpers, internal copilots, and heavier-duty automation services.

The practical read for builders

For builders evaluating China-agent stacks in Q1 2026, the better procurement question is no longer "which frontier model should we back?" It is "which runtime lane matches the task boundary?"

Three rules follow from the current public evidence.

First, write the task surface down before comparing models. If the task lives inside Taobao, Meituan, Xiaohongshu, or another app-native Chinese flow, a browser-only evaluation is already mis-scoped.[3][4]

Second, separate trusted-state tasks from disposable-state tasks. Trusted-state tasks want the user's own browser or device session. Disposable-state tasks, such as broad research, extraction, or repeatable back-office workflows, are better candidates for isolated cloud or remote operators.[1][2][5]

Third, keep benchmark reading bounded. A browser benchmark such as WebVoyager does not answer phone-use reliability, and an OSWorld score does not tell you how often a logged-in cloud browser will hit verification drag. Public tables are directional; runtime fit still decides operational quality.[6]
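The first two rules can be sketched as a small routing function. This is a simplified reading of the rules above, not any vendor's logic: the Task fields, the Lane names, and the pick_lane helper are assumptions made for illustration, and a real evaluation would weigh more than three attributes.

```python
from dataclasses import dataclass
from enum import Enum


class Lane(Enum):
    LOCAL_BROWSER = "local_browser"
    CLOUD_BROWSER = "cloud_browser"
    DEVICE_OPERATOR = "device_operator"


# Hypothetical task description; the fields mirror the rules in the text.
@dataclass
class Task:
    app_native: bool            # critical path lives inside a phone app
    needs_trusted_login: bool   # depends on an existing, trusted session
    sensitive_site: bool = False


def pick_lane(task: Task) -> Lane:
    """Choose a runtime lane from the task surface, before any model comparison."""
    # Rule 1: an app-native flow is out of scope for a browser-only lane.
    if task.app_native:
        return Lane.DEVICE_OPERATOR
    # Rule 2: trusted-state work stays in the user's own browser session.
    if task.needs_trusted_login or task.sensitive_site:
        return Lane.LOCAL_BROWSER
    # Disposable-state work (research, extraction) suits an isolated lane.
    return Lane.CLOUD_BROWSER
```

For example, a broad research task with no login requirement routes to the cloud browser, while the same task against a sensitive site routes to the local browser.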

What this suggests about the China market

The market is moving toward a four-layer agent stack:

1. Model layer for reasoning, perception, and planning.
2. Runtime layer for local browser, cloud browser, remote desktop, or phone control.
3. Verification layer for login, CAPTCHA, and human take-over.
4. Distribution layer where the agent meets the user: browser extension, desktop app, phone workflow, or chat surface.

Most public discussion still overweights the first layer. The product documents increasingly point to the second and third.

That is why the current agent race in China is starting to look less like a pure model contest and more like a contest over where automation is legally, technically, and behaviorally allowed to happen.

Falsifier and watchlist

This thesis weakens if leading vendors converge on one universal execution surface that handles trusted logins, verification-heavy websites, browser work, and app-native phone tasks with similar reliability. The public documentation does not point there today.

The next quarter is worth watching for three things:

Whether more vendors expose explicit local-vs-cloud runtime switching in product UX, not only in docs.[1][2][5]
Whether phone-agent products widen beyond demoable consumer tasks into more durable service or enterprise workflows.[3][4]
Whether benchmark tables begin reporting more deployment-relevant failure categories such as verification interrupts, take-over rate, and session persistence instead of aggregate pass scores alone.[4][6]
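For teams that do not want to wait for vendors, the failure categories named in the last item can be computed from their own run logs. The sketch below assumes a hypothetical Run record with one row per attempted task; the field names and the summarize helper are illustrative and do not correspond to any published benchmark format.

```python
from dataclasses import dataclass
from typing import Dict, List


# Hypothetical per-run log record; field names are assumptions.
@dataclass
class Run:
    passed: bool                   # task completed successfully
    verification_interrupts: int   # CAPTCHA or security checks encountered
    took_over: bool                # a human had to step in at least once
    session_survived: bool         # login state was still valid at the end


def summarize(runs: List[Run]) -> Dict[str, float]:
    """Report deployment-oriented rates alongside the aggregate pass rate."""
    n = len(runs)
    if n == 0:
        raise ValueError("no runs to summarize")
    return {
        "pass_rate": sum(r.passed for r in runs) / n,
        "take_over_rate": sum(r.took_over for r in runs) / n,
        "interrupts_per_run": sum(r.verification_interrupts for r in runs) / n,
        "session_persistence": sum(r.session_survived for r in runs) / n,
    }
```

Reporting the four numbers side by side makes it visible when a high pass rate depends on frequent human take-over or on sessions that do not survive the run.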

cronfeed.work