As of 2026-03-26 UTC, one signal in China's agent market is becoming harder to ignore: the real fork is no longer only the model family. It is runtime topology, meaning where the agent actually executes.
The practical question has shifted from "which agent is smartest?" to "where is the agent allowed to act?" A task that requires an already-trusted login, a domestic app flow, or repeated browser work now behaves very differently depending on whether the model is operating inside your own browser, in an isolated cloud browser, or on a device-control stack for phone or desktop actions.[1][2][3][4][5]
That distinction matters because authenticated automation is where demos stop being generic and start touching real work. Once the task crosses into CRM updates, bookings, social posting, shopping flows, or app-native Chinese services, runtime venue becomes part of the product.
What changed in the agent surface
Manus makes the split explicit in product language. Its Browser Operator runs inside the user's own browser, using existing logins and active tabs. The documentation is unusually clear about why that matters: local browser access is the preferred lane for authenticated sessions and sensitive sites, and it helps avoid CAPTCHA and security checks that appear when an unfamiliar environment shows up.[1]
The same docs set that against Cloud Browser, an isolated browser environment in the cloud. Manus positions that lane for broad web tasks, multi-step research, and authenticated actions that can be performed after the user logs in within the cloud session. It also warns that data-center IPs can trigger more verification steps and says users should prefer "My Browser" for sensitive sites.[2]
That is a meaningful product signal. The company is not pretending one browser surface solves everything. It is admitting that the difference between trusted local state and disposable cloud state is now an architectural boundary.
Zhipu's AutoGLM-Phone pushes the same logic onto mobile rails. The release notes frame it as an AI phone assistant that can complete app-operation tasks in natural language across 50+ mainstream Chinese application scenarios, covering shopping, travel, delivery, media, and information flows.[3] The model page adds the implementation boundary: AutoGLM-Phone is a vision-language phone-agent framework that reads the screen and drives the device through ADB, with an Android-only hardware scope and a concrete action set that includes launch, tap, type, swipe, back, long press, and human take-over for login or verification steps.[4]
Put differently, Zhipu is not only shipping "agentic" text. It is shipping an execution venue where the unit of work is an app screen.
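To make the "app screen as unit of work" idea concrete, here is a minimal sketch of how the documented action set could map onto standard adb commands. It is an illustration under stated assumptions, not AutoGLM-Phone's actual implementation: the Action class, the kind names, the to_adb_args helper, and the 300 ms and 1000 ms durations are all hypothetical. Only the adb shell input and monkey invocations are standard Android tooling.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[int, int]


@dataclass(frozen=True)
class Action:
    """One step in a phone-agent plan. Field and kind names are hypothetical."""
    kind: str  # launch | tap | type | swipe | back | long_press | take_over
    point: Optional[Point] = None   # start coordinate for tap, swipe, long_press
    end: Optional[Point] = None     # end coordinate for swipe
    text: Optional[str] = None      # payload for type
    package: Optional[str] = None   # Android package name for launch


def to_adb_args(action: Action) -> Optional[List[str]]:
    """Return the adb argument list for an action, or None when a human must take over."""
    shell_input = ["adb", "shell", "input"]
    if action.kind == "launch":
        # monkey with a LAUNCHER category and an event count of 1 starts the app
        return ["adb", "shell", "monkey", "-p", str(action.package),
                "-c", "android.intent.category.LAUNCHER", "1"]
    if action.kind == "tap":
        x, y = action.point
        return shell_input + ["tap", str(x), str(y)]
    if action.kind == "type":
        # `input text` treats %s as a space
        return shell_input + ["text", action.text.replace(" ", "%s")]
    if action.kind == "swipe":
        (x1, y1), (x2, y2) = action.point, action.end
        return shell_input + ["swipe", str(x1), str(y1), str(x2), str(y2), "300"]
    if action.kind == "back":
        return shell_input + ["keyevent", "KEYCODE_BACK"]
    if action.kind == "long_press":
        # A swipe that starts and ends at the same point, held for 1000 ms
        x, y = action.point
        return shell_input + ["swipe", str(x), str(y), str(x), str(y), "1000"]
    if action.kind == "take_over":
        # Login or verification step: no device command, hand control to the user
        return None
    raise ValueError(f"unknown action kind: {action.kind}")
```

The function only builds argument lists. Running them would require adb on the host and a connected Android device, for example by passing a non-None result to subprocess.run. Returning None for take_over keeps the human hand-off visible in the control flow rather than hiding it in an exception path.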
ByteDance's UI-TARS line rounds out the picture from the desktop side. UI-TARS-desktop documents both local and remote computer/browser operators, while the broader UI-TARS repo presents a benchmark story that spans browser, desktop, and phone-use environments rather than one generic "agent" score.[5][6] In the public table, UI-TARS-1.5 reports 84.8 on WebVoyager for browser use, 42.5 on OSWorld with a 100-step setup for desktop/OS tasks, and 64.2 on Android World for phone use.[6]
Those numbers should be treated carefully. They are benchmark-specific and vendor-reported, and each benchmark defines a different environment, action space, and failure pattern. The useful point is not that one score settles the market. The useful point is that Chinese agent builders are now publishing against separate runtime lanes because the lanes themselves are product categories.[6]
Why runtime topology now matters more than one leaderboard
Once the job involves authentication, the runtime controls four things that model ranking alone cannot settle.
1. Trust inheritance
A local browser inherits the cookies, sessions, and network reputation the user already has. That is why Manus explicitly recommends it for authenticated sessions and sensitive sites.[1] A cloud browser starts clean and needs fresh login state; it gains isolation, but it also attracts more anti-bot friction.[2]
2. Action medium
Browser operators suit services that already expose most of their value through the web. AutoGLM-Phone's design targets Chinese app-native workflows where the critical path lives inside Android apps rather than a desktop browser.[3][4]
3. Verification burden
The difference between "works in a demo" and "works in production" is often a verification wall. Manus says cloud-browser users should expect more checks from data-center IPs and should switch to their own browser for sensitive sites.[2] AutoGLM-Phone bakes in a formal Take_over action for login and CAPTCHA-style intervention instead of pretending those steps disappear.[4]
4. Cost of repeatability
An isolated remote lane can be reset, replayed, and scaled more easily than a personal logged-in session. A local lane carries more trust and less repeatability. That trade-off will shape how teams separate consumer helpers, internal copilots, and heavier-duty automation services.
The practical read for builders
For builders evaluating China-agent stacks in Q1 2026, the better procurement question is no longer "which frontier model should we back?" It is "which runtime lane matches the task boundary?"
Three rules follow from the current public evidence.
First, write the task surface down before comparing models. If the task lives inside Taobao, Meituan, Xiaohongshu, or another app-native Chinese flow, a browser-only evaluation is already mis-scoped.[3][4]
Second, separate trusted-state tasks from disposable-state tasks. Trusted-state tasks want the user's own browser or device session. Disposable-state tasks, such as broad research, extraction, or repeatable back-office workflows, are better candidates for isolated cloud or remote operators.[1][2][5]
Third, keep benchmark reading bounded. A browser benchmark such as WebVoyager does not answer phone-use reliability, and an OSWorld score does not tell you how often a logged-in cloud browser will hit verification drag. Public tables are directional; runtime fit still decides operational quality.[6]
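The first two rules can be sketched as a small routing function. This is an illustrative simplification, not any vendor's logic: the Lane and Task types, their field names, and the pick_lane helper are hypothetical, and a real evaluation would weigh more than two booleans.

```python
from dataclasses import dataclass
from enum import Enum


class Lane(Enum):
    LOCAL_BROWSER = "local_browser"   # user's own browser, existing logins
    CLOUD_BROWSER = "cloud_browser"   # isolated, resettable, data-center IP
    PHONE_DEVICE = "phone_device"     # device-control stack for app-native flows


@dataclass(frozen=True)
class Task:
    name: str
    app_native: bool            # critical path lives inside a mobile app
    needs_trusted_state: bool   # depends on an existing login or a sensitive site


def pick_lane(task: Task) -> Lane:
    """Apply rule one (task surface) before rule two (trusted vs disposable state)."""
    if task.app_native:
        return Lane.PHONE_DEVICE
    if task.needs_trusted_state:
        return Lane.LOCAL_BROWSER
    return Lane.CLOUD_BROWSER


if __name__ == "__main__":
    for task in (
        Task("food delivery order", app_native=True, needs_trusted_state=True),
        Task("CRM update", app_native=False, needs_trusted_state=True),
        Task("broad web research", app_native=False, needs_trusted_state=False),
    ):
        print(f"{task.name}: {pick_lane(task).value}")
```

Checking the task surface first reflects the point that a browser-only evaluation is mis-scoped for app-native flows, regardless of how the login state is handled.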
What this suggests about the China market
The market is moving toward a four-layer agent stack:
- Model layer for reasoning, perception, and planning.
- Runtime layer for local browser, cloud browser, remote desktop, or phone control.
- Verification layer for login, CAPTCHA, and human take-over.
- Distribution layer where the agent meets the user: browser extension, desktop app, phone workflow, or chat surface.
Most public discussion still overweights the first layer. The product documents increasingly point to the second and third.
That is why the current agent race in China is starting to look less like a pure model contest and more like a contest over where automation is legally, technically, and behaviorally allowed to happen.
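One way to use the four-layer framing in practice is as a checklist when describing a product. The sketch below is hypothetical throughout: the AgentStack type, its field names, the missing_layers helper, and the example values are invented for illustration and describe no real vendor.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentStack:
    """A product described by the four layers; all names here are illustrative."""
    model: str = ""                                          # reasoning, perception, planning
    runtimes: List[str] = field(default_factory=list)        # local browser, cloud browser, phone...
    verification: List[str] = field(default_factory=list)    # login, CAPTCHA, human take-over
    distribution: List[str] = field(default_factory=list)    # extension, desktop app, chat surface


def missing_layers(stack: AgentStack) -> List[str]:
    """List the layers a product description leaves empty."""
    layers = {
        "model": stack.model,
        "runtime": stack.runtimes,
        "verification": stack.verification,
        "distribution": stack.distribution,
    }
    return [name for name, value in layers.items() if not value]


if __name__ == "__main__":
    # Invented example: a model-only pitch with no stated verification path
    pitch = AgentStack(model="example-vlm", runtimes=["cloud_browser"],
                       distribution=["chat"])
    print(missing_layers(pitch))
```

A description that fills in only the model layer is the pattern the preceding paragraphs warn about: it leaves open where the agent runs and what happens at a verification wall.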
Falsifier and watchlist
This thesis weakens if leading vendors converge on one universal execution surface that handles trusted logins, verification-heavy websites, browser work, and app-native phone tasks with similar reliability. The public documentation does not point there today.
The next quarter is worth watching for three things:
- Whether more vendors expose explicit local-vs-cloud runtime switching in product UX, not only in docs.[1][2][5]
- Whether phone-agent products widen beyond demoable consumer tasks into more durable service or enterprise workflows.[3][4]
- Whether benchmark tables begin reporting more deployment-relevant failure categories such as verification interrupts, take-over rate, and session persistence instead of aggregate pass scores alone.[4][6]
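The deployment-relevant categories in the last item are straightforward to compute if run logs record them. The following is a minimal sketch under assumptions: the RunRecord fields and the summarize helper are hypothetical, no published benchmark is known to use this schema, and the sample records in the usage note are invented.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RunRecord:
    """One agent run. Field names are hypothetical, not a benchmark schema."""
    passed: bool                   # task completed successfully
    verification_interrupts: int   # CAPTCHA or login challenges encountered
    took_over: bool                # a human had to step in at least once
    session_survived: bool         # login state was still valid at the end


def summarize(runs: List[RunRecord]) -> Dict[str, float]:
    """Report aggregate pass rate alongside the failure categories it hides."""
    n = len(runs)
    if n == 0:
        raise ValueError("no runs to summarize")
    return {
        "pass_rate": sum(r.passed for r in runs) / n,
        "take_over_rate": sum(r.took_over for r in runs) / n,
        "interrupts_per_run": sum(r.verification_interrupts for r in runs) / n,
        "session_persistence": sum(r.session_survived for r in runs) / n,
    }
```

As a usage example with invented data, four runs of which three pass, two need a human take-over, and four verification interrupts occur in total would report a pass rate of 0.75 next to a take-over rate of 0.5 and 1.0 interrupts per run. The aggregate score alone would hide how much human intervention those passes required.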
Sources
- [1] Manus Documentation, "Browser Operator" (local browser lane, existing sessions, local-vs-cloud comparison).
- [2] Manus Documentation, "Cloud browser" (isolated cloud browser, authenticated actions, data-center IP considerations).
- [3] Zhipu AI Open Documentation, "New Releases" (AutoGLM-Phone release timeline and description of 50+ Chinese application scenarios).
- [4] Zhipu AI Open Documentation, "AutoGLM-Phone" (VLM + ADB device control, Android-only scope, action set, and Take_over mechanism).
- [5] ByteDance, "UI-TARS-desktop Quick Start" (local and remote computer/browser operator packaging).
- [6] ByteDance, "UI-TARS" repository README (WebVoyager, OSWorld 100-step, and Android World benchmark table).