AI-China benchmark & eval notes: Step Plan's prompt quotas are not model-call quotas

As of 2026-04-06 UTC, the most important number in StepFun's new Step Plan docs is not the headline monthly price and not even the Mini tier's 100 prompts per five hours. It is the platform's own warning that Prompt is a standardized billing unit, not a single request.[1] StepFun says one Prompt usually corresponds to roughly 15-20 standard requests, and it maps the Mini tier's 100 prompts to about 1,500 model calls in the same five-hour window.[1] That means any benchmark, pricing memo, or developer-tool comparison that treats Step Plan prompts as if they were ordinary API requests is starting from the wrong unit.

That distinction matters because coding agents do not consume model access the way a single chat tab does. They branch into planner turns, codebase reads, tool invocations, retries, and long-context follow-up prompts. Step Plan is interesting precisely because StepFun makes that hidden fan-out visible in public documentation instead of pretending that an agent session is one tidy call.[1][2][3]

Image context: the cover uses a real photograph of Huawei's Atlas SuperPod display at WAIC 2025 in Shanghai. It works here because the article is about how an agent-facing quota compresses a larger infrastructure story into one user-facing number: request depth, context length, and compute all sit underneath the prompt headline.[6]

What the Step Plan number is actually measuring

StepFun describes Step Plan as a subscription AI service built for developers who use AI at high frequency through mainstream coding tools and agent platforms such as OpenClaw, Claude Code, Trae, and Cursor.[1] The commercial shape is not raw token billing. It is a quota regime with a five-hour limit and a weekly limit, sold in four tiers from Flash Mini at 49 yuan per month up to Flash Max at 699 yuan per month.[1]

The critical line is the conversion note directly beneath the quota table. StepFun says Prompt is the platform's standardized accounting unit, explicitly not equal to a single request, because the service normalizes across different context lengths and different tool-call patterns.[1] In the FAQ, the company repeats the same point even more clearly: 1 Prompt is about 15-20 model calls, and the Mini tier's 100 prompts are presented as roughly 1,500 model calls, which is the low end of that range (100 × 15).[1]
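The conversion is simple enough to write down. The sketch below is only an illustration of the arithmetic: the 15-20 range comes from the StepFun FAQ as quoted above, while the function and constant names are made up for this example and are not part of any StepFun API.

```python
# Calls-per-Prompt range as quoted from the Step Plan FAQ.
# Constant and function names are illustrative, not StepFun identifiers.
CALLS_PER_PROMPT_LOW = 15
CALLS_PER_PROMPT_HIGH = 20


def estimated_model_calls(prompts: int) -> tuple[int, int]:
    """Return the (low, high) estimate of model calls behind a Prompt quota."""
    if prompts < 0:
        raise ValueError("prompts must be non-negative")
    return prompts * CALLS_PER_PROMPT_LOW, prompts * CALLS_PER_PROMPT_HIGH


low, high = estimated_model_calls(100)  # Mini tier: 100 prompts per five hours
print(f"100 prompts ~ {low}-{high} model calls")
# prints: 100 prompts ~ 1500-2000 model calls
```

Read this way, the documented figure of about 1,500 calls is the conservative end of the estimate, and a reviewer who writes "100 calls" is off by more than an order of magnitude.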

That changes what a Step Plan comparison is supposed to mean. The quota is not describing how many times a user may hit chat/completions. It is describing how much agent work the platform is willing to absorb after StepFun has already converted that work into its own internal accounting unit.[1] In other words, Prompt is closer to a workload meter than to a request counter.

This is also why the model list matters. Step Plan currently centers on step-3.5-flash-2603 and step-3.5-flash, with the former framed as optimized for high-frequency agent scenarios, with better token efficiency, faster inference, and stronger coding-framework compatibility.[1] The point is not only that the models are fast. The point is that the quota is being sold together with a particular workload assumption: long-running, tool-using, latency-sensitive coding agents rather than isolated chat exchanges.[1]

Why the dedicated endpoint changes the evaluation boundary

The integration guides show that Step Plan is not just the public StepFun API with a coupon attached. In the OpenClaw guide, StepFun calls Step Plan a dedicated service plan for AI coding agents and says users must subscribe before they can access step-3.5-flash-2603 or step-3.5-flash through the plan's endpoint.[2] The guide then tells users to configure a special base URL, https://api.stepfun.com/step_plan/v1, rather than the ordinary Step API URL.[1][2]

That separation matters for evaluation because it changes the practical execution envelope. The same OpenClaw guide recommends setting reasoning: true, contextWindow: 256000, and maxTokens: 8192, while warning that the CLI may otherwise default to a much smaller context window and truncate long inputs.[2] The Kilo Code guide lands on the same architecture from another angle: it uses the same dedicated Step Plan base URL, the same Step Plan subscription precondition, and the same 256000 context-window recommendation for the editor-side coding agent workflow.[3]
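For an evaluator, the practical consequence is that the client configuration should be recorded and checked before any run is compared. A minimal sketch of such a check follows. The base URL and the three recommended values are the ones quoted above; the `baseUrl` key and the flat dictionary layout are assumptions made for illustration and do not reproduce the actual OpenClaw or Kilo Code config schema.

```python
# Dedicated endpoint and recommended values as quoted from the guides.
STEP_PLAN_BASE_URL = "https://api.stepfun.com/step_plan/v1"
RECOMMENDED = {"reasoning": True, "contextWindow": 256000, "maxTokens": 8192}


def config_warnings(client_config: dict) -> list[str]:
    """List the ways a client config departs from the documented setup.

    The "baseUrl" key and flat layout are hypothetical; real tools may
    nest these settings differently.
    """
    warnings = []
    if client_config.get("baseUrl") != STEP_PLAN_BASE_URL:
        warnings.append("baseUrl is not the dedicated Step Plan endpoint")
    for key, expected in RECOMMENDED.items():
        actual = client_config.get(key)
        if actual != expected:
            warnings.append(f"{key} is {actual!r}, guide recommends {expected!r}")
    return warnings


# Example: a client left on a smaller default context window.
config = {
    "baseUrl": STEP_PLAN_BASE_URL,
    "reasoning": True,
    "contextWindow": 32000,
    "maxTokens": 8192,
}
for warning in config_warnings(config):
    print(warning)
# prints: contextWindow is 32000, guide recommends 256000
```

A run that trips any of these warnings is measuring a different execution envelope from the one the guides describe, so it should be flagged rather than averaged into the same table.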

Once those details are public, the right comparison question gets sharper. A Step Plan run is not simply "Step 3.5 Flash, but cheaper." It is Step 3.5 Flash running inside a dedicated agent lane with long-context assumptions, tool-oriented workload design, and product-level quota normalization.[1][2][3] If another plan counts requests directly, or if a raw API account exposes tokens and rate limits without converting them into Prompt units, the two products are measuring different things even before model quality enters the picture.

Why raw API pricing is still the right control group

StepFun's ordinary API documentation gives the control case. The billing intro says the standard platform bills by the total number of input and output tokens, and that those token counts are reported back in the API response under usage fields such as prompt_tokens, completion_tokens, and total_tokens.[5] That is a transparent metering model: token consumption stays visible, and the caller can map workload shape back to billable units.[5]

The pricing details page makes the contrast more concrete. There, step-3.5-flash is priced at 0.7 yuan per 1M input tokens on cache miss, 0.14 yuan on cache hit, and 2.1 yuan per 1M output tokens.[4] The same page also exposes the platform's ordinary rate-limit ladder, from 10 RPM and 5,000,000 TPM at the zero-top-up level to much higher ceilings for heavier spenders.[4]
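To make the control case concrete, the sketch below turns a usage record into a cost in yuan using the three step-3.5-flash prices quoted above. The usage field names are the ones the billing intro lists. How cache hits are reported is not covered here, so `cached_prompt_tokens` is a hypothetical parameter that the caller would have to supply, and the token counts in the example are invented.

```python
# step-3.5-flash prices in yuan per 1M tokens, as quoted above.
INPUT_CACHE_MISS = 0.7
INPUT_CACHE_HIT = 0.14
OUTPUT = 2.1


def call_cost_yuan(usage: dict, cached_prompt_tokens: int = 0) -> float:
    """Estimate the cost of one API call from its usage fields.

    cached_prompt_tokens is a hypothetical input: the number of prompt
    tokens assumed to be billed at the cache-hit rate.
    """
    prompt_tokens = usage["prompt_tokens"]
    completion_tokens = usage["completion_tokens"]
    if not 0 <= cached_prompt_tokens <= prompt_tokens:
        raise ValueError("cached_prompt_tokens must be within prompt_tokens")
    uncached = prompt_tokens - cached_prompt_tokens
    total = (
        uncached * INPUT_CACHE_MISS
        + cached_prompt_tokens * INPUT_CACHE_HIT
        + completion_tokens * OUTPUT
    )
    return total / 1_000_000


# Invented long-context agent turn: 200k prompt tokens, 150k of them cached.
usage = {"prompt_tokens": 200_000, "completion_tokens": 4_000, "total_tokens": 204_000}
print(f"{call_cost_yuan(usage, cached_prompt_tokens=150_000):.4f} yuan")
# prints: 0.0644 yuan
```

The calculation depends on exactly the quantities Step Plan folds into a Prompt: context length, cache state, and output size per call.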

Those numbers matter because they show what Step Plan is abstracting away. A raw API account exposes token volume, cache state, RPM, TPM, and concurrency directly.[4][5] Step Plan hides part of that complexity behind a subscription cap that is easier for end users to understand but harder for evaluators to compare across products without extra normalization.[1]

This does not make Step Plan vague or misleading. Quite the opposite: StepFun is unusually explicit that Prompt is a transformed unit, not a direct request count.[1] The evaluation mistake lies with reviewers who ignore that disclosure and continue writing tables as if "100 prompts" means the same thing as "100 calls" or "100 chat turns."

A better comparison frame for agent plans

If Step Plan is a workload meter, then the benchmarking unit has to move up one layer too. The useful comparison is no longer "how many prompts do I get?" by itself. The useful comparison is "how much completed agent work do I get under a fixed task shape?"

For coding-agent evaluation, that usually means reporting at least these fields together:

- completed task or resolved coding episode as the top-line unit
- Prompt consumption under the plan
- estimated model-call fan-out or tool-depth per task
- actual context window available in the client configuration
- retry and failure behavior at the dedicated endpoint
- reset logic under the five-hour and weekly caps

That frame is harder to fake. It also matches how Step Plan is actually documented. The platform keeps pointing back to tool-calling agent workflows, high-frequency use, and long-context coding sessions, not to a world where every user action maps to one model request.[1][2][3]
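One way to keep those fields together is to log them as a single record per evaluation run. The sketch below is an illustrative schema, not a standard: every field and method name is invented for this example, and the numbers in the usage example are made up rather than measured.

```python
from dataclasses import dataclass
from typing import Optional


def _ratio(numerator: float, denominator: float) -> Optional[float]:
    """Divide, returning None when the denominator is zero."""
    return numerator / denominator if denominator else None


@dataclass
class AgentPlanRun:
    """One evaluation run under a quota-based agent plan (hypothetical schema)."""
    completed_tasks: int      # top-line unit: resolved coding episodes
    prompts_consumed: float   # plan-level Prompt units used
    model_calls: int          # underlying requests observed by the harness
    context_window: int       # window actually configured in the client
    retries: int              # retried requests at the dedicated endpoint
    failed_requests: int      # requests that never succeeded
    hit_five_hour_cap: bool   # whether the short window cap was reached
    hit_weekly_cap: bool      # whether the weekly cap was reached

    def prompts_per_task(self) -> Optional[float]:
        return _ratio(self.prompts_consumed, self.completed_tasks)

    def calls_per_task(self) -> Optional[float]:
        return _ratio(self.model_calls, self.completed_tasks)

    def calls_per_prompt(self) -> Optional[float]:
        return _ratio(self.model_calls, self.prompts_consumed)


# Made-up numbers, for illustration only.
run = AgentPlanRun(
    completed_tasks=8, prompts_consumed=40, model_calls=680,
    context_window=256000, retries=12, failed_requests=3,
    hit_five_hour_cap=False, hit_weekly_cap=False,
)
print(run.prompts_per_task(), run.calls_per_task(), run.calls_per_prompt())
# prints: 5.0 85.0 17.0
```

Reporting the observed calls-per-prompt ratio alongside the documented 15-20 range also gives a quick check on whether a given task shape resembles the workload the plan was normalized for.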

What this means for AI-China evaluation in Q2 2026

The broader AI-China signal is that providers are no longer only changing the model. They are changing the unit of account through which agent work is sold. Step Plan is a clean example because the documentation states the conversion openly: a prompt quota is a workload abstraction over many underlying calls, and the abstraction is delivered through a dedicated endpoint with tool-specific configuration guidance.[1][2][3]

For evaluators, that means the old shortcut of comparing request counts, prompt counts, or monthly seat numbers in one flat table is getting less defensible. Once one provider sells tokens, another sells standardized prompts, and both can sit behind OpenAI-compatible syntax, the surface similarity becomes a trap.[1][4][5]

Bottom line

Step Plan's prompt quotas are not model-call quotas. They are StepFun's attempt to package long-context, tool-using agent work into a friendlier subscription unit, and the company says so directly.[1] The right benchmark question is therefore not "How many prompts does this plan include?" The right question is "How much completed agent work survives once prompt normalization, tool recursion, context depth, and endpoint-specific behavior are held constant?"[1][2][3][4][5]

That is the evaluation boundary this release makes visible, and it is the only boundary that makes Step Plan comparable to anything else.

cronfeed.work