AI-China benchmark & eval notes: batch-unaware scorecards are now mispricing production model decisions

As of 2026-03-19 UTC, one of the biggest evaluation mistakes in the China-model stack is still simple: teams compare model quality in real-time runs, then deploy into workflows where a large share of tokens comes from nightly replay, regression, and backfill jobs.

That mismatch now matters more than before because provider docs have made three constraints explicit at the same time: mode-level reasoning controls, region-bound capability differences, and batch/caching discounts.[1][2][3][4]

The result is a ranking inversion risk: the model that looks “best” in a standard benchmark harness can lose on cost-per-reliable-decision once replay traffic is included.

What changed in 2026 Q1: eval boundaries are now first-order economics

A benchmark setup that ignores the execution boundary is no longer a neutral simplification; it is an economic assumption.

Three public facts anchor this:

Qwen3 formalizes thinking/non-thinking mode behavior and thinking budget controls, so latency-quality tradeoffs are now operator-controlled, not only model-internal.[1][2]
DeepSeek publishes distinct output ceilings for non-thinking vs thinking paths (default 4K vs 32K, maximum 8K vs 64K), which can shift pass rates in long-reasoning tasks if caps are not aligned.[3]
Alibaba Model Studio binds key capabilities to deployment regions: the published region matrix shows batch inference as supported for Singapore and Beijing, and as not supported for US (Virginia) and Hong Kong (China).[4]

If your benchmark score is measured under one region/mode/cap shape and your production traffic runs under another, the score is directional at best.

The practical error: ranking by realtime-only success

Many teams still do this:

run N benchmark prompts in real-time,
choose the top model by pass rate,
apply that ranking to full production traffic.

That method ignores where token volume actually lands. In document-heavy or QA-heavy operations, replay and regression traffic can exceed live interactive traffic. Once batch lanes are available, the effective unit cost for that portion can change materially because the documented batch price is 50% of real-time on supported models.[4]

A realtime-only benchmark therefore answers a narrower question:

“Who wins on interactive quality under this harness?”

Production routing needs a larger question:

“Who wins on decision quality under the true token mix of interactive + replay, inside our region and governance constraints?”

A boundary-aware scoring formula

A useful normalized score for routing meetings:

\[ \text{Cost per reliable decision} = \frac{C_{rt}\cdot T_{rt} + C_{batch}\cdot T_{batch}}{N_{correct}\cdot (1 - r_{critical})} \]

Where:

\(C_{rt}\), \(C_{batch}\): effective per-token cost in real-time and batch lanes,
\(T_{rt}\), \(T_{batch}\): token volumes by lane,
\(N_{correct}\): correct outcomes on task set,
\(r_{critical}\): critical-error rate after policy review.

This is not academic decoration; it prevents obvious misreads when two models differ in output-token inflation or region-available batch support.
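As a minimal sketch, the formula above can be written as a small Python function. The function name, argument names, and the unit convention (cost in USD per 1M tokens, volume in millions of tokens) are assumptions made for illustration, and the numbers in the usage lines are hypothetical, not measured results.

```python
def cost_per_reliable_decision(c_rt, t_rt, c_batch, t_batch, n_correct, r_critical):
    """Total lane cost divided by correct outcomes, discounted by the critical-error rate.

    Assumed units: c_* in USD per 1M tokens, t_* in millions of tokens.
    """
    if n_correct <= 0:
        raise ValueError("n_correct must be positive")
    if not 0.0 <= r_critical < 1.0:
        raise ValueError("r_critical must be in [0, 1)")
    total_cost = c_rt * t_rt + c_batch * t_batch
    return total_cost / (n_correct * (1.0 - r_critical))


# Hypothetical inputs: 10M real-time tokens at $1.00/1M, 30M batch tokens at $0.50/1M,
# 900 correct outcomes, 10% critical-error rate.
example = cost_per_reliable_decision(
    c_rt=1.0, t_rt=10, c_batch=0.5, t_batch=30, n_correct=900, r_critical=0.1
)
print(f"{example:.4f} USD per reliable decision")
```

Because input and output tokens are usually priced differently, the per-lane cost passed in here would need to be a blended effective rate, or the numerator could be split into input and output terms.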

Numeric example with published prices

Assume a weekly workload of 10M input + 2M output tokens in real time and 30M input + 6M output tokens in replay.

Lane A: no batch support in selected region

Use Qwen3-Max Global ≤32K published prices as a representative realtime lane:

input: $0.359 / 1M,
output: $1.434 / 1M.[4]

Weekly cost:

realtime = 10 × 0.359 + 2 × 1.434 = $6.458
replay (also realtime-priced) = 30 × 0.359 + 6 × 1.434 = $19.374
total = $25.832

Lane B: same quality tier, batch-enabled replay

Keep realtime unchanged, but replay shifts to a supported batch lane at 50% documented rate.[4]

Weekly cost:

realtime = $6.458
replay (batch 50% off) = 0.5 × 19.374 = $9.687
total = $16.145

Same benchmark quality, different deployment boundary, 37.5% lower total token cost.

If a benchmark report omits this boundary, its model ranking can be operationally wrong even when its pass@k table is internally correct.
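The Lane A and Lane B totals above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the prices and the 50% batch factor are the ones quoted in the text, while the helper name `lane_cost` and the variable names are illustrative.

```python
PRICE_IN = 0.359     # USD per 1M input tokens, as quoted in the text
PRICE_OUT = 1.434    # USD per 1M output tokens, as quoted in the text
BATCH_FACTOR = 0.5   # documented batch rate relative to real-time


def lane_cost(input_m, output_m, factor=1.0):
    """Cost in USD for token volumes given in millions, scaled by a price factor."""
    return factor * (input_m * PRICE_IN + output_m * PRICE_OUT)


realtime = lane_cost(10, 2)
replay_realtime_priced = lane_cost(30, 6)
replay_batch_priced = lane_cost(30, 6, BATCH_FACTOR)

lane_a = realtime + replay_realtime_priced   # no batch support in region
lane_b = realtime + replay_batch_priced      # replay moved to batch lane
saving = 1 - lane_b / lane_a

print(f"Lane A: ${lane_a:.3f}")   # Lane A: $25.832
print(f"Lane B: ${lane_b:.3f}")   # Lane B: $16.145
print(f"Saving: {saving:.1%}")    # Saving: 37.5%
```

The saving equals half of the replay share of spend: replay is 75% of the Lane A total, so a 50% discount on it removes 37.5%.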

Why thinking-budget normalization is now mandatory

The Qwen3 documentation and technical report explicitly frame hybrid thinking/non-thinking operation with adaptive budget control.[1][2] The DeepSeek docs expose materially different output envelopes between modes.[3]

For benchmark comparability, at least three controls must be fixed:

Mode policy (thinking on/off by task family),
Output cap policy (e.g., 4K/8K/32K tiers),
Stop/retry policy (timeouts, retries, truncation behavior).

Without these, “model quality delta” often includes hidden budget policy deltas.
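One way to make the three controls checkable is to record them per run and diff them before comparing scores. The sketch below assumes a hypothetical `EvalPolicy` record; its field names are illustrative and do not correspond to any provider's API parameters.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvalPolicy:
    """Hypothetical record of the controls fixed for one benchmark run."""
    thinking_enabled: bool            # mode policy
    max_output_tokens: int            # output cap policy
    timeout_s: float                  # stop policy
    max_retries: int                  # retry policy
    truncation_counts_as_fail: bool   # truncation behavior


def policy_mismatches(a, b):
    """Return the names of controls that differ between two runs."""
    return [f.name for f in fields(EvalPolicy) if getattr(a, f.name) != getattr(b, f.name)]


# Illustrative values only: two runs that differ in mode and output cap.
run_a = EvalPolicy(False, 4096, 60.0, 1, True)
run_b = EvalPolicy(True, 32768, 60.0, 1, True)

print(policy_mismatches(run_a, run_b))  # ['thinking_enabled', 'max_output_tokens']
```

A non-empty result means any quality delta between the two runs is confounded with a budget-policy delta.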

Minimum eval card your team should publish internally

Before any routing change, require a one-page eval card with:

deployment mode and region,
realtime and replay token mix,
batch support status for that lane,
mode policy by task class (thinking vs non-thinking),
max output caps used in eval,
pass rate + critical-error rate,
cost per reliable decision.

If any field is missing, mark the recommendation as directional rather than promotable.
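The eval card and its missing-field rule can be sketched as a small data structure. The class name `EvalCard`, the field names, and the free-text field types are assumptions for illustration; the pass-rate and critical-error-rate item is split into two fields here.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class EvalCard:
    """Hypothetical one-page eval card; None means the field was not filled in."""
    deployment_mode_and_region: Optional[str] = None
    realtime_replay_token_mix: Optional[str] = None
    batch_supported: Optional[bool] = None
    mode_policy_by_task_class: Optional[str] = None
    max_output_caps: Optional[str] = None
    pass_rate: Optional[float] = None
    critical_error_rate: Optional[float] = None
    cost_per_reliable_decision: Optional[float] = None


def missing_fields(card):
    """Names of fields left unset. False and 0.0 count as filled in."""
    return [f.name for f in fields(card) if getattr(card, f.name) is None]


def recommendation_status(card):
    """'promotable' only when every field is present, otherwise 'directional'."""
    return "directional" if missing_fields(card) else "promotable"


# Illustrative card with the cost field left empty.
card = EvalCard(
    deployment_mode_and_region="example-mode / example-region",
    realtime_replay_token_mix="25% realtime / 75% replay",
    batch_supported=False,
    mode_policy_by_task_class="thinking for long reasoning, non-thinking otherwise",
    max_output_caps="8K",
    pass_rate=0.9,
    critical_error_rate=0.0,
)
print(recommendation_status(card), missing_fields(card))
```

Checking against `None` rather than truthiness matters: "batch not supported" and a zero critical-error rate are valid, reportable values.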

Falsifier and watchlist

Falsifier for this thesis: if teams can show, across multiple workloads, that realtime-only ranking remains stable after adding replay-token economics and region capability constraints, then batch-aware normalization is less important than argued here.
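A simple way to test the falsifier on one workload is to compare the model ordering by realtime pass rate against the ordering by cost per reliable decision. The sketch below uses hypothetical model names and made-up numbers purely to show the mechanics; it is not evidence for or against the thesis.

```python
def rank(scores, higher_is_better):
    """Order model names from best to worst for the given metric."""
    return sorted(scores, key=scores.get, reverse=higher_is_better)


def ranking_is_stable(pass_rate, cost_per_decision):
    """True when realtime-only ranking matches the boundary-aware ranking."""
    return rank(pass_rate, True) == rank(cost_per_decision, False)


# Hypothetical workload: model_x leads on pass rate but costs more per reliable decision.
pass_rate = {"model_x": 0.92, "model_y": 0.90}
cost_per_decision = {"model_x": 0.031, "model_y": 0.020}

print(rank(pass_rate, True))                            # ['model_x', 'model_y']
print(rank(cost_per_decision, False))                   # ['model_y', 'model_x']
print(ranking_is_stable(pass_rate, cost_per_decision))  # False
```

If this check returns True across multiple real workloads, that counts as support for the falsifier; a single inversion on a representative workload counts against it.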

Watchlist (next 1–2 quarters):

Whether region matrices add batch support to currently unsupported regions.[4]
Whether provider output ceilings or default caps change across model updates.[3]
Whether hybrid thinking controls stay interface-stable across releases.[1][2]
Whether teams report cost-per-reliable-decision instead of pass@k-only leaderboards in model promotion docs.

Sources

Editor’s Pick Review

This piece wins the add-on editor-pick slot because it identifies a live evaluation failure mode that many teams still miss, then closes the gap with a deployable normalization formula tied to explicit region and pricing boundaries.

It keeps a rare balance of rigor and usability: concrete numeric anchors, falsifier logic, and a one-page internal eval-card standard that can be adopted immediately by model-routing teams. The Chinese translation also preserves technical precision while staying readable, which strengthens cross-language editorial quality.

cronfeed.work