Ant's Ling 2.5 shows that China's open-model race is shifting toward token efficiency

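Ant Group's A Space campus in Hangzhou gives inclusionAI's model-card narrative a real-world anchor: the public claims now reach beyond lab benchmarks and point to a Hangzhou platform company trying to turn open models into high-efficiency infrastructure. [5]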

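As of 2026-04-22 09:03:44 UTC, the model cards from Ant Group's inclusionAI give AI-China a useful signal: the next round of open-model competition is no longer only about who can release the larger checkpoint or post the higher score on a single benchmark. Ling-2.5-1T and Ring-2.5-1T bring a narrower question to the front: for every generated token and every unit of serving memory, how much useful reasoning, tool use, and long-context work can a trillion-parameter model actually deliver? [1][2][3]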

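This framing holds because the two releases are themselves split by workload. Ling-2.5-1T is described as an "instant" model line: 1T total parameters, 63B active parameters, a pre-training corpus expanded to 29T tokens, hybrid linear attention, and support for contexts of up to 1M tokens through YaRN extrapolation. [1] Ring-2.5-1T is the thinking sibling: a hybrid-linear-attention reasoning model aimed at deep thinking and long-horizon task execution, which pairs its architecture claims with self-tested math results and agent-search and tool benchmarks. [2]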

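The most important benchmark note here is that the two sets of claims must be kept apart. Ling is evaluated as a high-efficiency general-purpose lane: instruction following, long context, agent compatibility, and lower token consumption. Ring is evaluated as a deeper reasoning lane: mathematical proof, coding, tool collaboration, and extended execution. [1][2] A team building a model router that drops either model into a generic leaderboard row while ignoring this distinction will get distorted evaluation results.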

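Image note: the lead image shows Ant's A Space campus in Hangzhou in 2021. It appears in this article as a photograph of a real campus, not a model-card screenshot or a launch-event graphic. The choice is deliberate. This article is about the infrastructure economics surrounding Ant's open-model family, so a real photo of Ant's Hangzhou campus is more honest than yet another benchmark chart. [5]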

The headline goes beyond 1T

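The 1T label matters, but the part of this release most worth watching is token efficiency. The Ling-2.5-1T model card says the trillion-scale version activates 63B parameters, up from 51B in the earlier Ling 2.0 trillion-scale architecture, and that after incremental training the attention mix was changed to a 1:7 combination of MLA and Lightning Linear. [1] That gives the model card a concrete technical claim: if the attention path changes, larger active capacity does not have to make long-context serving behavior degrade in proportion.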
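The reasoning behind that claim can be made concrete with a rough cache count. The sketch below is not taken from the model card: it assumes that "1:7" means one MLA layer for every seven Lightning Linear layers, that MLA layers keep a per-token cache while linear-attention layers keep a fixed-size state, and every number passed to it (layer count, widths, state size) is a hypothetical placeholder.

```python
def cached_values(seq_len: int, n_layers: int, softmax_every: int,
                  per_token_width: int, linear_state_size: int) -> int:
    """Count cached values for one sequence in a hybrid attention stack.

    softmax_every=8 models an assumed 1:7 mix (1 growing-cache layer per 8);
    softmax_every=1 models a stack where every layer keeps a per-token cache.
    """
    n_growing = n_layers // softmax_every   # layers whose cache grows with seq_len
    n_fixed = n_layers - n_growing          # layers holding a constant-size state
    return n_growing * seq_len * per_token_width + n_fixed * linear_state_size


# Hypothetical shapes, chosen only to show how the two stacks scale.
for tokens in (32_000, 256_000, 1_000_000):
    hybrid = cached_values(tokens, 80, 8, 512, 128 * 128)
    full = cached_values(tokens, 80, 1, 512, 128 * 128)
    print(tokens, round(full / hybrid, 2))
```

Under these assumptions the ratio approaches the layer ratio as sequences get longer, which is why the attention mix, and not the parameter count alone, drives long-context serving cost.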

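The model card makes the same point in benchmark language. It says Ling-2.5-1T was evaluated across several dimensions, including knowledge, reasoning, agentic performance, instruction following, and long-context processing, and that in the selected cases it reaches reasoning performance comparable to frontier "thinking" models while using fewer tokens. [1] The specific comparisons are still vendor self-reporting and should be read as first-party claims. Even so, the choice of metric matters: Ant wants readers to evaluate token efficiency alongside answer accuracy.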

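This is a sensible shift in evaluation. In production, a model that reaches a similar answer with fewer output tokens changes latency, queueing, cost, and context-management behavior. It also changes which tasks can enter the "instant" lane first and escalate to the slower thinking lane only when needed. The builder's question after the release therefore becomes: "Which workloads can Ling finish within a low enough token budget, so that only the harder tasks go to a thinking model?"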
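One way to picture this instant-then-escalate pattern is a small router. This is a minimal sketch under stated assumptions, not an API from either model card: `Reply`, `route`, the `confident` flag, and the default budget of 512 tokens are all hypothetical, and the two callables stand in for whatever client a team actually uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Reply:
    text: str
    output_tokens: int
    confident: bool  # hypothetical signal, e.g. from a verifier or self-check


def route(prompt: str,
          instant: Callable[[str, int], Reply],
          thinking: Callable[[str], Reply],
          token_budget: int = 512) -> tuple[str, Reply]:
    """Try the instant lane first; escalate if the budget or confidence check fails."""
    first = instant(prompt, token_budget)
    if first.confident and first.output_tokens <= token_budget:
        return "instant", first
    return "thinking", thinking(prompt)
```

Logging which lane handled each request, together with the tokens spent, gives the per-workload data needed to decide where the budget should sit.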

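Ling's own limitations section supports this reading. The model card says Ling-2.5-1T lays a foundation for general-purpose agents but still trails frontier models on complex agent interactions and long-horizon tasks. [1] That caveat marks the routing boundary. Ling looks more like the fast, long-context, open-weight lane; Ring is where Ant is trying to push deeper reasoning and long-horizon execution.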

Ring pushes evaluation toward decoding economics

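Ring-2.5-1T draws the same thread more clearly from the reasoning side. The model card calls it the first open-source trillion-parameter thinking model built on a hybrid linear attention architecture and makes an operational claim: for sequences longer than 32K tokens, it reports memory access overhead reduced by more than 10x and generation throughput increased by more than 3x. [2] These are first-party architecture claims, but once thinking models lengthen their outputs, claims of this kind sit at the core of production evaluation.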

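The evaluation section then puts math and agent work side by side. The model card reports self-tested IMO 2025 and CMO 2025 results, says Ring-2.5-1T reaches gold-medal level on both, and lists harder reasoning and execution benchmarks, including IMOAnswerBench, AIME 26, HMMT 25, LiveCodeBench, ARC-AGI-V2, Gaia2-search, Tau2-bench, and SWE-Bench Verified. [2] The repository linked from the model card also publishes example solution folders for IMO25 and CMO25. That is a useful transparency signal, though it still needs to be backed by independent benchmark audits. [4]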

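The production significance lies mainly in how Ant positions "thinking." If a reasoning model consumes large amounts of internal text, scratch work, tool calls, and summarization, then decoding throughput becomes part of the capability claim. A model can win on a static benchmark and still lose on a real workload because it takes too much time or memory. Ring's model card tries to show that architecture, RL, and benchmark depth have to be evaluated together. [2]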

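This gives Ant a recognizable lane within AI-China. DeepSeek turned sparse attention and reasoning economics into mainstream topics. Kimi keeps emphasizing long context and agent-swarm language. Qwen has built open weights and model-platform distribution into a very broad developer surface. Ant's Ling/Ring split enters the same conversation with a clear fintech-platform bias: fewer wasted tokens, long context that can hold documents, and agent compatibility that fits professional workflows. [1][2][3]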

What to test next

For teams evaluating Ling-2.5-1T or Ring-2.5-1T, the next set of benchmarks should be organized by workload and go beyond leaderboard rank.

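First, test answer quality per generated token. The comparison needs to cover both final scores and token consumption, especially whether Ling can close out common tasks with fewer output tokens than a thinking model while preserving instruction following and source-grounded behavior. Document review, customer-service drafting, compliance memo extraction, and internal knowledge-base Q&A make better tests here than abstract chat prompts. [1]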
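A report that pairs accuracy with token cost could look like the sketch below. The record format (`correct`, `output_tokens`) and the `tokens_per_correct` metric are assumptions made for illustration, not a metric defined in the model card.

```python
from statistics import mean


def efficiency_report(runs: list[dict]) -> dict:
    """Summarize accuracy and output-token cost for one model on one workload.

    Each run is {"correct": bool, "output_tokens": int}.
    """
    solved = sum(1 for r in runs if r["correct"])
    total_tokens = sum(r["output_tokens"] for r in runs)
    return {
        "accuracy": solved / len(runs),
        "avg_output_tokens": mean(r["output_tokens"] for r in runs),
        # Tokens spent across all runs per correctly solved task.
        "tokens_per_correct": total_tokens / solved if solved else float("inf"),
    }
```

Running the same task set through both lanes and comparing `tokens_per_correct` at matched accuracy is one way to check a "comparable quality with fewer tokens" claim on a team's own workload.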

Second, test context position and retrieval stress. Ling's 1M-token context claim is valuable only if the model keeps useful accuracy across positions, distraction densities, and mixed-document formats. The model card cites long-context evaluations such as NIAH, RULER, and MRCR, and it also acknowledges a remaining gap to leading closed API models. [1] Teams need to reproduce this boundary with their own documents: contracts, financial disclosures, support histories, medical policies, or codebases.
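A position sweep in the spirit of needle-in-a-haystack tests can be built from a team's own passages. The helper names and the default depths below are hypothetical; this is a sketch of the general technique, not the harness behind the model card's NIAH, RULER, or MRCR numbers.

```python
def build_haystack(filler: list[str], needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) among filler passages."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    index = round(depth * len(filler))
    return "\n".join(filler[:index] + [needle] + filler[index:])


def position_grid(filler: list[str], needle: str,
                  depths: tuple[float, ...] = (0.0, 0.25, 0.5, 0.75, 1.0)) -> dict[float, str]:
    """Build one prompt body per depth so accuracy can be compared across positions."""
    return {d: build_haystack(filler, needle, d) for d in depths}
```

Using real contracts or support histories as `filler`, and a fact that appears nowhere else as `needle`, keeps the test close to the documents the model would see in production.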

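Third, test agent handoff, not just single-step tool use. Ling's card says the model was trained with agentic RL in high-fidelity interactive environments and is compatible with Claude Code, OpenCode, and OpenClaw; Ring's card says it can be adapted to agentic programming frameworks and personal AI assistants. [1][2] These claims deserve to be tested through failure recovery, covering what happens beyond happy-path function calls. A useful eval should include tool errors, stale files, ambiguous instructions, and a forced checkpoint at which the model must revise its own plan.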
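Tool errors can be injected deterministically so that recovery is measured on the same step in every run. The wrapper below is a minimal sketch; `FlakyTool` and its parameters are hypothetical and not part of any framework named above.

```python
from typing import Callable


class FlakyTool:
    """Wrap a tool so that chosen call numbers fail, to exercise agent recovery."""

    def __init__(self, tool: Callable[..., str], fail_on_calls: set[int]):
        self.tool = tool
        self.fail_on_calls = fail_on_calls  # 1-based call numbers that should raise
        self.calls = 0

    def __call__(self, *args, **kwargs) -> str:
        self.calls += 1
        if self.calls in self.fail_on_calls:
            raise RuntimeError(f"injected failure on call {self.calls}")
        return self.tool(*args, **kwargs)
```

An eval can then record whether the agent retries, changes its plan, or gives up after the injected failure, which says more about handoff quality than a clean run does.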

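Fourth, test the serving footprint honestly. Even with 63B active parameters and hybrid linear attention, a 1T model is a deployment that needs careful planning. The SGLang example in Ling's model card is multi-node, and it states explicitly that the commands must be adjusted to the actual environment. [1] This means provider availability, quantization, batch size, and hardware lane are all part of the benchmark. An open license matters, but openness still has to face deployment physics.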
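A back-of-the-envelope weight calculation shows why the deployment is multi-node. The sketch assumes all 1T parameters stay resident in accelerator memory even though only 63B are active per token, counts weights only (no KV cache, activations, or runtime overhead), and uses an 80 GB accelerator purely as a hypothetical example.

```python
import math

# Bytes per parameter for common weight formats.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(total_params: float, dtype: str) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * BYTES_PER_PARAM[dtype] / 1e9


def min_accelerators(total_params: float, dtype: str, memory_gb: float) -> int:
    """Lower bound on accelerator count from weights only; real deployments need more."""
    return math.ceil(weight_memory_gb(total_params, dtype) / memory_gb)


for dtype in BYTES_PER_PARAM:
    print(dtype, weight_memory_gb(1e12, dtype), min_accelerators(1e12, dtype, 80))
```

At 2 bytes per parameter, 1T parameters come to roughly 2,000 GB of weights before any cache or batch headroom, so quantization and hardware choice change the result as much as the model does.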

The AI-China signal

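The larger signal is that China's open-model competition is becoming more segmented. The move worth watching is the separation of a fast, long-context instant lane from a deeper thinking lane, with both placed in the context of hybrid linear attention and agent workload economics. [1][2]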

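For Ant, this separation fits the application surfaces the company is likely to face. Payments, financial services, healthcare interfaces, merchant operations, risk review, and document-heavy workflows all reward models that can read long contexts and keep high-frequency requests away from expensive reasoning marathons. A platform pattern in which a fast model handles a high share of routine work and a thinking model takes the escalated hard cases is more credible than one universal model carrying every latency and cost tier at once.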

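The caveat to keep in view is the maturity of the evidence. The strongest numbers for Ling and Ring still come from the publisher's own model cards. That is normal in a release cycle, but the next step in confidence has to come from reproducible third-party evals, provider benchmarks, and user workload traces. Until then, the safer conclusion is a narrow one: Ant has put token efficiency, hybrid linear attention, and long-context serving economics at the center of its open-model narrative, and for AI-China in Q2 2026 that is a real directional signal. [1][2][3]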

cronfeed.work
