As of 2026-03-27 UTC, one of the clearest production surfaces for AI in China is no longer the generic assistant app. It is the smart cockpit.

The reason is operational. A car creates a repeated, hands-busy, latency-sensitive voice workload where the model has to do more than answer a prompt. It has to manage interruption, understand cabin context, call tools, hand work across devices, and keep the interaction calm enough that a driver will actually keep using it.[1][2][3]

That is why the Geely-StepFun pairing matters. Geely's public vehicle strategy is now explicitly AI-native, with a system-level OS meant to coordinate vehicles, phones, tablets, wearables, and smart-home endpoints.[1] StepFun's public audio stack is explicitly built for realtime speech interaction rather than text-only chat, including WebSocket-based realtime access, end-to-end voice interaction, native tool calling, web search, multilingual handling, and emotional expression.[3][4] Put those together and the cockpit starts to look less like a flashy demo scene and more like a paid distribution lane.

The use case: turn voice from feature into operating surface

Geely's CES 2025 announcement is useful because it describes the target architecture in plain terms. The company says its "Full-Domain AI for Smart Vehicles" system rests on an AI-native OS that can coordinate perceived data and service delivery across the vehicle and adjacent personal devices.[1] That is materially different from the older infotainment model where the car mostly waited for one command at a time.

The later Geely Galaxy M9 launch page makes the same shift more concrete. The M9 is positioned not just as an SUV with a screen, but as a vehicle that debuts Geely's next-generation AI Smart Cockpit alongside AI digital chassis and assisted-driving systems.[2] In other words, the smart cockpit is being treated as part of the vehicle's core product architecture, not as a decorative software layer that can lag behind the hardware.

For AI builders, that distinction matters because the workload inside the cabin is unusually sticky. Navigation, climate control, music, calls, child-seat questions, charging stops, fatigue prompts, and trip planning all create short, repeated turns. A model that survives those turns reliably wins something better than a benchmark headline: daily habit.

Why the StepFun stack fits the cockpit better than a chat-only lane

StepFun's public documentation now reads like a speech stack designed for live interaction, not a thin voice wrapper over text completions. Its realtime interface is organized around persistent WebSocket connections rather than isolated HTTP calls, which is exactly what low-friction turn-taking needs in a moving car.[3]
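
To make the persistent-connection point concrete, here is a minimal sketch of one long-lived duplex session reused across several turns. It is a simulation only: in-process asyncio queues stand in for a real WebSocket, and the names `assistant`, `main`, and the `ack:` reply format are hypothetical and do not reflect StepFun's actual API.

```python
import asyncio


async def assistant(inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """Simulated realtime backend (hypothetical, no network involved)."""
    while True:
        msg = await inbox.get()
        if msg == "close":
            break
        # A real backend would stream audio or events; here we just acknowledge.
        await outbox.put(f"ack:{msg}")


async def main() -> list:
    to_server: asyncio.Queue = asyncio.Queue()
    from_server: asyncio.Queue = asyncio.Queue()
    # The session is opened once and kept alive for every turn.
    server = asyncio.create_task(assistant(to_server, from_server))
    replies = []
    for utterance in ["turn on the AC", "find a charger"]:
        await to_server.put(utterance)          # same session, no reconnect
        replies.append(await from_server.get())
    await to_server.put("close")                # explicit end of session
    await server
    return replies


replies = asyncio.run(main())
print(replies)
```

The design point is that both turns travel over the same open channel, so no per-turn connection setup sits between the driver and the reply.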

The model layer matters too. StepFun describes step-audio-2 as an end-to-end audio model for natural interaction, with support for Mandarin, English, Japanese, emotional expression, voice cloning, native tool calls, and web search.[3] The open-source Step-Audio repository pushes the same picture from the engineering side: one framework spanning understanding and generation, multilingual and dialect handling, controllable speech style, and a realtime inference pipeline for continuous interaction.[4]

That combination changes the practical bar for an in-car assistant. The assistant no longer needs to be judged only on answer quality after a clean prompt. It also needs to stay usable when a user changes topic halfway through a sentence, mixes command with conversation, asks for a search result, or expects the system to respond in a tone that feels appropriate inside a confined shared space.
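
One way to picture the interruption requirement is a small turn-taking state machine with barge-in. This is a sketch under stated assumptions, not a vendor implementation: the `TurnManager` class, its event names, and the logged action strings are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class State(Enum):
    IDLE = auto()       # waiting for a new utterance
    LISTENING = auto()  # user is speaking
    SPEAKING = auto()   # assistant is playing a reply


@dataclass
class TurnManager:
    """Hypothetical turn-taking controller; not part of any vendor SDK."""
    state: State = State.IDLE
    log: list = field(default_factory=list)

    def on_user_speech_start(self) -> None:
        # Barge-in: if the assistant is mid-reply, cancel playback first.
        if self.state is State.SPEAKING:
            self.log.append("cancel_playback")
        self.state = State.LISTENING

    def on_user_speech_end(self, transcript: str) -> None:
        if self.state is not State.LISTENING:
            return  # ignore stray end-of-speech events
        self.log.append(f"respond:{transcript}")
        self.state = State.SPEAKING

    def on_playback_done(self) -> None:
        if self.state is State.SPEAKING:
            self.state = State.IDLE


tm = TurnManager()
tm.on_user_speech_start()
tm.on_user_speech_end("navigate to the nearest charger")
tm.on_user_speech_start()  # driver interrupts the spoken reply
tm.on_user_speech_end("actually, play some music")
tm.on_playback_done()
print(tm.log)
```

In this sketch the second utterance cuts off the first reply before a new response is produced, which is the behavior a topic change mid-interaction calls for.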

What the Geely-StepFun rollout signals

The most important public commercialization signal is the 2025 WAIC showcase announcement. BusinessWire's release says Geely Auto Group and StepFun jointly presented what they described as the first human-like in-vehicle AI agent for the Geely Galaxy M9, along with dual AI agents and an Agent OS frame for future interaction.[5]

That does not prove mass-market retention yet. But it does show that the companies want the cockpit to be the place where realtime multimodal models graduate from lab narrative into shipped product logic.

An inference from these sources is that the smart-cockpit stack is settling into three layers:

  1. Realtime turn-taking for wake, interrupt, clarify, and continue.[3]
  2. Tool and service orchestration for navigation, search, media, device handoff, and vehicle functions.[1][3]
  3. Tone and trust management so the assistant feels usable in a family/shared-cabin setting rather than merely accurate in a benchmark sense.[3][4]

That three-layer view explains why the cockpit is strategically attractive. It demands model capability, orchestration discipline, and hardware distribution at the same time. Few AI surfaces bundle all three.
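
The three layers listed above can be sketched as a single per-turn pipeline. This is an illustrative toy, assuming an upstream model has already turned speech into an `(intent, arg)` pair; the tool names, the `TOOLS` registry, and the `passengers_asleep` flag are hypothetical and not drawn from Geely or StepFun documentation.

```python
from typing import Callable, Dict


# Layer 2: hypothetical tool registry mapping intents to handlers.
def set_climate(arg: str) -> str:
    return f"cabin temperature set to {arg}"


def start_navigation(arg: str) -> str:
    return f"route started to {arg}"


TOOLS: Dict[str, Callable[[str], str]] = {
    "climate": set_climate,
    "navigate": start_navigation,
}


def route(intent: str, arg: str) -> str:
    handler = TOOLS.get(intent)
    if handler is None:
        return "sorry, I can't do that yet"  # graceful fallback for unknown intents
    return handler(arg)


# Layer 3: tone policy chooses how the reply is delivered in a shared cabin.
def apply_tone(text: str, passengers_asleep: bool) -> dict:
    return {"text": text, "volume": "low" if passengers_asleep else "normal"}


# Layer 1 is represented by calling handle_turn once per completed user turn.
def handle_turn(intent: str, arg: str, passengers_asleep: bool = False) -> dict:
    return apply_tone(route(intent, arg), passengers_asleep)


reply = handle_turn("climate", "22C", passengers_asleep=True)
print(reply)
```

Keeping routing and tone as separate steps means a new vehicle function or a new delivery policy can be added without touching the other layer.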

Why this counts as distribution, not just interface polish

Consumer AI apps can buy traffic. A cockpit has to be embedded into the hardware sale, the operating system, and the driver routine.

That makes the revenue logic slower to start, but harder to dislodge. Once the assistant is tied into the vehicle lifecycle, updates, account identity, navigation memory, and cross-device continuity, the model provider is no longer competing only for one session. It is competing for the default interaction layer of the cabin.

This is the real significance of the Geely materials' cross-endpoint framing and the StepFun materials' realtime-tooling emphasis.[1][3] The value is not "voice AI in a car" as a novelty. The value is that the car is one of the few environments where multimodal AI can become an always-nearby control surface with repeat exposure and direct product attachment.

Boundary, falsifier, and what to watch

The public evidence is still mostly vendor-side: official strategy pages, product docs, and launch communication. That is enough to identify the direction of travel, but not enough to claim a settled market winner.

The thesis in this article weakens if, over the next product cycle, all three of the signals listed under "What to watch next" fail to materialize together.

What to watch next:

  1. Whether Geely extends the AI-cockpit behavior beyond one flagship lane and into broader vehicle lines.[2][5]
  2. Whether StepFun's public audio stack keeps improving tool reliability and realtime behavior, not just voice style range.[3][4]
  3. Whether more Chinese OEMs treat the cockpit as a model-routing surface rather than a branded voice skin.[1][5]

Sources

  1. Geely Auto, "Geely Unveiled Auto Industry's First-Ever 'Full-Domain AI for Smart Vehicles' Technology System" (January 2025).
  2. Geely Auto, "Geely Auto initiated the 'Five by Five' Globalization Strategy and unveiled its new AI Powered, Six-Seater Flagship SUV" (May 2025).
  3. StepFun, "Realtime" and audio model documentation (WebSocket realtime access, step-audio-2, tool calling, and web search).
  4. stepfun-ai, "Step-Audio" GitHub repository (open-source speech-interaction framework, multilingual handling, controllable emotion/dialect, realtime inference pipeline).
  5. BusinessWire, "Geely Auto Group Teams Up with StepFun for a Joint Showcase at the 2025 World Artificial Intelligence Conference" (July 31, 2025).