Inference gets the headlines. Embeddings run the rails.

Every production RAG pipeline in China has the same upstream dependency: a model that converts text into a high-dimensional vector that a database can search. That step — embedding — determines retrieval precision before the LLM ever sees a retrieved document. And since mid-2023, the default answer to "what embedding model?" for Chinese-language retrieval has been BGE from BAAI, the Beijing Academy of Artificial Intelligence.

Understanding why requires two things: the C-MTEB benchmark that exposed the gap, and the C-Pack release package that filled it.

The benchmark that revealed the gap

General-purpose embedding models such as text-embedding-ada-002 and multilingual-e5-large were built to generalize across languages; E5-large-v2, often compared alongside them, was trained mainly on English. They work tolerably for Chinese retrieval. They do not work as well as a model trained primarily on Chinese-language corpora and evaluated on Chinese-specific retrieval, classification, and clustering tasks.

C-MTEB (Chinese Massive Text Embedding Benchmark) established that gap with numbers [1]. Published alongside the C-Pack paper in 2023, C-MTEB assembled six task categories, all in Chinese: retrieval, reranking, semantic textual similarity, pair classification, classification, and clustering. When general-purpose models were benchmarked on those tasks, the performance delta was consistent and large enough to matter for production retrieval precision. A model like text-embedding-ada-002 could rank documents correctly at a macro level, but it struggled with the disambiguation that separates relevant from merely adjacent content in Chinese legal or financial text.

BGE-large-zh, the initial release (later refreshed as BGE-large-zh-v1.5), outperformed every available multilingual alternative on C-MTEB retrieval tasks [1][4]. The gap was not marginal.

C-Pack: supply chain as a package

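The C-Pack release was structured as a deployment package, not just a model weight drop [1]. As described in the paper, it shipped three things together: C-MTEB, the evaluation benchmark; C-MTP, the curated training dataset; and C-TEM, the family of embedding models released under the BGE name. The training recipes were published alongside them in the FlagEmbedding repository [2].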

That package structure mattered for adoption. An engineering team evaluating embedding options could run a full retrieval benchmark on their own data in under a day, compare against published results, and then fine-tune on domain text using the same toolchain. The supply chain — from open weight to deployed endpoint — lived inside one GitHub repository.
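The "benchmark on your own data" step comes down to scoring ranked results against labelled relevant documents. The following is a minimal sketch of two common retrieval metrics, assuming binary relevance labels; the query and document IDs are hypothetical, and this is not the C-MTEB harness itself.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG@k: rewards placing relevant documents early."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical data: the ranking a system returned per query, and the
# documents a human labelled as relevant for that query.
runs = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9", "d4"]}
qrels = {"q1": {"d1"}, "q2": {"d2", "d4"}}

mean_recall_at_2 = sum(recall_at_k(runs[q], qrels[q], 2) for q in runs) / len(runs)
mean_ndcg_at_3 = sum(ndcg_at_k(runs[q], qrels[q], 3) for q in runs) / len(runs)
```

Running the same two functions over rankings from two different embedding models gives a like-for-like comparison on domain data, which is the decision the published leaderboard numbers cannot make for a specific corpus.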

BGE-M3: closing the multilingual routing decision

The single most consequential BGE release for production deployments outside pure Chinese-language contexts was BGE-M3, published in early 2024 [3]. M3 stands for multi-lingual, multi-functionality (dense, sparse, and multi-vector retrieval in one model), and multi-granularity (inputs from single sentences up to 8,192-token documents).
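To make "multi-functionality" concrete, here is a hedged sketch of how the three retrieval signals could be combined into one relevance score. It uses toy inputs and plain Python; the function names and the weights are illustrative assumptions, not the FlagEmbedding implementation.

```python
import math

def dense_score(q, d):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0

def sparse_score(q_weights, d_weights):
    """Lexical match: sum of weight products over tokens present on both sides."""
    return sum(w * d_weights[tok] for tok, w in q_weights.items() if tok in d_weights)

def multi_vector_score(q_vecs, d_vecs):
    """Late interaction: each query token vector takes its best document match."""
    if not q_vecs or not d_vecs:
        return 0.0
    best = [max(dense_score(q, d) for d in d_vecs) for q in q_vecs]
    return sum(best) / len(q_vecs)

def hybrid_score(dense, sparse, multi, weights=(0.4, 0.2, 0.4)):
    """Weighted average of the three signals; the weights are illustrative only."""
    w_dense, w_sparse, w_multi = weights
    return (w_dense * dense + w_sparse * sparse + w_multi * multi) / sum(weights)

# Toy inputs standing in for one query and one document.
d = dense_score([1.0, 0.0], [1.0, 0.0])
s = sparse_score({"合同": 0.5, "违约": 0.3}, {"合同": 0.4})
m = multi_vector_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
score = hybrid_score(d, s, m)
```

The practical point is that all three inputs come from a single forward pass of one model, so a hybrid ranker does not need a separate BM25 index or a second encoder.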

The engineering impact: cross-language retrieval stacks that previously required a Chinese model for zh content and a separate model for en, ja, or ko content could collapse to a single BGE-M3 endpoint. For enterprises with mixed-language knowledge bases — common in financial services and manufacturing contexts where product documentation arrives in Japanese or English but queries arrive in Chinese — BGE-M3 resolved a two-model routing problem into one.

MTEB scores confirmed that BGE-M3 performed competitively on English and European-language benchmarks while maintaining C-MTEB parity [3][4].

The deployment split in production

By Q1 2026, Chinese enterprise RAG deployments typically use one of three embedding configurations:

Cloud API. DashScope text-embedding-v2 (Alibaba), Baidu Qianfan embedding endpoint, ByteDance embedding via Volcano Engine [5]. Cost-optimized for bulk indexing jobs where latency is not the binding constraint and data-residency requirements permit external API calls. Token pricing compressed through 2025; the economics favor cloud APIs for workloads below roughly 1 billion tokens per month, the point at which the GPU amortization curve inverts and in-house compute becomes cheaper.
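The crossover volume can be located with a simple comparison of linear API spend against a flat monthly GPU cost. The sketch below uses placeholder figures, not quoted vendor prices, and assumes a single GPU instance has enough throughput for the whole workload.

```python
def breakeven_million_tokens(gpu_monthly_cost, api_price_per_million):
    """Monthly volume (millions of tokens) where API spend equals the fixed GPU cost."""
    if api_price_per_million <= 0:
        raise ValueError("API price must be positive")
    return gpu_monthly_cost / api_price_per_million

def cheaper_option(monthly_million_tokens, gpu_monthly_cost, api_price_per_million):
    """Compare pay-per-token API spend with a flat self-hosting cost."""
    api_cost = monthly_million_tokens * api_price_per_million
    return "cloud_api" if api_cost < gpu_monthly_cost else "self_hosted"

# Placeholder figures in an arbitrary currency unit (assumptions for illustration).
GPU_MONTHLY_COST = 500.0        # amortized hardware, power, and operations
API_PRICE_PER_MILLION = 0.5     # price per one million embedded tokens

crossover = breakeven_million_tokens(GPU_MONTHLY_COST, API_PRICE_PER_MILLION)
small_workload = cheaper_option(200, GPU_MONTHLY_COST, API_PRICE_PER_MILLION)
large_workload = cheaper_option(5000, GPU_MONTHLY_COST, API_PRICE_PER_MILLION)
```

With these placeholder inputs the crossover sits at 1,000 million tokens per month. Real deployments would also need to account for GPU utilization, redundancy, and engineering time, all of which move the crossover upward.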

Self-hosted BGE-M3. Deployed via FastAPI or Ollama on a private GPU instance. Preferred when data-residency requirements prohibit sending document content to an external API, or when the query-volume economics favor in-house compute. The latency profile is better for sub-100ms retrieval SLAs because round-trip API overhead disappears.
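A self-hosted path can be sketched in a few functions. This example assumes the FlagEmbedding package and its BGEM3FlagModel class are installed on the GPU host; the import is deferred so the ranking helpers run without it. The brute-force top_k helper is a stand-in for a vector database and is only meant to show the data flow.

```python
import math

def load_bge_m3(model_name="BAAI/bge-m3"):
    """Load BGE-M3 via FlagEmbedding (assumes `pip install FlagEmbedding`)."""
    from FlagEmbedding import BGEM3FlagModel  # deferred: needs the package and weights
    return BGEM3FlagModel(model_name, use_fp16=True)

def embed(model, texts):
    """Return one dense vector per input text as plain Python lists."""
    output = model.encode(texts, return_dense=True)
    return [[float(x) for x in vec] for vec in output["dense_vecs"]]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=3):
    """Brute-force nearest neighbours: (doc_index, score) pairs, best first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

On a host with the model available, usage would look like `model = load_bge_m3()`, then `doc_vecs = embed(model, documents)` at index time and `top_k(embed(model, [query])[0], doc_vecs)` at query time. Nothing in that loop leaves the private network, which is the data-residency argument in code form.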

Fine-tuned BGE adapter. A LoRA or full fine-tune of BGE-large or BGE-M3 on domain-specific corpora [2]. Legal, medical, and financial teams are the primary users. The adapter checkpoint stays on the team's inference cluster; query routing uses the same FlagEmbedding inference interface as the base model, so the production path does not change between base and fine-tuned weights. This is where the stickiness lives: once a team has run a fine-tuning loop inside FlagEmbedding and validated retrieval precision improvement on their domain data, switching to a different embedding provider requires rerunning that loop on a different toolchain — not just swapping an API key.
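Much of the fine-tuning loop's cost is in preparing training data. As a hedged sketch, the helper below builds records with a query, positive passages, and hard-negative passages, and serializes them as JSON Lines. The "query" / "pos" / "neg" keys follow the schema described in the FlagEmbedding fine-tuning documentation as I understand it; verify against the version in use. The example texts are hypothetical.

```python
import json

def to_training_record(query, positives, negatives):
    """Build one contrastive training example: a query, matching and non-matching passages."""
    if not positives:
        raise ValueError("each query needs at least one positive passage")
    return {"query": query, "pos": list(positives), "neg": list(negatives)}

def to_jsonl(records):
    """Serialize records one JSON object per line, keeping Chinese text readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"

# Hypothetical legal-domain example.
record = to_training_record(
    query="合同违约责任",
    positives=["违约方应当承担赔偿责任。"],
    negatives=["合同自签署之日起生效。"],
)
jsonl_text = to_jsonl([record])
```

Because the curated pairs and mined hard negatives encode a team's domain knowledge, the dataset itself is portable, but the training scripts, hyperparameters, and evaluation harness around it are what tie the workflow to one toolchain.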

What the supply chain looks like end-to-end

From an engineer's perspective the stack resolves to: FlagEmbedding library → BGE-M3 base → optional domain fine-tune → vector DB (Milvus, Weaviate, or cloud-hosted equivalent). The embedding layer is invisible to the LLM that sits downstream; it is infrastructure in the same sense that a message queue is infrastructure — present in every production deployment, invisible in every product demo.

From a vendor perspective the dynamics are stickier than the open-weight framing suggests. Alibaba and Baidu both offer hosted embedding APIs with competitive per-token pricing [5]. But the enterprises most willing to pay for embedding-as-a-service are also the ones with the lowest tolerance for document content leaving their private network. That self-selection effect has kept BGE self-hosting at a higher share than the API pricing differential alone would predict.

The gap between the headline competition at the LLM layer and the supply-chain competition at the embedding layer is wide. BGE's open-weight advantage established in 2023 is not narrowing. FlagEmbedding has been updated through the M3 generation; C-MTEB evaluation infrastructure is actively maintained; and the fine-tuning pathway has become the primary enterprise retention mechanism. The weight that determines retrieval quality in Chinese-language production systems distributes from a GitHub repository maintained by a Beijing research institute — and that distribution pattern shows no sign of centralizing back toward managed APIs.

Sources

  1. Shitao Xiao et al., "C-Pack: Packaged Resources To Advance General Chinese Embedding." BAAI, 2023.
  2. FlagOpen, FlagEmbedding: Retrieval and Retrieval-Augmented LLMs. GitHub repository, 2023–2026.
  3. Jianlv Chen et al., "BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation." BAAI, 2024.
  4. MTEB Leaderboard. Hugging Face Spaces — evaluated results across BGE model family.
  5. Alibaba Cloud Model Studio, Text Embedding API Documentation.