China's embedding layer: how BGE and C-Pack turned open-weight retrieval into the invisible rail under domestic RAG deployments

Inference gets the headlines. Embeddings run the rails.

Every production RAG pipeline in China has the same upstream dependency: a model that converts text into a high-dimensional vector that a database can search. That step — embedding — determines retrieval precision before the LLM ever sees a retrieved document. And since mid-2023, the default answer to "what embedding model?" for Chinese-language retrieval has been BGE from BAAI, the Beijing Academy of Artificial Intelligence.
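To make the embedding step concrete, here is a minimal sketch of what "a vector that a database can search" means in practice. The three-dimensional vectors and document names below are invented for illustration; a real BGE model emits vectors with hundreds or thousands of dimensions, and a vector database would replace the linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors standing in for real embeddings (hypothetical values).
DOC_VECS = {
    "contract_law": [0.9, 0.1, 0.0],
    "tax_filing": [0.1, 0.9, 0.1],
    "hr_policy": [0.0, 0.2, 0.9],
}
QUERY_VEC = [0.8, 0.3, 0.0]

RESULTS = top_k(QUERY_VEC, DOC_VECS, k=2)
for doc_id, score in RESULTS:
    print(f"{doc_id}: {score:.3f}")
```

Whatever the LLM downstream does, it only ever sees the documents this ranking lets through.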

Understanding why requires two things: the C-MTEB benchmark that exposed the gap, and the C-Pack release package that filled it.

The benchmark that revealed the gap

General-purpose embedding models such as text-embedding-ada-002 and multilingual-e5-large were built to generalize across languages, while others, such as E5-large-v2, were trained primarily on English. At best they work tolerably for Chinese retrieval. They don't work as well as a model that trained primarily on Chinese-language corpora and was evaluated on Chinese-specific retrieval, classification, and clustering tasks.

C-MTEB (Chinese Massive Text Embedding Benchmark) established that gap with numbers [1]. Released as part of C-Pack in 2023, C-MTEB assembled six task categories (retrieval, reranking, semantic textual similarity, classification, pair classification, and clustering), all in Chinese. When multilingual models were benchmarked on those tasks, the performance delta was consistent and large enough to matter for production retrieval precision. A model like text-embedding-ada-002 could rank documents correctly at a macro level, but it struggled with the disambiguation tasks that separate relevant from adjacent content in Chinese legal or financial text.

BGE-large-zh, the initial release, and its v1.5 update outperformed every available multilingual alternative on C-MTEB retrieval tasks [1][4]. The gap was not marginal.
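For readers who want to see how a retrieval gap like this gets measured: MTEB-style retrieval tasks are commonly scored with nDCG@k, which rewards placing relevant documents near the top of the ranking. The sketch below is a plain-Python version of that metric, not the benchmark's own evaluation code, and the document IDs are made up.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k for one query.

    ranked_ids: document IDs in the order the model returned them.
    relevance:  dict mapping relevant document IDs to a gain (e.g. 1).
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

# Hypothetical query with two relevant documents.
RELEVANCE = {"a": 1, "b": 1}
print(ndcg_at_k(["a", "b", "c"], RELEVANCE))  # perfect ordering
print(ndcg_at_k(["c", "a", "b"], RELEVANCE))  # an irrelevant doc ranked first
```

A model that ranks "adjacent" content above the truly relevant passage loses score in exactly the way the second call shows.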

C-Pack: supply chain as a package

The C-Pack release was structured as a deployment package, not just a model weight drop [1]. It shipped three things together:

BGE model weights at multiple sizes (small, base, large) optimized for Chinese-language retrieval
FlagEmbedding, a Python library wrapping inference, fine-tuning workflows, and adapter management in a single interface [2]
C-MTEB benchmark definitions and evaluation code so teams could replicate published results or run private domain evaluations against their own corpora

That package structure mattered for adoption. An engineering team evaluating embedding options could run a full retrieval benchmark on their own data in under a day, compare against published results, and then fine-tune on domain text using the same toolchain. The supply chain — from open weight to deployed endpoint — lived inside one GitHub repository.
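A private domain evaluation of the kind described above might look roughly like the following. This is a sketch under assumptions: the harness and its names (`recall_at_k`, `load_bge_encoders`) are illustrative and not part of FlagEmbedding, and the loader assumes the `FlagModel` class with `encode_queries` and `encode` methods as shown in the project README. The import is deferred so the metric code runs without the library installed.

```python
from typing import Callable, Dict, List, Sequence, Set

Encoder = Callable[[Sequence[str]], List[List[float]]]

def load_bge_encoders(model_name: str = "BAAI/bge-large-zh-v1.5"):
    """Return (encode_queries, encode_passages) backed by FlagEmbedding.

    Assumes `pip install FlagEmbedding`; imported lazily on purpose.
    """
    from FlagEmbedding import FlagModel
    model = FlagModel(
        model_name,
        # Retrieval instruction documented for the Chinese BGE models.
        query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章：",
        use_fp16=True,
    )
    return model.encode_queries, model.encode

def recall_at_k(
    encode_queries: Encoder,
    encode_passages: Encoder,
    queries: Dict[str, str],
    passages: Dict[str, str],
    gold: Dict[str, Set[str]],
    k: int = 5,
) -> float:
    """Fraction of queries with at least one gold passage in the top k."""
    passage_ids = list(passages)
    passage_vecs = encode_passages([passages[pid] for pid in passage_ids])
    query_ids = list(queries)
    query_vecs = encode_queries([queries[qid] for qid in query_ids])
    hits = 0
    for qid, qvec in zip(query_ids, query_vecs):
        # Inner product; equals cosine similarity when vectors are normalized.
        scores = [
            (sum(x * y for x, y in zip(qvec, pvec)), pid)
            for pid, pvec in zip(passage_ids, passage_vecs)
        ]
        top = {pid for _, pid in sorted(scores, reverse=True)[:k]}
        if top & gold[qid]:
            hits += 1
    return hits / len(query_ids) if query_ids else 0.0

# Usage (requires the model download):
#   enc_q, enc_p = load_bge_encoders()
#   print(recall_at_k(enc_q, enc_p, queries, passages, gold, k=5))
```

Because the encoders are passed in as plain callables, the same harness can score a cloud API, the base model, and a fine-tuned checkpoint against one labeled set.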

BGE-M3: closing the multilingual routing decision

The single most consequential BGE release for production deployments outside pure Chinese-language contexts was BGE-M3, published in early 2024 [3]. M3 stands for multi-linguality, multi-functionality (dense, sparse, and multi-vector retrieval in one model), and multi-granularity (inputs ranging from single sentences to 8,192-token documents).
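The "multi-functionality" part is easier to grasp with the three scores side by side. The sketch below reimplements them in plain Python on toy data: a dense inner product, a sparse score over shared token weights, and a ColBERT-style max-similarity score over per-token vectors. The fusion weights are illustrative rather than recommended values, and the loader assumes FlagEmbedding's `BGEM3FlagModel.encode` with its `return_dense`, `return_sparse`, and `return_colbert_vecs` flags.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dense_score(q_vec, p_vec):
    """Single-vector similarity (inner product of normalized vectors)."""
    return dot(q_vec, p_vec)

def sparse_score(q_weights, p_weights):
    """Lexical match: sum of weight products over tokens both sides share."""
    return sum(w * p_weights[tok] for tok, w in q_weights.items() if tok in p_weights)

def multi_vector_score(q_vecs, p_vecs):
    """ColBERT-style late interaction: mean over query tokens of the best match."""
    if not q_vecs or not p_vecs:
        return 0.0
    return sum(max(dot(q, p) for p in p_vecs) for q in q_vecs) / len(q_vecs)

def hybrid_score(dense, sparse, multi, weights=(0.4, 0.2, 0.4)):
    """Weighted average of the three modes (weights are an assumption)."""
    w_d, w_s, w_m = weights
    return (w_d * dense + w_s * sparse + w_m * multi) / (w_d + w_s + w_m)

def encode_with_bge_m3(texts):
    """Lazy loader; requires FlagEmbedding and a model download.

    Returns a dict with 'dense_vecs', 'lexical_weights', and 'colbert_vecs'.
    """
    from FlagEmbedding import BGEM3FlagModel
    model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
    return model.encode(
        texts, return_dense=True, return_sparse=True, return_colbert_vecs=True
    )

# Toy inputs standing in for real model outputs (hypothetical values).
D = dense_score([1.0, 0.0], [1.0, 0.0])
S = sparse_score({"a": 0.5, "b": 0.2}, {"a": 0.4})
M = multi_vector_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
print(D, S, M, hybrid_score(D, S, M))
```

In a deployment, the dense vector usually drives first-stage recall from the vector index, with the sparse and multi-vector scores applied as a rerank over the candidates.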

The engineering impact: cross-language retrieval stacks that previously required a Chinese model for zh content and a separate model for en, ja, or ko content could collapse to a single BGE-M3 endpoint. For enterprises with mixed-language knowledge bases — common in financial services and manufacturing contexts where product documentation arrives in Japanese or English but queries arrive in Chinese — BGE-M3 resolved a two-model routing problem into one.

MTEB scores confirmed that BGE-M3 performed competitively on English and European-language benchmarks while maintaining C-MTEB parity [3][4].

The deployment split in production

By Q1 2026, Chinese enterprise RAG deployments typically use one of three embedding configurations:

Cloud API. DashScope text-embedding-v2 (Alibaba), Baidu Qianfan embedding endpoint, ByteDance embedding via Volcano Engine [5]. Cost-optimized for high-volume indexing jobs where latency is not the binding constraint and data-residency requirements permit external API calls. Token pricing has compressed through 2025; the economics favor cloud APIs for workloads above roughly 1 billion tokens per month where the GPU amortization curve inverts.

Self-hosted BGE-M3. Deployed via FastAPI or Ollama on a private GPU instance. Preferred when data-residency requirements prohibit sending document content to an external API, or when the query-volume economics favor in-house compute. The latency profile is better for sub-100ms retrieval SLAs because round-trip API overhead disappears.
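A self-hosted endpoint of this kind can be quite small. The following is a sketch, not a production service: the `/embed` route, the request shape, and the batch cap are hypothetical choices, and the encoder is injected so that the request handling can be exercised without a GPU or a model download. FastAPI and pydantic are imported only when the app is actually built.

```python
from typing import Callable, Dict, List, Sequence

Encoder = Callable[[Sequence[str]], Sequence[Sequence[float]]]

MAX_BATCH = 64  # hypothetical cap to keep per-request latency bounded

def handle_embed(texts: List[str], encoder: Encoder, max_batch: int = MAX_BATCH) -> Dict:
    """Validate a batch, run the encoder, and return JSON-serializable output."""
    if not texts:
        raise ValueError("texts must not be empty")
    if len(texts) > max_batch:
        raise ValueError(f"batch of {len(texts)} exceeds limit {max_batch}")
    vectors = encoder(texts)
    # Convert to plain floats so numpy values serialize cleanly.
    embeddings = [[float(x) for x in vec] for vec in vectors]
    return {"dim": len(embeddings[0]), "embeddings": embeddings}

def create_app(encoder: Encoder):
    """Build a FastAPI app around an encoder (requires fastapi and pydantic)."""
    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel

    class EmbedRequest(BaseModel):
        texts: List[str]

    app = FastAPI()

    @app.post("/embed")
    def embed(req: EmbedRequest):
        try:
            return handle_embed(req.texts, encoder)
        except ValueError as exc:
            raise HTTPException(status_code=422, detail=str(exc))

    return app

# Usage sketch (module name and loader are hypothetical):
#   app = create_app(my_bge_m3_encoder)
#   uvicorn my_module:app --host 0.0.0.0 --port 8000
```

Loading the model once at startup and passing its encode function into `create_app` keeps weights resident in GPU memory between requests, which is where the latency advantage over an external API comes from.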

Fine-tuned BGE adapter. A LoRA or full fine-tune of BGE-large or BGE-M3 on domain-specific corpora [2]. Legal, medical, and financial teams are the primary users. The adapter checkpoint stays on the team's inference cluster; query routing uses the same FlagEmbedding inference interface as the base model, so the production path does not change between base and fine-tuned weights. This is where the stickiness lives: once a team has run a fine-tuning loop inside FlagEmbedding and validated retrieval precision improvement on their domain data, switching to a different embedding provider requires rerunning that loop on a different toolchain — not just swapping an API key.
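Most of the work in that fine-tuning loop is data preparation. FlagEmbedding's fine-tuning examples consume JSON Lines where each row holds a query, a list of positive passages, and a list of negatives. The sketch below builds rows in that shape from labeled pairs; the helper names are illustrative, and sampling negatives at random is a simplification (mined hard negatives generally train better).

```python
import json
import random
from typing import Dict, List, Sequence, Tuple

def build_training_rows(
    pairs: Sequence[Tuple[str, str]],
    corpus: Sequence[str],
    negatives_per_query: int = 2,
    seed: int = 0,
) -> List[Dict]:
    """Turn (query, positive_passage) pairs into query/pos/neg rows.

    Negatives are drawn at random from the rest of the corpus.
    """
    rng = random.Random(seed)
    rows = []
    for query, positive in pairs:
        candidates = [p for p in corpus if p != positive]
        count = min(negatives_per_query, len(candidates))
        rows.append({
            "query": query,
            "pos": [positive],
            "neg": rng.sample(candidates, count),
        })
    return rows

def to_jsonl(rows: Sequence[Dict]) -> str:
    """Serialize rows as JSON Lines, keeping Chinese text readable."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)

# Hypothetical legal-domain sample.
CORPUS = ["违约方应承担赔偿责任", "增值税按月申报", "员工年假不少于五天"]
PAIRS = [("合同违约如何赔偿", "违约方应承担赔偿责任")]
ROWS = build_training_rows(PAIRS, CORPUS)
print(to_jsonl(ROWS))
```

The resulting file is the artifact a team would have to rebuild, re-mine, and re-validate on another toolchain, which is the switching cost described above.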

What the supply chain looks like end-to-end

From an engineer's perspective the stack resolves to: FlagEmbedding library → BGE-M3 base → optional domain fine-tune → vector DB (Milvus, Weaviate, or cloud-hosted equivalent). The embedding layer is invisible to the LLM that sits downstream; it is infrastructure in the same sense that a message queue is infrastructure — present in every production deployment, invisible in every product demo.

From a vendor perspective the dynamics are stickier than the open-weight framing suggests. Alibaba and Baidu both offer hosted embedding APIs with competitive per-token pricing [5]. But the enterprises most willing to pay for embedding-as-a-service are also the ones with the lowest tolerance for document content leaving their private network. That self-selection effect has kept BGE self-hosting at a higher share than the API pricing differential alone would predict.

The gap between the headline competition at the LLM layer and the supply-chain competition at the embedding layer is wide. BGE's open-weight advantage established in 2023 is not narrowing. FlagEmbedding has been updated through the M3 generation; C-MTEB evaluation infrastructure is actively maintained; and the fine-tuning pathway has become the primary enterprise retention mechanism. The weights that determine retrieval quality in Chinese-language production systems are distributed from a GitHub repository maintained by a Beijing research institute, and that distribution pattern shows no sign of centralizing back toward managed APIs.

cronfeed.work