Jaeger v2 is a tracing backend, not just a place traces go

Trace reliability is ultimately decided by collector capacity, storage pressure, buffering, and query load inside production infrastructure, not by the trace timeline view alone.[7]

Jaeger is easy to misread because the most visible part of the product is the trace UI. A developer searches for a slow request, opens a waterfall, sees which service waited, and leaves with a useful answer. That workflow is real, but it hides the architecture that decides whether the answer is there in the first place.

As of 2026-06-02T05:01:57Z, Jaeger's latest GitHub release is v2.18.0, published on 2026-05-13, and the project repository shows 22,852 stars, 2,920 forks, 402 open issues, and a push timestamp of 2026-06-02T01:46:55Z.[3][4] Those numbers are not the architecture. They are a freshness check. The important change for operators is that Jaeger v2 is designed around a flexible binary built on OpenTelemetry Collector components, with roles that can be composed as collector, query, ingester, all-in-one, or agent-like placement.[1][3]

That means the adoption question should not be "Do we want Jaeger for tracing?" A better question is: where do we want write traffic, read traffic, buffering, sampling control, and storage responsibility to sit?

The first boundary is collector versus query

Jaeger's architecture documentation names the main roles plainly. A collector receives trace data and writes it into storage. A query service serves APIs and the UI for retrieving and visualizing traces. An ingester reads spans from Kafka and writes them to storage. An all-in-one mode combines collector and query in a single process.[1] That list looks simple, but it is the first architecture decision.

All-in-one mode is useful for local development, demos, and small proof-of-concept environments. Jaeger's docs are explicit that all-in-one with in-memory storage is not production-safe because data disappears on restart; Badger-backed all-in-one can work for modest volumes but stays single-instance and cannot scale horizontally.[1] Treating that mode as a production shortcut is the fastest way to confuse a successful demo with a reliable tracing platform.

The split collector/query shape is more important because trace systems have two very different load profiles. Write traffic arrives from applications and collectors, often in bursts. Read traffic arrives from humans, dashboards, incident response, API consumers, and sometimes automated analysis. If those paths share too much fate, the system can fail in frustrating ways: a storage spike makes ingestion drop data, or a heavy query window makes the UI slow exactly when an incident is unfolding.

Jaeger's docs state the advantage of collector/query separation directly: when using an external storage backend, both the all-in-one and separated configurations can scale horizontally, but collector/query separation lets teams scale read and write traffic independently and apply different access or security policies.[1] That is the architectural center of the piece. The trace UI is not the product boundary. The boundary is whether trace writes and trace reads can be operated as separate surfaces.
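
To make independent scaling concrete, here is a minimal Python sketch that sizes the write path and the read path from separate load figures. The function name `replicas_needed`, the 30% headroom default, and every load and capacity number are assumptions chosen for illustration; none of them come from Jaeger's documentation.

```python
import math


def replicas_needed(peak_load: float, per_replica_capacity: float,
                    headroom: float = 0.3) -> int:
    """Replicas required to serve peak_load with a fractional safety margin.

    Hypothetical helper: peak_load and per_replica_capacity must use the
    same unit (spans/s for collectors, queries/s for the query service).
    """
    if per_replica_capacity <= 0:
        raise ValueError("per_replica_capacity must be positive")
    needed = math.ceil(peak_load * (1 + headroom) / per_replica_capacity)
    return max(1, needed)  # always keep at least one replica running


if __name__ == "__main__":
    # Illustrative figures only: the write path is sized in spans per second,
    # the read path in queries per second, and neither depends on the other.
    collectors = replicas_needed(peak_load=60_000, per_replica_capacity=10_000)
    query_nodes = replicas_needed(peak_load=40, per_replica_capacity=25)
    print(f"collector replicas: {collectors}, query replicas: {query_nodes}")
```

The two calls share no inputs, which is the point: a rise in incident-time query load changes only the second number, and a rise in span volume changes only the first.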

Direct-to-storage works until storage becomes the throttle

The second decision is whether collectors write directly to storage or pass through Kafka. In direct-to-storage mode, collectors receive spans and write them straight into the backend. Jaeger notes that storage must handle average and peak traffic, and while collectors may use an in-memory queue to smooth short-term peaks, sustained storage lag can still lead to dropped data.[1]
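
The difference between a short peak and sustained lag can be sketched with a toy simulation of a bounded in-memory queue in front of storage. This is a simplified model under stated assumptions, not Jaeger's actual queue implementation: `simulate_queue`, the one-second time step, and all rates and capacities are hypothetical.

```python
def simulate_queue(arrivals, write_rate, queue_capacity):
    """Simulate a bounded queue between a collector and storage.

    arrivals       -- spans arriving in each one-second step
    write_rate     -- spans storage can absorb per step
    queue_capacity -- spans the queue can hold at the end of a step
    Returns (dropped_total, final_queue_depth).
    """
    queued = 0
    dropped = 0
    for arriving in arrivals:
        queued += arriving
        queued -= min(queued, write_rate)  # storage drains what it can
        if queued > queue_capacity:        # overflow is lost, not delayed
            dropped += queued - queue_capacity
            queued = queue_capacity
    return dropped, queued


if __name__ == "__main__":
    # A three-second burst at twice the storage write rate, then normal load.
    burst = [1000] * 5 + [3000] * 3 + [1000] * 10
    # The same elevated rate held for ten seconds.
    sustained = [3000] * 10
    print("short burst:", simulate_queue(burst, 1500, 5000))
    print("sustained:  ", simulate_queue(sustained, 1500, 5000))
```

In this model the short burst is absorbed and drained with no loss, while the sustained case fills the queue and then drops every span above the storage write rate.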

That makes direct-to-storage attractive when the workload is predictable, retention is bounded, and the storage backend has enough headroom. It is also the cleaner first production shape for many teams: fewer moving parts, fewer queues to operate, and less ambiguity about where data is delayed.

The failure mode is also clean. If storage cannot keep up, ingestion suffers. This is why tracing cannot be sized only by service count. You need to know span volume, peak request shape, cardinality in service and operation names, retention target, indexing expectations, and incident query behavior. A service estate that emits a modest number of well-shaped spans can be easier to operate than a smaller estate that traces too much internal detail and turns every request into a large graph.
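
A back-of-envelope calculation shows why service count is a poor sizing input. The sketch below is an assumption-laden estimate, not a Jaeger sizing formula: the function names, the 50% index overhead, the 1 KiB average span size, and both example estates are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def stored_spans_per_sec(requests_per_sec, spans_per_request, sample_rate):
    """Spans per second that reach storage after sampling."""
    return requests_per_sec * spans_per_request * sample_rate


def retained_gib(spans_per_sec, avg_span_bytes, retention_days,
                 index_overhead=0.5):
    """Approximate stored volume in GiB over the retention window.

    index_overhead is an assumed multiplier for indexes and replication
    bookkeeping; measure it on the real backend before trusting it.
    """
    raw_bytes = spans_per_sec * avg_span_bytes * SECONDS_PER_DAY * retention_days
    return raw_bytes * (1 + index_overhead) / 2**30


if __name__ == "__main__":
    # Large estate, shallow traces: many requests, few spans per request.
    broad = stored_spans_per_sec(2000, 8, 0.25)
    # Smaller estate, deep traces: fewer requests, many spans per request.
    deep = stored_spans_per_sec(500, 60, 0.25)
    for label, rate in (("broad", broad), ("deep", deep)):
        gib = retained_gib(rate, avg_span_bytes=1024, retention_days=7)
        print(f"{label}: {rate:.0f} spans/s, about {gib:.0f} GiB over 7 days")
```

With these inputs the smaller estate stores more spans per second than the larger one, because spans per request dominates the product.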

Jaeger's storage page sharpens the point. The project requires persistent storage for durable production use, names Cassandra, Elasticsearch, and OpenSearch as primary supported distributed backends, and says that for large-scale production the Jaeger team recommends OpenSearch over Cassandra.[2] It also supports a gRPC Remote Storage API v2 for custom storage backends, plus memory and Badger for narrower cases.[2] Storage is therefore not a hidden implementation detail. It is the main operating contract.

Kafka is a buffer, not a magic reliability switch

Kafka enters the architecture when teams need to decouple collection from storage. In Jaeger's via-Kafka deployment, collectors publish spans to Kafka, and ingesters consume from Kafka and write to storage. The docs describe this as a way to prevent data loss between collectors and storage, with multiple ingesters able to partition ingestion load across them.[1]

That shape is powerful because it changes the failure window. A storage slowdown does not immediately force collectors to drop the same amount of data, provided Kafka capacity, retention, and ingester recovery are sized honestly. It also gives platform teams a clearer way to absorb bursty trace production from large deployments.
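
What "sized honestly" means can be sketched as two numbers: how long a storage outage the Kafka buffer can absorb, and how long ingesters need to catch up afterwards. The helpers below are hypothetical and assume a size-bound retention budget and steady rates; real topics may also be bounded by time-based retention.

```python
import math


def outage_headroom_seconds(retention_bytes, produce_bytes_per_sec):
    """Seconds of span traffic the buffer can hold before old data expires."""
    if produce_bytes_per_sec <= 0:
        return math.inf
    return retention_bytes / produce_bytes_per_sec


def drain_time_seconds(backlog_bytes, ingester_bytes_per_sec,
                       produce_bytes_per_sec):
    """Seconds for ingesters to clear a backlog while new spans keep arriving.

    Returns infinity when ingesters are no faster than producers, because
    the backlog never shrinks in that case.
    """
    surplus = ingester_bytes_per_sec - produce_bytes_per_sec
    if surplus <= 0:
        return math.inf
    return backlog_bytes / surplus


if __name__ == "__main__":
    MB = 1_000_000
    produce = 10 * MB            # assumed steady span traffic into Kafka
    backlog = 600 * produce      # backlog left by a ten-minute storage outage
    print("headroom (s):", outage_headroom_seconds(500_000 * MB, produce))
    print("drain (s):   ", drain_time_seconds(backlog, 15 * MB, produce))
```

The second function captures the part that is easy to miss: a buffer only helps if ingesters and storage have spare throughput above the normal produce rate, otherwise lag grows without bound.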

But Kafka also changes who owns the tracing platform. Once Kafka sits in the middle, the tracing team also owns queue retention, consumer lag, partitioning, disk pressure, and replay semantics. If the team already operates Kafka well, that may be the right trade. If the team barely has storage ownership for Jaeger itself, Kafka can turn a trace backend into another distributed system whose incidents are harder to explain.

The practical rule is simple: use Kafka when trace data is important enough, bursty enough, and high-volume enough to justify queue ownership. Do not use it merely because "production architecture" sounds more serious with a buffer in the diagram.

OpenTelemetry changes placement, not the need for boundaries

Jaeger v2's OpenTelemetry alignment is the other major architecture signal. The Jaeger binary is built on the OpenTelemetry Collector framework and includes upstream components such as OTLP receiver, batch and attribute processors, contrib components such as Kafka receiver/exporter and tail sampling processor, and Jaeger-specific components such as the storage exporter and query extension.[1]

This matters because many teams already run OpenTelemetry Collectors for metrics, logs, enrichment, routing, or vendor fan-out. Jaeger's docs say you do not need a separate OpenTelemetry Collector to operate Jaeger, because Jaeger itself is a customized distribution of the collector with different roles. But if OpenTelemetry Collectors are already part of the telemetry estate, they can sit in front of Jaeger as sidecars, host agents, daemonsets, or remote service clusters.[1]

That gives teams flexibility, but it also creates a design trap. OpenTelemetry Collector placement should not become a vague "more collectors equals more observability" pattern. Sidecar or host-agent placement can simplify SDK configuration and distribute enrichment work near applications. Remote collector clusters can help with sharding and tail-based sampling. Each placement also adds an extra marshaling and unmarshaling layer.[1]

The point is not to avoid OpenTelemetry Collectors. The point is to make each collector layer earn its keep. A local collector should own local enrichment, local buffering, or endpoint simplification. A remote collector should own routing, sampling, fan-out, or policy. If neither is true, the extra layer may be operational ceremony.

Sampling is an architecture decision

The Dapper paper is still useful here because it explains why distributed tracing had to be low overhead, transparent enough for broad deployment, and useful to developers and operators across large systems.[6] Dapper's authors emphasized sampling and instrumentation in common libraries as key design choices, not minor cost controls.[6] Jaeger inherits that lesson in open-source form: the trace backend becomes useful only if enough traces are collected to explain behavior, but not so many that storage and UI surfaces are overwhelmed.

Jaeger's architecture page notes that the OpenTelemetry Collector can support Jaeger's remote sampling protocol and either serve static sampling configurations or proxy requests to the Jaeger backend, including adaptive sampling cases.[1] In practice, that means sampling belongs in architecture review. It is not an afterthought for someone to tune after the bill arrives.

For a platform team, sampling policy should answer three questions before rollout. Which endpoints are always worth seeing because errors or compliance context matter? Which flows can be sampled probabilistically because the aggregate shape is enough? Which high-volume or low-value spans should never be allowed to dominate storage? If those questions are unresolved, Jaeger may still work mechanically, but the trace corpus will be noisy, expensive, or misleading.
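
The three questions map onto three tiers: always keep, keep a fraction, never keep. The following Python sketch illustrates that shape with a deterministic hash of the trace ID, so every span in a trace gets the same decision. It is not Jaeger's sampling API or configuration format; the `POLICY` table, the endpoint names, the rates, and `keep_trace` are all hypothetical.

```python
import hashlib

# Hypothetical policy table: endpoint -> fraction of traces to keep.
POLICY = {
    "POST /payments": 1.0,   # always worth seeing (errors, compliance)
    "GET /catalog": 0.05,    # aggregate shape is enough
    "GET /healthz": 0.0,     # must never dominate storage
}
DEFAULT_RATE = 0.01          # assumed fallback for unlisted endpoints


def keep_trace(trace_id: str, endpoint: str,
               policy=POLICY, default_rate=DEFAULT_RATE) -> bool:
    """Decide whether to keep a trace, deterministically per trace ID."""
    rate = policy.get(endpoint, default_rate)
    if rate >= 1.0:
        return True
    if rate <= 0.0:
        return False
    # Map the trace ID to a stable value in [0, 1) and compare to the rate.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


if __name__ == "__main__":
    ids = [f"trace-{n}" for n in range(10_000)]
    kept = sum(keep_trace(t, "GET /catalog") for t in ids)
    print(f"kept {kept} of {len(ids)} catalog traces")
```

Writing the policy down as data, even in a sketch like this, forces the rollout conversation to name which endpoints sit in which tier before volume makes the decision instead.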

Where Jaeger fits best

Jaeger is strongest when a team wants an open tracing backend that can sit naturally inside an OpenTelemetry-oriented estate, keep trace storage under its own control, and scale read/write concerns separately as production pressure grows.[1][2][5] It is especially credible when the team already understands that tracing is not logging with prettier screens. Traces model causal paths across services; the value is in propagation, sampling, storage, and query shape together.

The adoption boundary is equally clear. Jaeger is a weak fit if the team cannot own the storage backend, cannot decide sampling policy, or expects the UI to compensate for undisciplined instrumentation. It is also a weak fit if the organization wants a fully managed vendor experience and has no appetite for operating collectors, storage, retention, upgrades, and query access.

The CNCF context helps, but it should not be overread. Jaeger was accepted into CNCF in 2017 and graduated in 2019, and the CNCF project page currently describes it as a distributed tracing platform with a healthy project score and thousands of contributors across many organizations.[5] That is a maturity signal. It is not a replacement for architecture work.

The cleanest Jaeger rollout starts small but draws production boundaries early: collector/query separation where read and write traffic need different scaling, direct-to-storage while storage headroom is known, Kafka only when buffering is operationally justified, OpenTelemetry Collector layers with explicit jobs, and sampling policy treated as a platform contract. If those pieces are named, Jaeger becomes more than the place traces go. It becomes a tracing backend whose failure modes are visible before the incident.

cronfeed.work