Ceph's real trick is failure arithmetic: an architecture note on CRUSH, placement groups, and the BlueStore recovery budget

This rack photograph fits the article because Ceph only becomes understandable when storage is treated as a physical hierarchy of hosts, racks, and failure domains rather than as one abstract pool of disks.[4][9]

Ceph is often described as unified storage: one system that can serve objects, blocks, and files at very large scale.[1] That description is correct, but it is still downstream of the design choice that matters most. Ceph's real trick is that it turns failure handling into arithmetic. Instead of keeping object locations behind a central metadata broker, it teaches clients and OSDs to compute placement from the same cluster maps and CRUSH rules, then lets recovery happen at coarser, controllable units rather than at the level of every individual object.[1][4][6][7]

That is the architecture note worth carrying into 2026. Ceph is not hard merely because distributed storage is hard. It is hard in a very specific way: the system pushes topology, placement policy, and recovery cost into places operators can tune. If you understand CRUSH, placement groups, and BlueStore together, Ceph stops looking like a giant storage brand and starts reading like a machine for bounding what happens when a disk, a host, or a rack disappears.[1][2][3][4][5]

As of 2026-04-06T22:04:09Z, the GitHub API reports 16,422 stars, 6,349 forks, 1,168 open issues, and a most recent push at 2026-04-06T20:43:13Z for ceph/ceph.[8] Those numbers do not prove architectural quality, but they do tell us Ceph is still a living system rather than a frozen paper design.

Image context: the cover uses a real server-rack photograph, not a vendor diagram. That choice matters because Ceph's placement logic only becomes concrete when you can picture hardware grouped into hosts, racks, power domains, and network neighborhoods. CRUSH is valuable precisely because these physical boundaries are real.[4][9]

1. CRUSH removes the most dangerous central lookup path

The official architecture docs are blunt about the old problem shape. Traditional storage systems often force clients through a centralized gateway, broker, or facade, which becomes both a single point of failure and a scalability ceiling.[1] Ceph's answer is to remove that placement broker. Clients ask monitors for the current cluster maps, but clients and OSDs then use the CRUSH algorithm to compute where data belongs instead of querying a central placement table on every read or write.[1][4]

That detail is the core of Ceph's "failure arithmetic." The CRUSH map does not only list OSDs. It also records the hierarchy that matters when failures correlate: hosts, racks, rows, data centers, and device classes such as hdd, ssd, or nvme.[4] The rules then describe how replicas or erasure-coded shards should spread across that hierarchy.[4][7] If an operator says, in effect, "three replicas in different hosts" or "these shards must span racks," the placement logic stays local and algorithmic instead of becoming an ever-growing lookup service.[4][7]

This is why Ceph's topology model is not cosmetic metadata. It is the thing that lets the cluster preserve availability while hardware fails in clumps rather than one disk at a time. A rack is not just a label. It is a statement about shared power, shared switching, and shared bad days.[4]
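The idea of computing placement from a shared topology can be sketched in a few lines. This is not the CRUSH algorithm: it swaps in simple highest-score (rendezvous-style) hashing as a stand-in, and the topology, the `score` helper, and the `place` function are all hypothetical names invented for illustration. What it does show is the property the section describes: anyone holding the same map computes the same answer, and replicas never share a rack.

```python
import hashlib

# Hypothetical toy topology (rack -> host -> OSD ids); not a real CRUSH map.
TOPOLOGY = {
    "rack-a": {"host-1": [0, 1], "host-2": [2, 3]},
    "rack-b": {"host-3": [4, 5], "host-4": [6, 7]},
    "rack-c": {"host-5": [8, 9], "host-6": [10, 11]},
}


def score(key: str, bucket: str) -> int:
    """Deterministic pseudo-random score for a (key, bucket) pair."""
    digest = hashlib.sha256(f"{key}/{bucket}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(pg_id: str, replicas: int = 3) -> list[int]:
    """Pick one OSD in each of `replicas` distinct racks, by highest score.

    Assumption: this stands in for a rule like "replicas in different racks".
    Real CRUSH uses its own bucket algorithms, weights, and retry logic.
    """
    if replicas > len(TOPOLOGY):
        raise ValueError("not enough racks for the requested replica count")
    racks = sorted(TOPOLOGY, key=lambda rack: score(pg_id, rack), reverse=True)
    chosen = []
    for rack in racks[:replicas]:
        host = max(TOPOLOGY[rack], key=lambda h: score(pg_id, h))
        osd = max(TOPOLOGY[rack][host], key=lambda o: score(pg_id, f"osd.{o}"))
        chosen.append(osd)
    return chosen


# Any client or OSD with the same TOPOLOGY computes the same list.
print(place("1.2a"))
```

No lookup table appears anywhere in the sketch. The only shared state is the map itself, which is the design choice the architecture docs emphasize.[1][4]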

2. Placement groups are the unit that keeps movement legible

CRUSH alone would not be enough if Ceph had to manage placement one object at a time. The data-placement docs explain the second move: objects are stored in pools, each object is hashed into one of the pool's placement groups, and only then is the placement group mapped onto a set of OSDs.[2][3] A placement group is a subset of a logical pool, and Ceph manages data internally at PG granularity because this scales better than tracking every RADOS object independently.[2][3]
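The object-to-PG step is easy to picture as a hash followed by a modulo. The sketch below is a simplification under stated assumptions: `object_to_pg` is a hypothetical helper, and CRC32 with a plain modulo stands in for Ceph's own hashing scheme, which differs. The point is only that an unbounded number of objects collapses into a fixed, small number of PGs.

```python
import zlib


def object_to_pg(pool_id: int, object_name: str, pg_num: int) -> str:
    """Map an object name to a PG id written as '<pool>.<hex index>'.

    Assumption: CRC32 and plain modulo are stand-ins for Ceph's real hash.
    """
    index = zlib.crc32(object_name.encode()) % pg_num
    return f"{pool_id}.{index:x}"


# Ten thousand objects in pool 1 land in at most 64 placement groups.
pgs = {object_to_pg(1, f"obj-{i}", 64) for i in range(10_000)}
print(len(pgs))
```

Everything downstream (peering, balancing, recovery) then operates on those PG ids rather than on the ten thousand object names.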

That design decision is more important than it sounds. Placement groups are the unit at which balancing, peering, and recovery become manageable. When a host drops out or new storage arrives, Ceph does not have to rethink the universe object by object. It moves and reconciles PGs. That is why the docs keep talking about PG count as a balance between two costs: too few PGs and data concentrates on too small a slice of the cluster; too many PGs and the system pays too much in peering traffic, metadata overhead, and RAM.[2][3][10]
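A rough way to see why PG-level movement stays bounded is to remove one OSD from a toy cluster and count how many PGs have to change their OSD set. The sketch below again uses highest-score hashing as a stand-in for CRUSH, and `acting_set` and `pgs_affected` are hypothetical helpers, so treat the numbers as illustrative rather than as a model of a real cluster.

```python
import hashlib


def score(pg: int, osd: int) -> int:
    """Deterministic pseudo-random score for a (PG, OSD) pair."""
    digest = hashlib.sha256(f"{pg}:{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def acting_set(pg: int, osds: list[int], size: int = 3) -> list[int]:
    """Top-`size` OSDs for a PG by score (a stand-in for CRUSH, not CRUSH)."""
    return sorted(osds, key=lambda o: score(pg, o), reverse=True)[:size]


def pgs_affected(pg_num: int, osds: list[int], failed: int, size: int = 3) -> list[int]:
    """PGs whose OSD set changes when `failed` leaves the cluster."""
    survivors = [o for o in osds if o != failed]
    return [
        pg for pg in range(pg_num)
        if acting_set(pg, osds, size) != acting_set(pg, survivors, size)
    ]


osds = list(range(12))
affected = pgs_affected(pg_num=128, osds=osds, failed=5)
# Only PGs that actually had a replica on the failed OSD need to move.
print(f"{len(affected)} of 128 PGs remap")
```

With 12 OSDs and three replicas, roughly a quarter of the PGs would be expected to touch any given OSD, so only about that share needs recovery work. The rest of the cluster is left alone, which is the sense in which PGs keep movement legible.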

The current guidance is concrete enough to matter. The placement-group docs say mon_target_pg_per_osd defaults to 100, recommend about 200 for all but the smallest deployments, and warn that values above 500 may create excessive peering traffic and memory usage.[3] That is exactly the kind of engineering boundary good infrastructure software should publish. It tells operators that Ceph's scalability is not magic. It is a set of tradeoffs with operating ranges.

The newer PG autoscaler makes the same philosophy clearer rather than hiding it.[3][10] Even when the system proposes or applies pg_num changes automatically, it is still reasoning about pool size, replica or erasure-coding rate, CRUSH subtrees, and the target PG budget per OSD.[3][10] Inference from the docs: Ceph is willing to automate the math, but it never stops being a system whose behavior is shaped by topology and movement cost.
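The PG budget per OSD can be worked through as plain arithmetic. The sketch below assumes a single replicated pool and uses the familiar rule of thumb of OSD count times target, divided by replica count, rounded to a power of two. The function names are hypothetical, and the real autoscaler weighs more inputs (multiple pools, expected capacity share, CRUSH subtrees), so this is a back-of-envelope aid rather than a reimplementation.

```python
import math


def suggested_pg_num(osd_count: int, replica_size: int, target_pg_per_osd: int = 100) -> int:
    """Rough PG count for one pool, rounded to the nearest power of two."""
    raw = max(osd_count * target_pg_per_osd / replica_size, 1)
    lower = 2 ** math.floor(math.log2(raw))
    upper = lower * 2
    return lower if raw - lower < upper - raw else upper


def pg_replicas_per_osd(pg_num: int, replica_size: int, osd_count: int) -> float:
    """Average number of PG replicas each OSD ends up carrying."""
    return pg_num * replica_size / osd_count


# 60 OSDs, 3 replicas, default target of 100 PGs per OSD.
pg_num = suggested_pg_num(60, 3)
print(pg_num, pg_replicas_per_osd(pg_num, 3, 60))  # 2048 102.4

# Same cluster with the recommended target of about 200.
pg_num_200 = suggested_pg_num(60, 3, target_pg_per_osd=200)
print(pg_num_200, pg_replicas_per_osd(pg_num_200, 3, 60))  # 4096 204.8
```

Both results sit well under the 500-per-OSD level the docs warn about, which is the kind of check this arithmetic is good for.[3]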

3. BlueStore keeps each OSD close to the device it must recover

The third part of the story lives under each OSD. Ceph's architecture page notes that BlueStore is now the default backend and stores objects in a monolithic, database-like fashion.[1] The BlueStore configuration reference sharpens that into a more useful mental model: BlueStore writes directly to raw devices rather than creating and mounting a conventional filesystem first.[5]

That choice matters because recovery behavior starts at the OSD boundary. A BlueStore OSD has a primary block device and can optionally split out block.db and block.wal onto faster media.[5] If only a small amount of fast storage is available, the docs say to use it as a WAL device; if more is available, it is better used for block.db, because the WAL will colocate there anyway while metadata also benefits.[5] For mixed HDD-plus-SSD layouts, the docs recommend putting block.db on the faster device, and they even give sizing guidance: typically 1% to 4% of the size of the main block device, no smaller than 4% for RGW-heavy metadata workloads, and often 1% to 2% for RBD-heavy workloads.[5]

This is not an implementation footnote. It is part of Ceph's recovery budget. BlueStore keeps the OSD closer to raw-device reality so operators can decide where metadata, write-ahead behavior, and cache pressure belong. The same page notes that BlueStore cache autotuning is enabled by default and tries to keep OSD heap usage under osd_memory_target, with floor and ratio controls available when the default budget is wrong for a workload.[5] In other words, Ceph does not pretend that all disks are interchangeable or that storage engines should hide their hot paths completely. It exposes enough of the path that operators can match layout to workload.
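The block.db sizing guidance quoted above reduces to simple percentages of the main block device. The helper below is a hypothetical convenience, not a Ceph tool: the workload labels are invented for illustration, and the percentages are the ones cited from the BlueStore configuration reference.[5]

```python
# Percent of the main block device, per the sizing guidance cited above.
# Assumption: these labels are illustrative, not Ceph configuration values.
DB_PERCENT = {
    "general": (1, 4),  # typical range
    "rbd": (1, 2),      # RBD-heavy workloads
    "rgw": (4, 4),      # RGW-heavy metadata workloads: about 4%
}


def block_db_size_range_gib(block_size_gib: float, workload: str = "general") -> tuple[float, float]:
    """Return a (low, high) block.db size estimate in GiB."""
    low_pct, high_pct = DB_PERCENT[workload]
    return (block_size_gib * low_pct / 100, block_size_gib * high_pct / 100)


# A 4000 GiB HDD-backed OSD serving mostly RBD.
print(block_db_size_range_gib(4000, "rbd"))  # (40.0, 80.0)
# The same device behind an RGW-heavy pool.
print(block_db_size_range_gib(4000, "rgw"))  # (160.0, 160.0)
```

Multiplying the result by the number of OSDs sharing one fast device gives a quick check on whether that device is large enough before the layout is committed.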

4. Best-fit boundary

Read together, these sources point to a narrower conclusion than the usual "Ceph scales" slogan. Ceph is strongest when a team really needs storage placement to follow physical reality: multiple hosts, meaningful failure domains, mixed device classes, and enough scale that a central placement service would become a liability.[1][2][4][6][7] In that environment, CRUSH keeps placement decentralized, PGs keep movement bounded, and BlueStore gives each OSD a storage-engine shape that can be tuned instead of guessed at.[1][3][4][5]

The mismatch is also clear. If an estate is too small to care about rack-aware placement, if the team is too operationally thin to manage topology and pool policy, or if it expects the system to erase every tuning decision, Ceph will feel heavier than its benefits justify. The docs themselves keep returning to the same truth: pools, CRUSH rules, PG budgets, device classes, WAL and DB layout, and cache targets all matter.[2][3][4][5] Ceph's real product is not simplicity. It is controlled complexity in exchange for bounded failure behavior.

That is why "failure arithmetic" is the right way to read Ceph in 2026. The system does not eliminate hard storage problems. It chooses where they live, makes the important boundaries explicit, and gives operators a way to express physical risk as placement policy instead of as folklore.

