Copy-on-write branching: why forking is free

The data structure that makes it possible to branch a million-row database in microseconds — and why this changes what simulation can be.

When you ask “what if we hired in Q3 instead of Q4?”, Simmis doesn’t copy your entire database to run the scenario. It branches it — a pointer flip, not a data migration. The branch shares everything with the original, costs almost nothing, and can be discarded the moment you’re done.

This isn’t a product trick. It’s a consequence of the data structure underneath. Understanding it changes how you think about what simulation should cost.

The problem with mutation

Most databases work by mutation: when you change a record, you overwrite the old value. This is efficient for single operations, but disastrous for branching. If you want to try two different scenarios simultaneously, you need two copies of the data — O(n) in the size of your dataset. Branching a million-row table means copying a million rows.

This is why most “what-if” analysis is sequential and expensive: you run one scenario, write down the result, restore the original state, then run the next. Tools that avoid the restore step, like spreadsheets with side-by-side scenario columns, can only vary a handful of cells, because copying the whole sheet for every scenario would grind to a halt.

Immutability changes the game

Persistent data structures take a different approach: values are never modified, only new values are created. The old value always exists, unchanged, and can be referenced from multiple places simultaneously.

The canonical structure is the hash array mapped trie (HAMT) — a tree where leaves are data and internal nodes are routing indices. Here’s a simplified picture:

[Figure: a simplified persistent trie, with a root, internal nodes A and B, and leaf values 12, 47, 83, 31, one leaf marked as the lookup target. Lookups traverse from root to leaf; nodes are never modified.]

To update a single leaf, you don’t modify any node. You create a new path from root to that leaf, copying only the nodes along the path. Every other leaf is shared with the original tree through the unchanged nodes.
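Path copying is easiest to see in code. Below is a minimal sketch in Python, using a binary trie rather than a 32-way HAMT so the branching stays visible; it is an illustration of the technique, not Simmis or Datahike code.

```python
# A minimal persistent binary trie illustrating path copying.
# Real HAMTs branch 32 ways on hash bits; a binary trie keeps the idea visible.

class Node:
    __slots__ = ("left", "right", "value")
    def __init__(self, left=None, right=None, value=None):
        self.left, self.right, self.value = left, right, value

def update(node, key, depth, bits, value):
    """Return a NEW root with `key` set; never mutate existing nodes."""
    if depth == bits:                          # reached a leaf
        return Node(value=value)
    branch = (key >> (bits - depth - 1)) & 1
    node = node or Node()
    if branch == 0:                            # copy the path into the left child
        return Node(left=update(node.left, key, depth + 1, bits, value),
                    right=node.right)          # right subtree shared by pointer
    return Node(left=node.left,                # left subtree shared by pointer
                right=update(node.right, key, depth + 1, bits, value))

def lookup(node, key, bits):
    for depth in range(bits):
        branch = (key >> (bits - depth - 1)) & 1
        node = node.left if branch == 0 else node.right
    return node.value

BITS = 8                                       # 256 possible keys, depth 8
root = None
for k, v in [(0b00000011, 12), (0b10000001, 47)]:
    root = update(root, k, 0, BITS, v)

fork = update(root, 0b00000011, 0, BITS, 99)   # fork: allocates one new path
assert lookup(root, 0b00000011, BITS) == 12    # original tree untouched
assert lookup(fork, 0b00000011, BITS) == 99    # fork sees the new value
assert fork.right is root.right                # untouched subtree is shared
```

The final assertion is the whole point: the fork's right subtree is the *same object* as the original's, shared by pointer, while only the nodes on the updated path were allocated.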

[Figure: path copying during a fork. Only the root, one internal node, and one new leaf (99 replacing 31) are allocated, 3 new nodes in total; everything else is referenced by pointer from the original tree.]

For a tree of depth d, a fork costs O(d) new nodes — and for a HAMT over millions of entries, d is around 6–10. The number of rows is irrelevant to the fork cost.
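A quick back-of-envelope check of the depth claim, assuming an ideal, fully populated tree with 32-way branching (5 hash bits consumed per level); real HAMTs sit a few levels from this ideal, and a 32-bit hash caps the depth at 7 regardless of entry count.

```python
import math

# Ideal depth of a fully populated 32-way trie: log base 32 of n.
# A fork allocates one node per level, so this is also the fork cost.
for n in (10**3, 10**6, 10**9):
    depth = math.ceil(math.log(n, 32))
    print(f"{n:>13,} entries -> depth {depth}")
# Depth grows by one only when the dataset grows 32-fold.
```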

What this means for simulation

In a mutable database, running two scenarios simultaneously requires two copies of the data. In a persistent database, you create two forks — each sharing the vast majority of their structure. The cost is proportional to the difference between scenarios, not to the size of the data.
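The “pay only for the difference” property can be mimicked in miniature with Python's standard `collections.ChainMap`: each fork is a thin overlay recording only its own changes, reading through to a shared base. A hedged sketch of the idea, not the Simmis API:

```python
from collections import ChainMap

# Shared base model: imagine a million rows; the forks never copy it.
base = {"q3_hires": 0, "q4_hires": 0, "budget": 1_000_000}

# Each fork is its own small delta dict plus a pointer to the base.
fork_q3 = ChainMap({"q3_hires": 5}, base)
fork_q4 = ChainMap({"q4_hires": 5}, base)

assert fork_q3["q3_hires"] == 5        # a fork sees its own change
assert fork_q3["budget"] == 1_000_000  # unchanged keys read through to base
assert base["q3_hires"] == 0           # the base is untouched
assert fork_q3.maps[1] is base         # shared by pointer, not copied
```

Each additional fork costs one small dict, however large the base grows, which is exactly the economics described above.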

[Figure: four scenarios (Q3 hire, Q4 hire, Q3 + staggered, Q4 + contractor) all branching from the same base model. Each additional fork costs O(log n); the shared structure is just pointers.]

This is the core property that makes Simmis different from a simulation tool that runs one scenario at a time. You can ask “what if” repeatedly, in parallel, without paying for the whole dataset each time. The system can hold hundreds of live scenario branches simultaneously — each independently queryable, each sharing everything it hasn’t changed.

The git analogy holds

Git uses the same idea. A commit doesn’t store a full copy of your codebase; it stores a tree of content-addressed objects, and branches are just named pointers into that tree. Branching is cheap in git because creating a branch doesn’t copy files; it writes a new pointer.

Datahike — the database at the bottom of the Simmis stack — applies this to data. Every transaction creates a new root pointer into an immutable tree. You can have as many named “heads” as you want, and querying any historical state means querying the tree rooted at that state’s snapshot.
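The head-pointer model can be sketched in a few lines, in the spirit of git and Datahike but not their actual APIs: each transaction is an immutable node pointing at its parent, and heads are just name-to-node entries.

```python
from types import MappingProxyType

def commit(parent, **delta):
    """Create an immutable snapshot node; the parent is never modified."""
    return MappingProxyType({"parent": parent, "delta": dict(delta)})

def lookup(node, key):
    """Query a historical state by walking back through its ancestry."""
    while node is not None:
        if key in node["delta"]:
            return node["delta"][key]
        node = node["parent"]
    raise KeyError(key)

root = commit(None, headcount=40)
heads = {"main": commit(root, headcount=42)}
heads["q3-hire"] = commit(heads["main"], headcount=47)  # branch: one new node

assert lookup(heads["main"], "headcount") == 42         # old head unchanged
assert lookup(heads["q3-hire"], "headcount") == 47      # new head sees update
assert heads["q3-hire"]["parent"] is heads["main"]      # ancestry is shared
```

Adding a head costs one dictionary entry; querying any historical state means reading from the node that head points at, with all shared history reached by pointer.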

Copy-on-write in the stack

Datahike handles the structured knowledge graph. Stratum handles columnar analytics. Both use copy-on-write persistence. The cost of forking is O(log n), the length of the copied path, not O(n) in dataset size. For most real-world scenarios this is effectively constant. (Retaining many long-lived forks does accumulate storage; the cheapness is in the branching, not in indefinite retention.)

Why Clojure

The Simmis stack is written in Clojure, and this isn’t incidental. Clojure is the language that most systematically embeds persistent data structures into everyday programming: every standard map and set is a persistent HAMT, and vectors use a closely related bit-partitioned trie, all structurally shared by default. You can’t accidentally mutate one. This is the same model described above: updates create new roots, unchanged structure is shared by pointer, and the old value is never destroyed.

That’s not universal. Haskell is arguably equal in theoretical purity — all data is immutable, and Okasaki’s purely functional data structures are part of the academic foundation this work builds on. Scala has persistent collections in its standard library. But in both cases the model competes with mutable alternatives, and the emphasis is weaker. In Clojure it’s the default, and the concurrency primitives — atoms, refs, agents — are all built on top of it. Rich Hickey designed the language explicitly around separating identity from value and time from state. The data structure philosophy propagates through the entire stack: Datahike (knowledge graph), Stratum (columnar analytics), and Spindel (reactive runtime) all apply the same model at their respective layers.

The one honest exception is the distributed boundary. When state spans multiple machines, you can’t have both consistency and availability under partition — the CAP theorem applies. Yggdrasil, our coordination layer, targets causal consistency: the weakest ordering guarantee that still lets you reason about cause and effect across nodes. Within a node, copy-on-write gives you perfect reproducibility. Across nodes, causal ordering is the limit. This is an architectural seam we don’t try to hide.

The deeper implication

When forking is cheap, the economics of simulation change. You don’t need to choose which scenario to run — you run all of them. You don’t need to plan the branching structure upfront — you branch opportunistically, whenever a new question arises.

This changes what simulation can be. Instead of a planned exercise — “let’s run a Monte Carlo study this quarter” — it becomes a natural part of how you think. You ask a question, the system forks, returns a result, and discards the fork if you don’t need it. The cost of a “what if” approaches zero.

That’s the real consequence of persistent data structures: they make exploration cheap enough that you do it constantly, not occasionally.

Simmis is built on these ideas. We're in early access — come think it through with us.

Get in touch · Try Simmis