Copy-on-write branching: why forking is free
The data structure that makes it possible to branch a million-row database in microseconds — and why this changes what simulation can be.
Simmis answers a question in three moves: model the domain, simulate scenarios on it, and learn from what happens. This note is the simulate step — why branching a scenario is nearly free, so you run all of them instead of choosing one. Same running decision: should we hire three engineers in Q3 instead of Q4?
When you ask “what if we hired in Q3 instead of Q4?”, Simmis doesn’t copy your entire database to run the scenario. It branches it — a pointer flip, not a data migration. The branch shares everything with the original, costs almost nothing, and can be discarded the moment you’re done.
This isn’t a product trick. It’s a consequence of the data structure underneath. Understanding it changes how you think about what simulation should cost.
The problem with mutation
Most databases work by mutation: when you change a record, you overwrite the old value. This is efficient for single operations, but disastrous for branching. If you want to try two different scenarios simultaneously, you need two copies of the data — O(n) in the size of your dataset. Branching a million-row table means copying a million rows.
This is why most “what-if” analysis is sequential and expensive: you run one scenario, write down the result, restore the original state, then run the next. Tools that don’t do this — like spreadsheets — are limited to a handful of cells because copying the whole sheet for every scenario would grind to a halt.
Immutability changes the game
Persistent data structures take a different approach: values are never modified, only new values are created. The old value always exists, unchanged, and can be referenced from multiple places simultaneously.
The canonical structure is the hash array mapped trie (HAMT) — introduced by Phil Bagwell in Ideal Hash Trees (2001) and the backbone of Clojure’s persistent collections — a tree where leaves are data and internal nodes are routing indices. Here’s a simplified picture:
To update a single leaf, you don’t modify any node. You create a new path from root to that leaf, copying only the nodes along the path. Every other leaf is shared with the original tree through the unchanged nodes.
For a tree of depth d, a fork costs O(d) new nodes — and for a HAMT over millions of entries, d is around 6–10. The number of rows is irrelevant to the fork cost.
What this means for simulation
In a mutable database, running two scenarios simultaneously requires two copies of the data. In a persistent database, you create two forks — each sharing the vast majority of their structure. The cost is proportional to the difference between scenarios, not to the size of the data.
This is the core property that makes Simmis different from a simulation tool that runs one scenario at a time. You can ask “what if” repeatedly, in parallel, without paying for the whole dataset each time. The system can hold hundreds of live scenario branches simultaneously — each independently queryable, each sharing everything it hasn’t changed.
Forking the computation, not just the data
Branching the data is only half of it. Spindel — the reactive runtime in the stack — applies the same copy-on-write idea to running computation: an execution context (every signal, every cached result, every continuation) is a value you can fork in O(1).
require('[org.replikativ.spindel.core :as s :refer [signal spin track]]
'[org.replikativ.spindel.incremental.interval :as iv])
def root: s/create-execution-context()
s/with-context(root
def hires: signal(2)
def velocity: spin(13 * iv/get-new(track(hires))))
def q3: s/fork-context(root)
s/with-context(q3 swap!(hires clojure.core/+ 3))
;; root still sees 2 hires; q3 sees 5 — both live, both recomputing
(require '[org.replikativ.spindel.core :as s :refer [signal spin track]]
'[org.replikativ.spindel.incremental.interval :as iv])
(def root (s/create-execution-context))
(s/with-context root
(def hires (signal 2))
(def velocity (spin (* 13 (iv/get-new (track hires)))))) ; 13 pts/engineer
(def q3 (s/fork-context root)) ; fork the live model — O(1)
(s/with-context q3 (swap! hires + 3)) ; only this branch hires in Q3
;; root still sees 2 hires; q3 sees 5 — both live, both recomputing
So a scenario isn’t just a branched dataset — it’s a branched world, with its reactive computations already wired up and re-running only what the change touched. Branching the database itself is covered in depth on datahike.io; this is the same move one layer up.
The git analogy holds
Git uses the same idea. A commit doesn’t store a full copy of your codebase — it stores a tree of content-addressed objects, and branches are just named pointers into that tree. Branches are cheap in git because switching branches doesn’t copy files; it changes a pointer.
Datahike — the database at the bottom of the Simmis stack — applies this to data. Every transaction creates a new root pointer into an immutable tree. You can have as many named “heads” as you want, and querying any historical state means querying the tree rooted at that state’s snapshot.
Why Clojure
The Simmis stack is written in Clojure, and this isn’t incidental. Clojure is the language that most systematically embeds persistent data structures into everyday programming — every standard vector, map, and set is a persistent, structurally-shared HAMT collection by default. You can’t accidentally mutate one. This is the same model described above: updates create new roots, unchanged structure is shared by pointer, and the old value is never destroyed.
That’s not universal. Haskell is arguably equal in theoretical purity — all data is immutable, and Okasaki’s purely functional data structures are part of the academic foundation this work builds on. Scala has persistent collections in its standard library. But in both cases the model competes with mutable alternatives, and the emphasis is weaker. In Clojure it’s the default, and the concurrency primitives — atoms, refs, agents — are all built on top of it. Rich Hickey designed the language explicitly around separating identity from value and time from state. The data structure philosophy propagates through the entire stack: Datahike (knowledge graph), Stratum (columnar analytics), and Spindel (reactive runtime) all apply the same model at their respective layers.
The one honest exception is the distributed boundary. When state spans multiple machines, you can’t have both consistency and availability under partition — the CAP theorem applies. Yggdrasil, our coordination layer, targets causal consistency: the weakest ordering guarantee that still lets you reason about cause and effect across nodes. Within a node, copy-on-write gives you perfect reproducibility. Across nodes, causal ordering is the limit. This is an architectural seam we don’t try to hide.
The deeper implication
When forking is cheap, the economics of simulation change. You don’t need to choose which scenario to run — you run all of them. You don’t need to plan the branching structure upfront — you branch opportunistically, whenever a new question arises.
This changes what simulation can be. Instead of a planned exercise — “let’s run a Monte Carlo study this quarter” — it becomes a natural part of how you think. You ask a question, the system forks, returns a result, and discards the fork if you don’t need it. The cost of a “what if” approaches zero.
That’s the real consequence of persistent data structures: they make exploration cheap enough that you do it constantly, not occasionally.
Simmis is built on these ideas. We're in early access — come think it through with us.