A from-scratch AlphaZero for 9×9 free-style gomoku on a single Mac. The model is the easy part. The real artifact is the harness: a receipt-driven lab of cooperating agents that runs the experiments while no one is watching.
Built in the spirit of “stop babysitting your agents”: dispatch the work, verify by receipt, and never block code behind the GPU. Several agents cooperate, each owning one domain, and only one of them may touch the GPU at any given time.
The live session driving the loop. Spins GPU slices, fans code & wiki work out to parallel sub-agents, and reads the queue for resume state on every wake.
The only process the GPU sees. Runs the race in ~300-second chunks; on a ~10-minute cron it asserts health, reads the scoreboard, and swaps stalled lanes by judgement.
Routes a new idea. A one-flag lever clones a cell; a code-heavy lever files a GitHub issue to be built. Enforces one lever per cell — and never runs the GPU.
Polls the GitHub ready queue, claims unblocked code-only issues, and hands each to an isolated worktree worker. Mirrors status back; touches no GPU.
A handful of conventions do all the work. They keep parallel sessions from entangling, make every result reproducible, and let the loop default to action instead of asking permission.
Heavy work runs one lane at a time on MPS. The GPU is the bottleneck, so it is never contended.
Code, docs and analysis fan out through worktrees and sub-agents — never serialized behind the GPU.
A lane that ran without filing a receipt — a ledger entry, baseline rows, a line in the narrative log — is invisible to the next session. The receipt is the work; the verdict carries the run ID, config, checkpoint and metrics.
Every change gets its own git worktree off main and is merged back with git merge --no-ff. Never rebase, never squash: explicit merge commits keep each experiment legible in the graph while the derby and other sessions share the repo concurrently.
The default probe is 60–90 seconds, not five minutes. Escalate to a longer run only when the signal sits inside the noise floor. Cheap evidence before expensive evidence; small-n is a hint, not a verdict.
Default-allow. Reversible local work just happens — no asking. Only hard-to-reverse or architectural moves (a remote push, a dependency bump, a model-architecture change) stop to confirm. Timing and context are not the same as permission.
Strength is measured against fixed baselines — a heuristic player and lookahead-N — never sibling head-to-head, which is non-transitive. A recipe that changes training behaviour can’t be promoted until a canary confirms it didn’t quietly regress quality.
Elo gained per wall-clock hour, measured from a common checkpoint against a stable anchor. Throughput proxies like augmentations-per-second are gameable means; this is the end. A generator that doubles aug/s but floods the trainer into a runaway loses on Δelo/Δt — and the lab is built to catch exactly that.
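As a rough illustration of the metric, the sketch below converts a score rate against the anchor into an Elo gap using the standard logistic Elo formula, then divides the gain by elapsed wall-clock hours. The function names, the clamping constant and the example numbers are assumptions for illustration; the source does not say how the lab computes Elo.

```python
import math

def elo_from_score(score: float, eps: float = 1e-3) -> float:
    """Elo gap implied by a score rate in [0, 1] against a fixed anchor."""
    s = min(max(score, eps), 1.0 - eps)  # clamp so 0% and 100% stay finite
    return 400.0 * math.log10(s / (1.0 - s))

def elo_per_hour(score_start: float, score_end: float, wall_seconds: float) -> float:
    """Elo gained per wall-clock hour between two evals vs. the same anchor."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    gain = elo_from_score(score_end) - elo_from_score(score_start)
    return gain / (wall_seconds / 3600.0)

# Two hypothetical lanes forked from one checkpoint (50% vs. the anchor).
steady = elo_per_hour(0.50, 0.64, wall_seconds=1800)   # modest aug/s, real gain
runaway = elo_per_hour(0.50, 0.47, wall_seconds=1800)  # high aug/s, regressed
print(f"steady: {steady:+.0f} Elo/h, runaway: {runaway:+.0f} Elo/h")
```

Under this measure the second lane scores negative no matter how many augmentations per second it produced.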
And the engine itself: the Δelo Derby races training recipes — each differing from the reigning champion by exactly one lever — toward a wall-time cap. A priority queue feeds the steepest climber first, so leaders get more GPU while everyone still finishes, and the board always shows the best model so far. State is written atomically after every chunk, so an overnight run resumes from a crash without losing a step.
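The scheduling and crash-safety ideas can be sketched in a few lines. This is a minimal illustration, not the lab's actual code: the lane dictionary layout, the `run_chunk` callback and the JSON state file are all hypothetical, and write-then-rename is just one common way to get an atomic update.

```python
import heapq
import json
import os
import tempfile

def write_state_atomically(path, state):
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())      # make sure the bytes are on disk first
        os.replace(tmp_path, path)    # atomic swap: readers see old or new, never half
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

def run_derby(lanes, run_chunk, state_path, chunk_seconds=300, cap_seconds=900):
    """lanes: {name: {"elapsed": seconds, "slope": elo_per_hour}} (hypothetical layout).

    run_chunk(name, seconds) trains one chunk and returns the lane's new slope.
    Returns the order in which lanes were given the GPU.
    """
    # heapq is a min-heap, so negate the slope to pop the steepest climber first.
    heap = [(-lane["slope"], name) for name, lane in lanes.items()
            if lane["elapsed"] < cap_seconds]
    heapq.heapify(heap)
    order = []
    while heap:
        _, name = heapq.heappop(heap)
        lane = lanes[name]
        lane["slope"] = run_chunk(name, chunk_seconds)
        lane["elapsed"] += chunk_seconds
        order.append(name)
        write_state_atomically(state_path, lanes)  # checkpoint after every chunk
        if lane["elapsed"] < cap_seconds:          # capped lanes leave the queue
            heapq.heappush(heap, (-lane["slope"], name))
    return order
```

Because a lane is re-queued until it reaches the cap, the leader is served first but every lane still finishes. On restart, a resumed run would load the state file and rebuild the heap from the recorded `elapsed` and `slope` values.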
No human games, no openings book, no curriculum. The system learns entirely from its own play, search and evaluation looping into each other.
The model plays itself to generate every training game. The only teacher is its own search.
A tree search that balances the network’s move prior (explore) against accumulated win-rate (exploit) to choose a move.
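The explore/exploit balance is usually expressed with the PUCT rule from the AlphaZero family: each child is scored by its mean value plus a prior-weighted bonus that shrinks as the child is visited. The sketch below assumes that rule; the `Node` class and the `c_puct` value are hypothetical, not taken from the project.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float                 # network's move probability for this child
    visits: int = 0
    value_sum: float = 0.0       # sum of backed-up values
    children: dict = field(default_factory=dict)

    def q(self) -> float:
        """Mean backed-up value; zero for an unvisited node."""
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node: Node, c_puct: float = 1.5):
    """Return (move, child) maximising Q + c_puct * P * sqrt(N_parent) / (1 + N_child)."""
    sqrt_parent = math.sqrt(max(1, node.visits))

    def score(child: Node) -> float:
        exploit = child.q()
        explore = c_puct * child.prior * sqrt_parent / (1 + child.visits)
        return exploit + explore

    return max(node.children.items(), key=lambda item: score(item[1]))

root = Node(prior=1.0, visits=10)
root.children["a"] = Node(prior=0.6, visits=8, value_sum=2.0)  # well explored
root.children["b"] = Node(prior=0.4)                           # never visited
move, _ = select_child(root)
```

Here the unvisited move wins the selection because its exploration bonus is not yet divided by any visits.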
One residual trunk, two heads: predicted move probabilities, and predicted outcome from this position.
Train on searched games → generate stronger games with the new weights → evaluate vs fixed baselines → repeat.
Leaf positions from many parallel games are pooled into one forward pass to keep the GPU saturated.
Every position is rotated and mirrored into its 8 board symmetries — eight times the data, for free.
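A square board has eight symmetries: four rotations, each with and without a mirror. The sketch below generates them with plain lists and applies the same transform to the policy target so that move probabilities stay aligned with the stones. The function names and the (board, policy, value) sample layout are assumptions for illustration.

```python
def rotate90(grid):
    """Rotate a square grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def mirror(grid):
    """Flip a grid left to right."""
    return [list(reversed(row)) for row in grid]

def symmetries(grid):
    """Return the 8 dihedral transforms of a square grid, in a fixed order."""
    out = []
    g = [list(row) for row in grid]
    for _ in range(4):
        out.append(g)
        out.append(mirror(g))
        g = rotate90(g)
    return out

def augment(board, policy, value):
    """Expand one training sample into 8; the value target is unchanged."""
    return [(b, p, value) for b, p in zip(symmetries(board), symmetries(policy))]

board = [[1, 2, 0],
         [0, 0, 0],
         [0, 0, 0]]
policy = [[0.0, 1.0, 0.0],
          [0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]
samples = augment(board, policy, value=1)
```

Because `symmetries` uses the same ordering for both grids, pairing them with `zip` keeps each policy in the same orientation as its board.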
Early on, both selves play equally weak defence — so MCTS at a small budget keeps confirming its own attacking prior and never explores the blocks. Games degenerate into a 5-to-10-move blitz where whoever strikes first wins, and the network never learns to defend. The value head finds the result trivially easy to predict, which hides the problem.
More simulations alone don’t fix it: the replay buffer itself is starved of the positions that would teach defence. The fix had to change what the model sees, not just how hard it looks.
Four levers, each isolated and raced in the derby. Stacked, they break the collapse and climb past the whole opponent ladder.
A fast solver checks each leaf for a forced win-by-continuous-fours and backs up a proven value. Teaches the net to see forcing threats without having to search them. → a decisive win over the baseline.
Sample the replay buffer uniformly across all of history instead of crowding recent games. Flattens the buffer’s composition and breaks the fast-attack feedback loop. → defence finally gets gradient.
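To make the contrast concrete, the sketch below compares a uniform sampler with a recency-weighted one. The recency-weighted version is only an assumed stand-in for "crowding recent games"; the source does not describe the previous sampler, and the half-life parameter is hypothetical.

```python
import random

def sample_uniform(buffer, batch_size, rng=random):
    """Every stored position is equally likely, however old it is."""
    return rng.sample(buffer, min(batch_size, len(buffer)))

def sample_recency_weighted(buffer, batch_size, half_life, rng=random):
    """Assumed baseline: weight halves for every `half_life` positions of age."""
    n = len(buffer)
    weights = [0.5 ** ((n - 1 - i) / half_life) for i in range(n)]
    return rng.choices(buffer, weights=weights, k=batch_size)

rng = random.Random(0)
buffer = list(range(1000))  # stand-in positions; the index doubles as insertion order
uniform_batch = sample_uniform(buffer, 200, rng)
recent_batch = sample_recency_weighted(buffer, 200, half_life=50, rng=rng)
mean_uniform = sum(uniform_batch) / len(uniform_batch)
mean_recent = sum(recent_batch) / len(recent_batch)
```

With the uniform sampler the mean index of a batch sits near the middle of the buffer, while the recency-weighted batch clusters near the newest entries, which is the composition shift the lever removes.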
Slightly discount slow wins, so a quick, clean win trains a sharper gradient than a long grind. → broke the defensive-draw stall without re-collapsing into blitz.
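One plausible reading of this lever is a per-position discount on the value target: the further a position is from the end of the game, the more the final result is shrunk toward zero. The sketch below assumes that reading; the discount factor, the function name and the sign convention are all hypothetical, and the project may apply the discount differently.

```python
def value_targets(num_positions, winner, gamma=0.99):
    """Discounted value targets from the side-to-move's perspective.

    winner: +1 if the first player won, -1 if the second player won, 0 for a draw.
    Position `ply` is assumed to be the state before move number `ply`.
    """
    targets = []
    for ply in range(num_positions):
        side = 1 if ply % 2 == 0 else -1        # who is to move at this position
        plies_left = num_positions - 1 - ply    # distance to the final position
        targets.append(side * winner * gamma ** plies_left)
    return targets

quick = value_targets(9, winner=1)    # short game: first target stays near 1
slow = value_targets(61, winner=1)    # long grind: first target is noticeably smaller
```

Under this scheme a win reached in fewer moves produces targets closer to ±1 at the same early positions, so quick wins pull the value head harder than slow ones.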
At evaluation time, give unvisited moves a smarter default value than zero (first-play urgency, FPU) so the search spreads more evenly. Zero training cost. → cracks the deepest lookahead opponents.
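The source does not say which default replaces zero. A common choice for first-play urgency is the parent's mean value minus a small reduction, and the sketch below assumes that variant on top of a PUCT-style score. All names and constants are hypothetical.

```python
import math

def puct_scores(parent_q, parent_visits, children, c_puct=1.5, fpu_reduction=None):
    """Score children given as (prior, visits, value_sum) tuples.

    fpu_reduction=None keeps the plain default of 0 for unvisited moves;
    otherwise unvisited moves start at parent_q - fpu_reduction (assumed variant).
    """
    sqrt_parent = math.sqrt(max(1, parent_visits))
    scores = []
    for prior, visits, value_sum in children:
        if visits > 0:
            q = value_sum / visits
        elif fpu_reduction is None:
            q = 0.0
        else:
            q = parent_q - fpu_reduction
        scores.append(q + c_puct * prior * sqrt_parent / (1 + visits))
    return scores

# A position the search already likes (parent mean value 0.6, values in [-1, 1]).
children = [(0.5, 20, 12.0),   # heavily visited favourite, mean value 0.6
            (0.1, 0, 0.0)]     # low-prior move that has never been tried
zero_default = puct_scores(0.6, 20, children)
with_fpu = puct_scores(0.6, 20, children, fpu_reduction=0.2)
```

In this example the zero default leaves the untried move below the favourite, so it stays unexplored; with the assumed FPU default it scores higher and gets its first visit.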
The winning recipe, built one lever at a time. Champion: run 44cxzc9d, trained from scratch on the M5 Max.
| Fixed opponent | Win rate as Black | Loss rate as White |
|---|---|---|
| Heuristic | 100% | 0% |
| Lookahead-2 | 100% | 0% |
| Lookahead-4 | 100% | 0% |
| Lookahead-6 * | ~86–90% | 0% |
| Lookahead-8 * | 100% | 0% |
The stated 100% target — win all as Black, lose none as White against the heuristic and lookahead-2/4 — is met at default settings, with no eval-time tricks.
* Rows marked with an asterisk use the eval-only levers (FPU + tree-reuse), which generalise to the harder lookahead-6/8 opponents.