Gomoku — an AlphaZero, and the lab that trained it

01 — The Harness

A small crew of agents,
and one rule they never break.

Built in the spirit of “stop babysitting your agents”: dispatch the work, verify by receipt, and never block code behind the GPU. Several agents cooperate, each owning one domain, and only one of them is allowed to touch the GPU.

Orchestrator

owns both queues

The live session driving the loop. Spins up GPU slices, fans code & wiki work out to parallel sub-agents, and reads the queue for resume state on every wake.

Derby Runner

single serial GPU executor

The only process the GPU sees. Runs the race in ~300-second chunks; on a ~10-minute cron it asserts health, reads the scoreboard, and swaps stalled lanes by judgement.

Derby Register

intake

Routes a new idea. A one-flag lever clones a cell; a code-heavy lever files a GitHub issue to be built. Enforces one lever per cell — and never runs the GPU.

Reviewer

the audit gate

Spawned fresh after each finished lane to check the math and the receipts against the charter. Three verdicts — APPROVE / REVISE / BLOCK — and only BLOCK halts the loop.

Issue Runner

code-only dispatcher

Polls the GitHub ready queue, claims unblocked code-only issues, and hands each to an isolated worktree worker. Mirrors status back; touches no GPU.

Janitor + Watchdog

hygiene & heartbeat

At session start the janitor reclaims worktrees leaked by crashed sessions and prints a hygiene gauge. An off-minute cron re-invokes the loop and pushes a one-line “needs you” on escalation.

GPU lane: strictly serial · code lane: fans out in parallel · meta lane: audits & upkeep

02 — The Protocol

The rules that let it run unattended.

A handful of conventions do all the work. They keep parallel sessions from entangling, make every result reproducible, and let the loop default to action instead of asking permission.

GPU-required serial · one at a time

train slice · --max-wall-secs

self-play generation · wave-batched

eval probe · vs fixed baselines

Heavy work runs one lane at a time on MPS. The GPU is the bottleneck, so it is never contended.

Everything else parallel · fan-out

code

wiki

plots

audits

Code, docs and analysis fan out through worktrees and sub-agents — never serialized behind the GPU.

R1 · Receipts are
the record

A lane that ran without filing a receipt — a ledger entry, baseline rows, a line in the narrative log — is invisible to the next session. The receipt is the work; the verdict carries the run ID, config, checkpoint and metrics.
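
One way to picture a receipt is as a small record appended to a ledger file. The sketch below assumes a JSON-lines ledger; the `Receipt` class, its field names and the `file_receipt` / `load_receipts` helpers are illustrative, not the project's real schema. Only the four fields named above (run ID, config, checkpoint, metrics) and the verdict come from the text.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class Receipt:
    run_id: str
    config: dict
    checkpoint: str
    metrics: dict
    verdict: str = "PENDING"   # hypothetical default until the Reviewer rules


def file_receipt(ledger: Path, receipt: Receipt) -> None:
    """Append one receipt as a JSON line; existing lines are never rewritten."""
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(receipt), sort_keys=True) + "\n")


def load_receipts(ledger: Path) -> list[Receipt]:
    """What the next session sees: only lanes that filed a receipt."""
    if not ledger.exists():
        return []
    with ledger.open(encoding="utf-8") as fh:
        return [Receipt(**json.loads(line)) for line in fh if line.strip()]
```

A lane that never calls `file_receipt` simply does not appear in `load_receipts`, which is the sense in which unreceipted work is invisible.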

R2 · Worktree per
unit of work

Every change gets its own git worktree off main, merged back with merge --no-ff. Never rebase, never squash — explicit merge commits keep each experiment legible in the graph while the derby and other sessions share the repo concurrently.

R3 · Smoke
first

The default probe is 60–90 seconds, not five minutes. Escalate to a longer run only when the signal sits inside the noise floor. Cheap evidence before expensive evidence; small-n is a hint, not a verdict.

R4 · Autonomy is
a deny-list

Default-allow. Reversible local work just happens — no asking. Only hard-to-reverse or architectural moves (a remote push, a dependency bump, a model-architecture change) stop to confirm. Timing and context are not the same as permission.
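
As a rough sketch, a deny-list policy is a membership check with a permissive fallback. The action names and the `may_proceed` helper below are hypothetical; the text gives only the three example categories.

```python
# Hypothetical action names, taken from the three examples in the text.
NEEDS_CONFIRMATION = {
    "remote_push",
    "dependency_bump",
    "model_architecture_change",
}


def may_proceed(action: str, confirmed: bool = False) -> bool:
    """Default-allow: anything not on the deny-list just happens."""
    if action in NEEDS_CONFIRMATION:
        return confirmed   # hard-to-reverse moves wait for an explicit yes
    return True
```

The point of the design is the final `return True`: an action nobody thought to list is permitted, not blocked.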

R5 · Verify before
you trust

Strength is measured against fixed baselines — a heuristic player and lookahead-N — never sibling head-to-head, which is non-transitive. A recipe that changes training behaviour can’t be promoted until a canary confirms it didn’t quietly regress quality.

Δelo/Δt

The north star — “delta-e”

Elo gained per wall-clock hour, measured from a common checkpoint against a stable anchor. Throughput proxies like augmentations-per-second are gameable means; this is the end. A generator that doubles aug/s but floods the trainer into a runaway loses on Δelo/Δt — and the lab is built to catch exactly that.
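
The metric itself is simple arithmetic. The sketch below assumes strength is read off as a score fraction against the anchor and converted with the standard logistic Elo model; the function names, the clamping constant `eps` and the choice of conversion are assumptions, not the lab's documented procedure.

```python
import math


def elo_vs_anchor(score: float, eps: float = 1e-3) -> float:
    """Elo difference implied by a score fraction against the anchor.

    Standard logistic model: score = 1 / (1 + 10 ** (-diff / 400)).
    The score is clamped so that 0% and 100% stay finite.
    """
    s = min(max(score, eps), 1.0 - eps)
    return 400.0 * math.log10(s / (1.0 - s))


def delta_e(score_start: float, score_end: float, wall_secs: float) -> float:
    """Elo gained per wall-clock hour between two measurements."""
    if wall_secs <= 0:
        raise ValueError("wall_secs must be positive")
    gain = elo_vs_anchor(score_end) - elo_vs_anchor(score_start)
    return gain / (wall_secs / 3600.0)
```

Because the denominator is wall-clock time, a recipe that generates data faster but learns no faster scores no better here.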

And the engine itself: the Δelo Derby races training recipes — each differing from the reigning champion by exactly one lever — toward a wall-time cap. A priority queue feeds the steepest climber first, so leaders get more GPU while everyone still finishes, and the board always shows the best model so far. State is written atomically after every chunk, so an overnight run resumes from a crash without losing a step.
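
The scheduling and crash-safety ideas in that paragraph can be sketched together. This is an illustration under stated assumptions, not the derby's real code: `run_derby`, `save_state_atomic`, the lane dictionary layout and the `run_chunk` callback (which returns the Elo gained in one chunk) are all hypothetical, and "steepest climber" is read here as Elo gained per second so far.

```python
import heapq
import json
import os
import tempfile


def save_state_atomic(path: str, state: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)   # readers see the old file or the new one, never half
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def run_derby(lanes, cap_secs, chunk_secs, run_chunk, state_path):
    """lanes maps name -> {"elo": float, "secs": int}; mutated in place."""
    def slope(lane):   # Elo per second so far; unstarted lanes go first
        return float("inf") if lane["secs"] == 0 else lane["elo"] / lane["secs"]

    heap = [(-slope(lane), name) for name, lane in lanes.items()]
    heapq.heapify(heap)
    order = []
    while heap:
        _, name = heapq.heappop(heap)           # steepest climber first
        lane = lanes[name]
        lane["elo"] += run_chunk(name)          # hypothetical: Elo gained this chunk
        lane["secs"] += chunk_secs
        order.append(name)
        save_state_atomic(state_path, lanes)    # resume point after every chunk
        if lane["secs"] < cap_secs:             # every lane runs until the cap
            heapq.heappush(heap, (-slope(lane), name))
    return order
```

With two lanes where one climbs ten times faster, the faster lane reaches its cap first, yet the slower lane still runs all of its chunks.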

03 — The Method

Enough AlphaZero
to tell the truth about it.

No human games, no openings book, no curriculum. The system learns entirely from its own play, with self-play, search and evaluation feeding into each other.

Self-play

The model plays itself to generate every training game. The only teacher is its own search.

PUCT MCTS

A tree search that balances the network’s move prior (explore) against accumulated win-rate (exploit) to choose a move.
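
The selection rule can be written out as a short sketch. The formula is the usual AlphaZero-style PUCT score, Q + c_puct · P · √N_parent / (1 + N_child); the constant `c_puct = 1.5`, the dictionary layout and the function names are assumptions chosen for illustration.

```python
import math


def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5):
    """Exploit term (q) plus a prior-weighted exploration bonus."""
    explore = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + explore


def select_move(children, c_puct=1.5):
    """children maps move -> {"prior": float, "visits": int, "value_sum": float}."""
    parent_visits = max(1, sum(c["visits"] for c in children.values()))

    def score(move):
        c = children[move]
        q = c["value_sum"] / c["visits"] if c["visits"] else 0.0
        return puct_score(q, c["prior"], parent_visits, c["visits"], c_puct)

    return max(children, key=score)
```

The bonus shrinks as a child accumulates visits, so a move with a high prior gets tried early and then has to earn further visits through its measured value.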

Policy + value net

One residual trunk, two heads: predicted move probabilities, and predicted outcome from this position.

The loop

Train on searched games → generate stronger games with the new weights → evaluate vs fixed baselines → repeat.

Wave-batched eval

Leaf positions from many parallel games are pooled into one forward pass to keep the GPU saturated.
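
A minimal sketch of the pooling step, assuming each parallel game hands over a list of pending leaf positions and that `model_batch` is a hypothetical callable mapping a list of positions to a same-length, same-order list of outputs:

```python
def evaluate_wave(games, model_batch):
    """Pool leaves from all games into one batch, then scatter results back.

    games: list of lists of leaf positions, one inner list per game.
    model_batch: hypothetical callable, called at most once per wave.
    """
    flat, owners = [], []
    for game_idx, leaves in enumerate(games):
        for leaf in leaves:
            flat.append(leaf)
            owners.append(game_idx)     # remember which game each leaf came from

    outputs = model_batch(flat) if flat else []

    results = [[] for _ in games]
    for game_idx, out in zip(owners, outputs):
        results[game_idx].append(out)   # per-game order is preserved
    return results
```

The network sees one large batch per wave instead of one small batch per game, which is where the saturation comes from.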

D4 augmentation

Every position is rotated and mirrored into its 8 board symmetries — eight times the data, for free.
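
The eight symmetries are four rotations, each with and without a mirror flip. The sketch below works on a plain list-of-rows board and uses hypothetical helper names; in a real pipeline the policy target would need the same transform applied as the board.

```python
def rot90(board):
    """Rotate a square board (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*board[::-1])]


def mirror(board):
    """Flip a board left to right."""
    return [row[::-1] for row in board]


def d4_symmetries(board):
    """Return the 8 images of `board` under the dihedral group D4."""
    out = []
    current = [list(row) for row in board]
    for _ in range(4):
        out.append(current)
        out.append(mirror(current))
        current = rot90(current)
    return out
```

For a position with no symmetry of its own, all eight images are distinct; a perfectly symmetric position yields repeats.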

The thing that had to be solved: fast-attack collapse.

Early on, both selves play equally weak defence — so MCTS at a small budget keeps confirming its own attacking prior and never explores the blocks. Games degenerate into a 5-to-10-move blitz where whoever strikes first wins, and the network never learns to defend. The value head finds the result trivially easy to predict, which hides the problem.

More simulations alone don’t fix it: the replay buffer itself is starved of the positions that would teach defence. The fix had to change what the model sees, not just how hard it looks.

What actually worked.

Four levers, each isolated and raced in the derby. Stacked, they break the collapse and climb past the whole opponent ladder.

In-search VCF

confirmed

A fast solver checks each leaf for a forced win-by-continuous-fours and backs up a proven value. Teaches the net to see forcing threats without having to search them. → a decisive win over the baseline.

Global-pool sampling

confirmed

Sample the replay buffer uniformly across all of history instead of crowding recent games. Flattens the buffer’s composition and breaks the fast-attack feedback loop. → defence finally gets gradient.
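
The difference between the two sampling schemes fits in a few lines. The sketch assumes the buffer is an ordered sequence, oldest first; the function names are hypothetical, and the recent-window sampler is only one plausible reading of "crowding recent games".

```python
import random


def sample_global_pool(buffer, batch_size, rng=random):
    """Uniform over the entire history: every stored position is equally likely."""
    return [buffer[rng.randrange(len(buffer))] for _ in range(batch_size)]


def sample_recent_window(buffer, batch_size, window, rng=random):
    """Contrast case (an assumption): draw only from the newest `window` positions."""
    start = max(0, len(buffer) - window)
    return [buffer[rng.randrange(start, len(buffer))] for _ in range(batch_size)]
```

Under the windowed scheme, older games drop out of training entirely once the window has moved past them; under the global pool they keep contributing gradient.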

Value-discount (γ≈0.98)

confirmed

Slightly discount slow wins, so a quick, clean win trains a sharper gradient than a long grind. → broke the defensive-draw stall without re-collapsing into blitz.
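
One plausible form of the discount, shown as a sketch: scale each position's outcome target by γ raised to the number of plies remaining until the game ends. The exact formula the lab uses is not stated, so this form and the name `discounted_value_targets` are assumptions.

```python
def discounted_value_targets(outcomes_from_mover, gamma=0.98):
    """outcomes_from_mover[t] is +1, -1 or 0: the final result as seen by
    the player to move at ply t. Returns the gamma-discounted targets."""
    last = len(outcomes_from_mover) - 1
    return [z * gamma ** (last - t) for t, z in enumerate(outcomes_from_mover)]
```

Under this form a position two plies from the end keeps nearly its full target, while the opening position of a long game is pulled toward zero, so short decisive games carry larger targets than long ones.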

FPU reduction

eval-only · free

At evaluation time, give unvisited moves a smarter default than zero so the search spreads more evenly. Zero training cost. → cracks the deepest lookahead opponents.
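
A common formulation of first-play urgency (FPU) reduction sets an unvisited child's value to the parent's value minus a fixed reduction. Whether the lab uses exactly this variant is not stated, so treat the sketch, the `child_q` name and the `fpu_reduction` parameter as assumptions.

```python
def child_q(child, parent_q, fpu_reduction=None):
    """Mean value of a child, with a first-play-urgency default if unvisited.

    child: {"visits": int, "value_sum": float}
    fpu_reduction=None reproduces the plain default of 0.0.
    """
    if child["visits"] > 0:
        return child["value_sum"] / child["visits"]
    if fpu_reduction is None:
        return 0.0                      # baseline: unvisited moves look neutral
    return parent_q - fpu_reduction     # tied to how the parent is doing
```

Since only the default for unvisited moves changes, the lever needs no retraining and can be switched on at evaluation time alone.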

04 — The Outcome

100%

of the opponent ladder, solved

vcf + global-pool + value-discount

The winning recipe, built one lever at a time. Champion: run 44cxzc9d, trained from scratch on the M5 Max.

Fixed opponent	win as Black	loss as White
Heuristic	100%	0%
Lookahead-2	100%	0%
Lookahead-4	100%	0%
Lookahead-6 *	~86–90%	0%
Lookahead-8 *	100%	0%

The stated 100% target — win all as Black, lose none as White against the heuristic and lookahead-2/4 — is met at default settings, with no eval-time tricks.
* With the eval-only levers (FPU + tree-reuse) switched on, the same model generalises to the harder lookahead-6/8 opponents.
