You kicked off a twelve-agent run, walked away for coffee, and came back to a progress tree that has not moved in twenty minutes. One box still spinning, nothing errored, nothing finished. So you kill it, start over, and watch the token meter climb a second time to pay for work that half those agents already did. That second bill is the most expensive thing about running agent swarms, and almost none of it is necessary.
So I did something recursive: I pointed a multi-agent workflow at the question of why multi-agent workflows stall, then handed one agent my own answer and told it to tear it apart. The fix that survived was smaller than the one I started with, and that was the lesson. You do not harden a swarm by adding machinery. You harden it by leaning on what the orchestration layer already does, and adding the few things it does not.
Most of the stall-proofing people write by hand is already built in, and the dynamic workflows runtime spells out the parts that matter. You get per-item fault containment, so one bad item drops out and the rest keep flowing. A dead agent comes back as a null instead of crashing the run. Concurrency is capped, so you cannot summon a thousand agents at once. Structured output self-heals when an agent returns the wrong shape. And a run resumes: relaunch it and every unchanged step replays from cache, so only new work costs anything. Rebuild any of that by hand and you usually end up fighting it.
Two words trip people up, so in plain terms: parallel means the run waits for every agent in a batch to finish before it moves on, and pipeline means each item moves ahead on its own without waiting for the others. With that settled, here are the six rules, each one closing a specific way runs die:
Rules three through five are most of the actual code. A budgeted, bounded, deterministic spawn loop is about ten lines:
const RESERVE = 0.05 * (budget.total || 1); // headroom for the next wave
let work = seed, done = [];
for (let wave = 0; wave < MAX_WAVES; wave++) { // rule 4: a hard ceiling
if (budget.total && budget.remaining() < RESERVE) break; // rule 3: budget guard
const out = (await pipeline(work.slice(0, 16), ...stages))
.filter(Boolean); // drop the dead agents
done.push(...out);
work = work.slice(16).concat(discoverMore(out)); // new work goes to the END (rule 5)
if (!work.length) break;
}
return { results: done, dropped: seed.length - done.length }; // rule 2: fail loud in the return
A workflow you launch directly runs in the background and pings you when it finishes. It has no heartbeat. If it wedges in the middle, nothing notices, because the thing that wedged is the thing that would have to notice. So the watchdog has to live one level up: wrap the launch in a loop with a long fallback timer (I use about twenty-five minutes, never five, since the cache only stays warm that long), and the loop wakes on its own, sees the run is stuck, and relaunches it with resume. Resume only works within the same session and only if you kept rule five, but when it holds it turns a dead twenty-minute run into a thirty-second catch-up.
The watchdog is not a timer inside your script. It is the loop around your script.
I did not want to retype any of this. The six rules and the watchdog note live in one file,
~/.claude/resilience-preamble.md, and a single line in my standing instructions points at
it:
For any multi-agent Workflow, apply ~/.claude/resilience-preamble.md.
After that, starting a resilient run is one sentence: build this per that file, and run it under a loop with a watchdog. That is the whole ritual, and I never turned it into a thirteenth standing rule I would have to remember to obey.
Set it up once: one file, one pointer line, one loop. After that, resilience is a sentence at the top of a request, not something you sit at your desk to babysit.
Say I want to audit three hundred Terraform files for missing owner tags. The whole interaction is one sentence:
Audit every .tf file under infra/ for missing owner tags. Build it as a resilient workflow per ~/.claude/resilience-preamble.md, and run it under a loop with a watchdog.
The run fans the files out in waves of sixteen, each wave a pipeline so a slow file never blocks the
others. Around file one hundred eighty an agent wedges on a giant module and the run goes quiet.
Twenty-five minutes later the loop wakes up, sees nothing finished, and relaunches with resume: the
files that already passed come back from cache for free, and it carries on from where it died. I never
touched it. It ends by handing back { results: 297, dropped: 3 }, and because rule two put
that dropped count in the return value, I know to eyeball those three files by hand instead of
assuming all three hundred came back clean.
It is not a silver bullet, and the honest edges are worth naming. A timeout cannot interrupt a synchronous hot loop; only the watchdog one level up can. Moving on from a hung agent does not cancel it, so it keeps holding its slot until it really dies. And a budget that runs out mid-wave gives you a partial result, not a clean stop. None of those are reasons to skip the setup. They are the reason to trust the loop around your script and not a timer inside it, which is exactly why the smaller recipe beat the bigger one.
I served in the U.S. Army, specializing in Network Switching Systems and was attached to a Patriot Missile System Battalion. After my deployment and Honorable discharge, I went to college in Jacksonville, FL for Computer Science. I have two beautiful and very intelligent daughters. I have more than 20 years professional IT experience. This page is made to learn and have fun. If its messed up, let me know. Im still learning :)