Why do small per-step error rates cause large multi-step agent failures?

Your agent is 95% reliable per step. That sounds great. Over a 20-step task it succeeds about a third of the time. Here is the math that quietly kills long-horizon agents, and how to beat it.

Balagei G Nagarajan

7 MIN READ


[Figure: a chain of agent steps where small cracks in each link compound until the chain breaks]

Each step looks almost perfect. The small cracks compound down the chain until the whole thing breaks. Reliability multiplies.

Short answer. Reliability multiplies; it does not average. An agent that is 95% reliable on each step is not 95% reliable overall. Across a 20-step task, the end-to-end rate is 0.95 to the 20th power, roughly 36%. Small per-step error rates that look excellent in isolation compound into majority failure once the chain is long enough.

Key facts.

  • The math is unforgiving: at 95% per-step reliability, a 20-step task succeeds about 36% of the time; even at an unusually high 99% per step, it is only about 82%.
  • Success collapses with task length in practice: frontier models hit near-100% on tasks a human finishes in under 4 minutes, but under 10% on tasks taking over about 4 hours (METR, Measuring AI Ability to Complete Long Tasks, 2025).
  • A per-step error rate as small as 0.22% makes failure more likely than not after a few hundred steps and near-certain after a couple of thousand; decomposition plus per-step voting drove errors to zero across a million steps (MAKER, 2025).
  • Multi-agent framework failure rates run 41% to 86.7% (Cemri et al., MAST), and error propagation is identified as the dominant agent failure pattern (AgentErrorBench).
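
The per-step voting mentioned in the key facts can be made concrete with a deliberately simplified model. The sketch below assumes plain majority voting over k independent samples of the same step, which is not necessarily the voting scheme MAKER uses; the function names are illustrative only.

```python
from math import comb


def majority_vote_reliability(p: float, k: int) -> float:
    """Probability that more than half of k independent samples are correct.

    Assumption: samples are independent and each is correct with probability p.
    Real samples from one model are correlated, so treat this as optimistic.
    """
    if k % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    need = k // 2 + 1  # smallest strict majority
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(need, k + 1))


def chain_success(p_step: float, n_steps: int) -> float:
    """End-to-end success when every one of n_steps must succeed."""
    return p_step**n_steps


for k in (1, 3, 5):
    p_voted = majority_vote_reliability(0.95, k)
    print(f"k={k}: per-step {p_voted:.4f}, 20-step chain {chain_success(p_voted, 20):.3f}")
```

Under these assumptions, three voters lift a 95% step to about 99.3%, which is enough to move a 20-step chain from roughly 36% to the mid-80s. The cost is k times as many calls per step.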

Why does reliability multiply instead of average?

Because a multi-step task succeeds only if every step succeeds. If step one works 95% of the time and step two works 95% of the time, the chance both work is 0.95 times 0.95, about 90%, not 95%. Extend that to twenty steps and you multiply 0.95 by itself twenty times, which lands near 36%. Each step is a gate, and the gates are in series, so their success probabilities multiply and every added gate can only lower the total. This treats step failures as independent, which is a simplification. A chain like this is only as strong as the product of its links, not their average. The per-step number can look excellent while the end-to-end number is a coin flip or worse.
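
Written out, the multiplication looks like this. The symbols are introduced here for illustration: p_i is the success probability of step i, n is the number of steps, and steps are assumed independent.

```latex
P(\text{task succeeds}) \;=\; \prod_{i=1}^{n} p_i
\qquad\text{and, when every } p_i = p,\qquad
P(\text{task succeeds}) \;=\; p^{\,n}

0.95^{2} = 0.9025 \approx 90\%,
\qquad
0.95^{20} \approx 0.358 \approx 36\%,
\qquad
\text{while the average } \tfrac{1}{n}\sum_{i=1}^{n} p_i \text{ stays at } 0.95 .
```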

Why does a 95% step rate feel safe but fail in practice?

Because human intuition averages, and the system multiplies. A 95% reliability figure reads like an A grade, so teams ship it. But an agent does not get partial credit for getting nineteen of twenty steps right; one wrong step derails the task. The longer the workflow, the wider the gap between the comforting per-step number and the brutal end-to-end one. This is why agents demo beautifully on short happy paths and fall apart on real, long tasks: the demo had three steps, production has twenty, and the math turned against you somewhere in between.

[Figure: a curve showing end-to-end success collapsing as the number of steps grows, for 95% and 99% per-step reliability]

End-to-end success against number of steps. At 95% per step it falls off a cliff; even 99% sags. Length is the enemy.
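
The numbers behind a curve like this are easy to reproduce. The following is a minimal sketch that tabulates end-to-end success for the two per-step rates discussed in the text; the step counts and function names are chosen here for illustration, and steps are assumed to fail independently.

```python
def end_to_end_success(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, assuming independent steps."""
    return p_step**n_steps


def success_table(rates=(0.95, 0.99), step_counts=(1, 3, 5, 10, 20, 50, 100)):
    """Return one row per step count: (n, success at rates[0], success at rates[1], ...)."""
    return [(n, *(end_to_end_success(p, n) for p in rates)) for n in step_counts]


print(f"{'steps':>5}  {'95%/step':>9}  {'99%/step':>9}")
for n, at_95, at_99 in success_table():
    print(f"{n:>5}  {at_95:>9.1%}  {at_99:>9.1%}")
```

The three-step row sits near 86% at 95% per step and the twenty-step row near 36%, which is the demo-versus-production gap described earlier.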

Why does this hit agents specifically?

Because agents are long chains of probabilistic decisions. Every tool call, every plan step, every handoff is another gate that can fail, and agents string many of them together. A traditional program runs deterministic steps that either work every time or fail loudly; an agent runs steps that each work most of the time, which is exactly the regime where compounding bites hardest.

Worse, errors do not just drop a step; they propagate. Each step's output is appended to the context and conditions the next call, so the agent self-conditions on its own mistake, treating a hallucinated value as established fact and building on it. One bad step poisons every downstream decision, which is why the AgentErrorBench analysis names error propagation the single biggest barrier to dependable agents, and why chains often go off the rails by the third bad call. Architecture matters too: one scaling study found decentralized agent setups amplified errors roughly 17 times versus about 4 times for orchestrated ones.

How do you beat the math?

| Lever | Why it works | How |
| --- | --- | --- |
| Fewer steps | Less to multiply through | Decompose less; collapse steps; constrain scope |
| Higher per-step reliability | Each gate leaks less | Better tools, typed contracts, retries with backoff |
| Verification gates | Catch a failure before it compounds | Read-back and checks at each critical step |
| Error recovery | A failed step does not end the task | Detect, correct, and re-attempt instead of propagating |
| Checkpoints | Resume instead of restart | Persist state so one failure is local, not total |
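
The verification, retry, and recovery levers in the table can be combined in one small wrapper. This is a sketch under stated assumptions, not a reference implementation: `run_step_with_gate`, `StepFailed`, and `flaky_lookup` are hypothetical names, the step is assumed safe to re-run, and the `verify` callback stands in for whatever read-back or check fits the step.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class StepFailed(Exception):
    """Raised when a step never produces a verified result."""


def run_step_with_gate(
    action: Callable[[], T],
    verify: Callable[[T], bool],
    max_attempts: int = 3,
    base_delay: float = 0.0,
) -> T:
    """Run one step, check its output, and retry with exponential backoff."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            result = action()
            if verify(result):
                return result  # verified, so safe to hand to the next step
            last_error = ValueError(f"verification rejected {result!r}")
        except Exception as exc:  # deliberately broad for this sketch
            last_error = exc
        if attempt < max_attempts - 1:
            time.sleep(base_delay * 2**attempt)  # back off before re-attempting
    # Fail loudly instead of passing an unverified value downstream.
    raise StepFailed(f"gave up after {max_attempts} attempts") from last_error


# Hypothetical flaky step: returns a bad value twice, then a good one.
attempts = {"n": 0}


def flaky_lookup() -> int:
    attempts["n"] += 1
    return 42 if attempts["n"] >= 3 else -1


value = run_step_with_gate(flaky_lookup, verify=lambda v: v > 0)
print(value, attempts["n"])  # prints: 42 3
```

Raising `StepFailed` rather than returning the last unverified result is the design point: a bad value never enters the context that later steps condition on.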

The strategic move is to stop accepting the chain length as fixed. Every step you remove improves the odds geometrically, so the highest-leverage reliability work is often deleting steps, not perfecting them. After that, add verification at the steps that matter so a failure is caught before it multiplies, and build recovery so a single bad step is a local event rather than a total loss. Knowing which steps in your workflow are the weak links, and which are reliable enough to leave alone, is the pattern-level reliability VibeModel builds as the Pattern Intelligence Layer.
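
To compare the levers numerically, here is a rough sketch using the article's 95%-per-step, 20-step example. It assumes independent steps and, for the retry rows, a perfect verifier that always detects a failed step; real checks miss some failures, so those rows are upper bounds. The function names are illustrative.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """End-to-end success when every one of n_steps must succeed."""
    return p_step**n_steps


def with_retries(p_step: float, retries: int) -> float:
    """Effective per-step reliability with up to `retries` re-attempts.

    Assumption: a perfect check detects every failure and attempts are independent.
    """
    return 1 - (1 - p_step) ** (retries + 1)


scenarios = {
    "baseline: 20 steps at 95%": chain_success(0.95, 20),
    "cut to 10 steps": chain_success(0.95, 10),
    "20 steps, one verified retry each": chain_success(with_retries(0.95, 1), 20),
    "10 steps, one verified retry each": chain_success(with_retries(0.95, 1), 10),
}

for name, p in scenarios.items():
    print(f"{name:<36} {p:.1%}")
```

Under these assumptions the four rows come out near 36%, 60%, 95%, and 98%: halving the chain and catching failures at each gate both move the end-to-end number far more than a small per-step improvement would.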

Frequently asked questions

If each step is 95% reliable, why isn't the whole task 95%?
Because the task needs every step to work, and the probabilities multiply. 0.95 multiplied by itself 20 times is about 36%, not 95%. Reliability in series multiplies; it does not average.

Does a smarter model fix this?
Only partially. A better model raises the per-step rate, which helps, but the multiplication still bites on long tasks. The bigger wins usually come from fewer steps, verification, and error recovery, not just a stronger model.
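
One way to see why a stronger model only partly helps is to ask how long a chain each per-step rate can support. The sketch below inverts the multiplication for a target end-to-end success rate, assuming independent steps; the function name and the 90% target are chosen here for illustration.

```python
import math


def max_steps_for_target(p_step: float, target: float) -> int:
    """Largest n such that p_step ** n is still at least `target`."""
    if not (0 < p_step < 1) or not (0 < target < 1):
        raise ValueError("p_step and target must be strictly between 0 and 1")
    return math.floor(math.log(target) / math.log(p_step))


for p in (0.95, 0.99, 0.999):
    print(f"{p:.1%} per step -> at most {max_steps_for_target(p, 0.90)} steps for 90% end to end")
```

At 95% per step the budget is 2 steps, at 99% it is 10, and at 99.9% it is 105. Each extra "nine" buys roughly ten times the chain length, so long tasks still need fewer steps, verification, or recovery.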

Why do agents demo well but fail in production?
Demos are short, so the compounding has few steps to act on. Production tasks are long, so the same per-step reliability produces a far lower end-to-end success rate. The math turns against you as the chain grows.

What is the highest-leverage fix?
Cut steps. Because reliability multiplies, removing a step improves success geometrically. Then add verification at the critical steps so a failure is caught before it compounds through the rest.

