Retry with backoff
Description
Retry-with-backoff is the structural primitive of “if at first you don’t succeed, try again, but wait progressively longer between attempts.” The diagnostic shape: an operation fails transiently (network blip, momentary overload, 503 from a recovering service); the client retries; the wait between retries grows (typically exponentially); after a bounded number of attempts (or total time), the client gives up and surfaces the failure to its caller.
The structural payoff is two-fold. First, retries handle the common case where the failure was transient and a single retry succeeds. Second, backoff (increasing the wait between retries) prevents retry-storms: the failure mode where every client of a degraded service retries immediately, compounding the load just as the service is trying to recover. Without backoff, retry is poison. Without retries, every transient hiccup is a hard failure.
The most-cited refinement is jitter (random noise on the backoff interval) to prevent thundering-herd synchronization. AWS’s “Exponential Backoff and Jitter” blog post (Brooker 2015) is the canonical engineering treatment. The structural lineage goes back further (exponential backoff was invented for Ethernet CSMA/CD collision resolution; Metcalfe & Boggs 1976), but the pattern is one of the most-reinvented in distributed systems.
Triggers
- User-initiated: User describes transient failures, proposes adding retry logic, or asks about retry policies. Vocabulary cues: “retry,” “backoff,” “exponential backoff,” “jitter,” “max retries,” “transient failure,” “flaky.”
- Agent-initiated: Engine notices proposed retry logic without backoff (footgun), or proposed retry without idempotency check (different footgun). Candidate inference: “this needs retry-with-backoff; what’s the backoff schedule, the termination, and is the operation idempotent?”
- Situation-shape signals: Transient failures observed (or expected); network call to a downstream that can be momentarily degraded; thundering-herd risk; need to balance “give up too fast” vs “retry-storm the downstream into oblivion.”
Exclusions
- Non-idempotent operations without idempotency keys — retry-with-backoff would create duplicate side effects; need to add idempotency wrapper first.
- Permanent failures — retrying a 404 or 401 never succeeds; the retry is just delay. Distinguish transient (5xx, timeouts, connection-reset) from permanent (4xx other than 429) before retrying.
- Real-time / latency-bounded operations — if the caller has a hard deadline (10ms), even one retry blows the budget; fail-fast may be the right answer.
- No backoff possible — if every attempt is on the critical path (no time to wait), retries amplify the problem rather than helping; restructure the system instead.
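The schedule from the Description (exponential growth, jitter, bounded attempts) and the transient/permanent split from the Exclusions can be sketched together. This is a minimal illustration, not a reference implementation: `is_transient`, `backoff_delay`, `retry_with_backoff`, and the `base`/`cap` defaults are hypothetical names and values, the operation is assumed to return a `(status, body)` pair, and timeouts or connection resets (which would normally surface as exceptions) are left out for brevity. The delay uses the "full jitter" variant, one of the options discussed in the Brooker post.

```python
import random
import time


def is_transient(status: int) -> bool:
    # Assumption, following the Exclusions: 429 and 5xx are worth retrying,
    # every other status is treated as final.
    return status == 429 or 500 <= status <= 599


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0, rng=random) -> float:
    # Full jitter: a uniform draw between 0 and the capped exponential bound.
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def retry_with_backoff(operation, max_attempts: int = 5, base: float = 0.1,
                       cap: float = 10.0, sleep=time.sleep, rng=random):
    """Call operation() until it returns a non-transient status or attempts run out.

    Returns the last (status, body) pair so the caller can surface the failure.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = operation()
        if not is_transient(status):
            return status, body  # success, or a permanent failure not worth retrying
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt, base, cap, rng))  # wait only if another try follows
    return status, body
```

Passing `sleep` and `rng` as parameters is a design choice for testability: a caller can substitute a recorder for `time.sleep` and check the schedule without actually waiting.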
Structure
Relationships
- idempotency — pre-requisite; without it, retries are unsafe.
- cadence — backoff schedule is a cadence pattern.
- circuit-breaker — natural termination signal when downstream is observed broken.
- backpressure — backoff is client-side response to downstream backpressure signals (429, 503).
- rate-limiting — retry-with-backoff must respect rate limits; ignoring 429s is the canonical retry-storm cause.
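The idempotency prerequisite listed above can be illustrated with a small sketch. Everything here is hypothetical: `FakePaymentService` stands in for a downstream that deduplicates by key, and `charge_with_retry` is an invented helper. The point it tries to show is that the key is generated once per logical operation and reused on every attempt, so a retry after a lost response does not repeat the side effect. The backoff wait between attempts is omitted to keep the example short.

```python
import uuid


class FakePaymentService:
    """Hypothetical downstream that applies each idempotency key at most once."""

    def __init__(self, lost_responses: int = 1):
        self.charges = {}                     # idempotency key -> amount charged
        self.lost_responses = lost_responses  # how many replies to "lose"

    def charge(self, key: str, amount: int) -> int:
        # The side effect happens at most once per key.
        self.charges.setdefault(key, amount)
        if self.lost_responses > 0:
            self.lost_responses -= 1
            return 503  # the charge was applied, but the client cannot know that
        return 200


def charge_with_retry(service, amount: int, max_attempts: int = 3) -> bool:
    key = str(uuid.uuid4())  # generated once, then reused for every attempt
    for _ in range(max_attempts):
        if service.charge(key, amount) == 200:
            return True
    return False
```

If the key were regenerated inside the loop, the same scenario would record two charges, which is the duplicate-side-effect footgun named in the Exclusions.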
Examples
HTTP 429 / 503 with Retry-After · computer-science
Asking-again in social contexts · sociology
AWS SDK retry strategies · computer-science
Brooker, M., *Exponential Backoff and Jitter* (AWS Architecture Blog, 2015) — the canonical modern engineering reference · computer-science
CSMA/CD Ethernet collision resolution (Metcalfe & Boggs 1976); BEB (Binary Exponential Backoff); networking literature · computer-science
CSMA/CD Ethernet collision resolution · computer-science
Database deadlock retry · computer-science
Evolutionary mutation-rate adjustment · biology
Kleppmann (2017), *Designing Data-Intensive Applications*, Chapters 8 and 11 — retry-with-backoff as a foundational distributed-systems primitive. · computer-science
Metcalfe & Boggs, *Ethernet: Distributed Packet Switching for Local Computer Networks* (CACM 1976) — the original BEB pa · computer-science
Michael T. Nygard, *Release It! Design and Deploy Production-Ready Software* (Pragmatic Bookshelf, 2007). · computer-science
PhD applications / job applications · sociology
RFC 6585 (HTTP 429) and RFC 7231 (Retry-After) — the web-protocol-level specifications. · computer-science
The Retry-After response header lets a server tell the client, on a 429 or 503 response, when it may safely retry, either as a delta-seconds integer or as an HTTP-date. Together these specifications give retry-with-backoff a standard protocol-level interface at the web layer: the server explicitly communicates “wait this long before trying again,” and a well-behaved client honors the value as a floor for its own backoff schedule. Ignoring Retry-After and continuing aggressive retry is the canonical retry-storm cause and a frequent source of escalations in incident postmortems. The pattern is implemented in virtually every modern HTTP client library and in the SDKs of major web platforms (AWS, Google Cloud, Stripe, GitHub).
Synaptic plasticity · biology
TCP retransmission · computer-science
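As a sketch of the Retry-After handling described in the RFC 6585 / RFC 7231 entry above: the header value may be either a delta-seconds integer or an HTTP-date, and the client treats the server's wait as a floor under its own backoff. The function names `parse_retry_after` and `next_delay` are hypothetical, and falling back to the client's own backoff when the value cannot be parsed is an assumption of this example, not something the specifications mandate.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: str, now: Optional[datetime] = None) -> Optional[float]:
    """Return the number of seconds to wait, or None if the value is unparseable."""
    value = value.strip()
    if value.isascii() and value.isdigit():
        return float(value)  # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # treat a naive result as UTC
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())  # a date in the past means no wait


def next_delay(backoff: float, retry_after: Optional[str],
               now: Optional[datetime] = None) -> float:
    """Use the server's Retry-After as a floor for the client's own backoff."""
    if retry_after is None:
        return backoff
    server_wait = parse_retry_after(retry_after, now)
    return backoff if server_wait is None else max(backoff, server_wait)
```

Taking the maximum rather than replacing the backoff outright means a short Retry-After never shortens a wait the client had already grown to.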