computer-science family-and-consumer-science law physics transportation

Saga

Description

A saga is a long-running transaction broken into a sequence of local sub-transactions, each independently committable and each paired with an explicit compensating action. If sub-step N fails, the saga runs the compensations for steps N-1, N-2, …, 1 in reverse order, semantically rolling back the partially applied work. The structural payoff is “atomic-feeling” multi-step operations across systems that can’t share a transaction boundary: microservices, multi-vendor bookings, multi-organization workflows.

The diagnostic shape: a multi-step operation crosses transaction boundaries (multiple services, multiple databases, multiple external APIs), and you need “all-or-nothing” semantics but can’t get them from a single transaction. The saga replaces atomic-by-isolation with atomic-by-compensation. Compensations are not always exact inverses (you can refund a charge, but you can’t un-send an email), so the property is semantic reversal, not literal reversal.

Two flavors: orchestration-based (a central coordinator drives the saga; explicit and debuggable, but the coordinator is a single point of failure) and choreography-based (each step publishes events and the next step listens; no central coordinator and no single point of failure, but harder to reason about end-to-end).
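The reverse-order compensation rule described above can be sketched as a small in-process orchestrator. This is a minimal illustration, not a production design: the names `Step` and `run_saga` are hypothetical, the steps are plain Python callables standing in for calls to other services, and the sketch assumes compensations themselves succeed and that no durable saga log is needed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # forward local transaction
    compensate: Callable[[], None]  # semantic undo for that transaction


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step committed nothing, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True


log: List[str] = []


def fail() -> None:
    raise RuntimeError("shipping unavailable")


steps = [
    Step("reserve", lambda: log.append("reserve"), lambda: log.append("release")),
    Step("charge", lambda: log.append("charge"), lambda: log.append("refund")),
    Step("ship", fail, lambda: log.append("cancel-shipment")),
]

ok = run_saga(steps)
print(ok, log)  # False ['reserve', 'charge', 'refund', 'release']
```

The third step fails, so the two committed steps are compensated newest-first. A real orchestrator would also persist `completed` so that a crashed coordinator can resume the rollback.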

Triggers

  • User-initiated: the user describes a multi-step operation across systems where atomicity is needed but 2PC isn’t available or desirable. Vocabulary cues: “saga,” “distributed transaction,” “long-running transaction,” “compensating transaction,” “rollback,” “workflow,” “orchestration.”
  • Agent-initiated: the engine notices a multi-step operation crossing service boundaries with no obvious rollback story for partial failures. Candidate inference: “this is a saga. What are the forward steps, what are the compensations for each, and who’s the coordinator?”
  • Situation-shape signals: a multi-step operation across services, databases, or vendors; partial failure is observable to users; all-or-nothing semantics are needed but no single transaction boundary is available; 2PC is operationally untenable (microservices, external APIs).

Exclusions

  • Single-database operations — local ACID transactions are simpler and stronger; don’t introduce saga complexity if you don’t have to.
  • No compensation possible — if a forward step is structurally irreversible (sent email, fired missile, public press release), the saga’s compensation story breaks. Use forward-only with explicit acceptance of partial-completion.
  • Cross-step dependencies that violate compensation order — if compensating step 2 requires step 3’s state, the reverse-order rollback doesn’t work. Saga assumes per-step compensations are independent.
  • High-frequency-low-latency operations — saga’s compensation machinery has real overhead; for high-throughput operations a different model (idempotent forward-only with eventual consistency) may dominate.

Structure

Saga = a sequence of forward steps + a sequence of compensating actions + a coordinator (orchestrator or event bus) + bookends pairing each forward step with its compensation. The bookends discipline is the load-bearing structure: a forward step without a compensation is a loose end that the saga can’t unwind.

Relationships

  • bookends — every forward step needs its compensation bookend.
  • graceful-degradation — sagas are degradation applied to multi-step transactions.
  • event-sourcing — choreography-based sagas need an event log to drive the choreography.
  • idempotency — both forward steps and compensations must be safely retryable.
  • load-bearing — the compensation chain is load-bearing for the saga’s correctness; missing compensations make the saga lossy.
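To make the event-driven relationship above concrete, here is a rough sketch of a choreography-based saga. Everything in it is an assumption for illustration: `EventBus` is an in-memory stand-in for a real message broker, its `history` list stands in for an event log, and the topic names and handlers (`charge_card`, `reserve_stock`, `refund_card`) are hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class EventBus:
    """In-memory stand-in for a message broker (an assumption of this sketch)."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []  # append-only record of published topics

    def subscribe(self, topic: str, handler: Handler) -> None:
        self.handlers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        self.history.append(topic)
        for handler in self.handlers[topic]:
            handler(event)


bus = EventBus()
stock = {"widget": 0}  # hypothetical inventory: nothing left to reserve
charges: dict = {}


def charge_card(event: dict) -> None:
    charges[event["order_id"]] = event["amount"]
    bus.publish("payment.charged", event)


def reserve_stock(event: dict) -> None:
    if stock[event["sku"]] <= 0:
        bus.publish("inventory.failed", event)
    else:
        stock[event["sku"]] -= 1
        bus.publish("inventory.reserved", event)


def refund_card(event: dict) -> None:
    # Compensation for charge_card, triggered by a downstream failure event.
    charges.pop(event["order_id"], None)
    bus.publish("payment.refunded", event)


bus.subscribe("order.created", charge_card)
bus.subscribe("payment.charged", reserve_stock)
bus.subscribe("inventory.failed", refund_card)

bus.publish("order.created", {"order_id": "o1", "sku": "widget", "amount": 25})
print(bus.history)
```

No component knows the whole sequence: the payment handler simply listens for the inventory failure and compensates. That is also why the end-to-end flow is harder to read off than in an orchestrated saga.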

Examples

Microservices order processing · computer-science

Order Service → Payment Service → Inventory Service → Shipping Service; each service has a compensation; the orchestrator drives the saga.

Travel booking workflows · transportation

book flight, then hotel, then car; if car booking fails, cancel hotel + cancel flight (within cancellation policies).
AWS Step Functions and Temporal are two production workflow platforms whose documentation treats the saga pattern explicitly. Each defines a long-running workflow as an ordered sequence of steps, lets a compensating action be attached to a step’s failure path, and provides the operational machinery (durable state, retries, timeouts, execution history) that makes orchestration-based sagas practical at scale. The platforms differ in execution model: Step Functions runs state machines defined in JSON, while Temporal runs durably executed code written in conventional programming languages. In both, the saga shape is expressed with platform features rather than hand-rolled on top of message queues.

The structural lesson is that the saga pattern, once it appears repeatedly in application code, attracts platform-level support. The infrastructure complexity required to make sagas reliable (durable execution, replay-safe code, idempotent compensations) is high enough that hand-rolled implementations rarely get it right; centralizing the machinery into a workflow platform is the dominant move.

Inference: When an application starts accumulating multi-step distributed operations with explicit rollback paths, the question to ask is no longer “how do we implement this saga?” but “which workflow platform fits our deployment model?” Hand-rolled sagas built on raw message queues are an anti-pattern in mature systems; the platform layer absorbs the operational complexity the pattern requires.
deploy to staging, run tests, deploy to canary, promote to prod; each step has a rollback action.
each step (search warrant, evidence collection, lab analysis) has documented reversal protocols; chain-of-custody is the saga log.
reserve inventory, charge card, allocate shipping; if shipping allocation fails, refund card + release inventory.
Hector Garcia-Molina and Kenneth Salem’s “Sagas” paper (SIGMOD 1987) is the original work introducing the saga concept to the database literature. The paper addressed long-lived transactions, those that run so long that holding locks for their whole duration would stall other transactions, by decomposing them into a sequence of shorter, individually committable sub-transactions, each paired with an explicit compensating transaction.

The structural move is the replacement of atomic-by-isolation (the classical transaction model) with atomic-by-compensation: instead of holding locks until the whole long operation is done, the system commits incrementally and runs per-step compensations in reverse on failure. The paper is the canonical citation for the pattern now widely deployed in microservices architectures, multi-service workflow engines (Temporal, AWS Step Functions), and any system where the all-or-nothing semantics of an operation must be maintained across components that cannot share a transaction boundary.
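The guarantee can be written compactly. Using the usual notation (an assumption here, not quoted from this entry), let the forward sub-transactions be T_1, …, T_n and let C_i be the compensation for T_i. A saga then ends in one of two execution sequences:

```latex
\[
T_1,\, T_2,\, \ldots,\, T_n
\qquad\text{or}\qquad
T_1,\, T_2,\, \ldots,\, T_j,\; C_j,\, \ldots,\, C_2,\, C_1
\quad \text{for some } 0 \le j < n .
\]
```

The first sequence is the successful run. In the second, T_{j+1} is the step that failed, so only the j committed steps are compensated, newest first. The last step never needs a compensation, because once T_n commits the saga is complete.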
Helland’s CIDR 2007 paper is the architectural argument for why large-scale systems must use the saga shape rather than distributed transactions. Coming from a career building transaction managers, his “apostasy” is the admission that two-phase commit cannot span the machines of an almost-infinitely-scaled application: you can’t guarantee that any two pieces of data live on the same node, so atomic transactions can be guaranteed only within a single entity (a keyed unit of data). Across entities there is no global rollback. Helland’s prescription is therefore entity-based partitioning, asynchronous at-least-once messaging between entities (which forces recipients to be idempotent), and per-entity workflow state (“activities”) that tracks the status of multi-entity operations.

These are the saga’s roles, justified from first principles. The forward steps are local transactions inside individual entities, each independently committed; there is no coordinator holding a global lock. When a later step violates a business rule, you cannot undo the already-committed earlier steps, so rollback gives way to compensation (what Helland’s writing memorably calls “sending an apology”) and to reconciliation that drives the distributed state toward eventual completion. The paper is the bridge from Garcia-Molina and Salem’s 1987 single-database saga to the distributed-cloud reality: the same compensate-instead-of-rollback structure, now mandatory because the alternative (2PC at scale) is infeasible.

Inference: When a workflow must span partitioned data, do not reach for a distributed transaction; Helland’s argument says it will not scale and you will end up bypassing it anyway. Model each step as a local commit within an entity, make the handling of every inter-step message idempotent, and define a compensating action (“apology”) for each forward step. The price of dropping 2PC is that you must design the undo path explicitly, which is precisely what the saga pattern is.
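The idempotent-recipient requirement can be illustrated with a small sketch. It assumes each message carries a unique ID and that the entity records applied IDs in the same local commit as the state change. The class `AccountEntity` and the message IDs are hypothetical names for illustration, not taken from Helland’s paper.

```python
from typing import Set


class AccountEntity:
    """One entity: balance and dedup record change together in one local commit."""

    def __init__(self, balance: int) -> None:
        self.balance = balance
        self.seen: Set[str] = set()  # message IDs already applied

    def handle(self, message_id: str, delta: int) -> bool:
        """Apply the message once; return False for a redelivered duplicate."""
        if message_id in self.seen:
            return False
        self.balance += delta
        self.seen.add(message_id)
        return True


account = AccountEntity(balance=100)
account.handle("charge-42", -30)
account.handle("charge-42", -30)            # redelivery of the same message: ignored
account.handle("refund-for-charge-42", 30)  # the compensation is its own message
account.handle("refund-for-charge-42", 30)  # redelivered compensation: ignored
print(account.balance)  # 100
```

Because both the forward message and its compensation are deduplicated, at-least-once delivery cannot double-charge or double-refund. In a real system the `seen` set would live in the entity’s own store and would eventually need pruning.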
Kleppmann’s Designing Data-Intensive Applications (2017) discusses sagas in Chapter 9 (consistency and consensus) as part of the modern engineering framing of distributed transactions. The chapter situates the saga pattern in the broader landscape, alongside two-phase commit and consensus protocols like Paxos and Raft, and articulates the tradeoffs: two-phase commit gives atomicity at the cost of blocking and coordinator dependence, while sagas give liveness and decoupling at the cost of weaker isolation and the need to design compensating actions.

The book’s role in the catalog is to ground the saga concept in standard distributed-systems vocabulary, so that the catalog’s structural framing composes cleanly with how practitioners already think about the pattern: as one of a small set of options for cross-service transactions.
set up apparatus, calibrate, run, collect data; each setup step has a teardown step for clean failure.
Chris Richardson’s Microservices Patterns is the canonical treatment of how to build transactional systems when a single ACID boundary is no longer available. Once a logical operation crosses multiple services with private databases, two-phase commit becomes operationally untenable: it blocks, it makes the coordinator a single point of failure, and it is incompatible with cloud-native autonomy. Richardson presents the saga pattern as the principled alternative: decompose the long-running transaction into a sequence of local transactions, each independently committable in its own service, with explicit compensating transactions defined for each forward step. If step 4 fails, the saga runs the compensations for steps 3, 2, 1 in reverse order, semantically rolling back the partially applied work without ever holding distributed locks. The book also distinguishes the two coordination styles, orchestration (a central saga-orchestrator service drives the sequence) versus choreography (each service publishes domain events; the next service listens), and lays out the tradeoffs (orchestration is explicit and debuggable but reintroduces a coordinator; choreography is fully decentralized but harder to reason about end-to-end).

Inference: When a multi-step operation crosses service boundaries and partial failure is observable to users, the diagnostic move is to ask “what is the compensating action for each forward step, and is it reachable?” If any forward step has no semantic inverse (a sent email, a fired missile, a public press release), the saga’s correctness story breaks at that step and the design must either move the irreversible step to the end, accept partial-completion as a normal terminal state, or front the operation with a confirmation gate.

The compensations being semantic (not literal) inverses is also load-bearing: you can refund a charge but not un-charge it, and you can issue a public correction but not unprint a press release. The saga’s “atomic-feeling” property degrades to “atomic-enough” wherever the inverse is approximate.
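The “move the irreversible step to the end” check from the inference above can be run at design time. This is one possible sketch: the `Step` type and the function `uncompensable_before_end` are hypothetical names, and a missing compensation (`None`) is assumed to mark a step with no semantic inverse.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    action: Callable[[], None]
    compensate: Optional[Callable[[], None]] = None  # None marks an irreversible step


def uncompensable_before_end(steps: List[Step]) -> List[str]:
    """Names of steps that lack a compensation yet are followed by later steps."""
    return [s.name for s in steps[:-1] if s.compensate is None]


def noop() -> None:
    pass


bad = [Step("send_email", noop), Step("charge_card", noop, noop)]
good = [Step("charge_card", noop, noop), Step("send_email", noop)]

print(uncompensable_before_end(bad))   # ['send_email']
print(uncompensable_before_end(good))  # []
```

A non-empty result flags a saga definition whose rollback would get stuck: if a later step fails, the flagged step cannot be unwound. The final step is exempt because nothing after it can fail.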
deposit-cascade with cancellation deadlines and forfeit policies; each vendor has a documented compensation schedule.
modern engineering instantiation lineage — the saga pattern has been re-discovered as the dominant transaction model in microservices architectures where 2PC is operationally untenable