Skip to main content
computer-science

Race condition

Description

Two parallel paths interact through a shared resource without coordinating their order, so the observable outcome depends on relative timing rather than program logic. The classic example: thread A reads a counter, computes counter+1, writes it back; thread B does the same. If both reads happen before either write, one increment is lost. The bug is not in either path’s code; it’s in the assumption that the two could be sequenced safely without explicit coordination. The structural shape — two paths + shared container — is the right diagnostic. Many concurrency bugs that look like “weird timing” are races; a non-trivial fraction of distributed-system bugs are races writ large (multi-node, eventually-consistent state). The fix surface is well-developed: synchronization primitives (mutexes, atomics), ordering protocols (consensus, vector clocks), or structural prevention (immutability, single-owner state).

Triggers

User-initiated: User describes “weird timing bugs,” “flaky tests,” “only reproduces under load,” “race condition.” Vocabulary cues: “race,” “concurrent,” “lock,” “atomicity,” “deadlock,” “thread safety,” “interleaving.” Agent-initiated: Agent notices that two paths touch the same state without explicit ordering, OR that observed behavior depends on which of two operations completes first. Candidate inference: “what’s the shared container; what ordering assumption is implicit?” Situation-shape signals: Bugs that don’t reproduce deterministically. “Works in dev, fails in prod.” Test failures with hand-wavy “rerun and they pass.” Performance increases that cause new bugs to surface.

Exclusions

  • Single-threaded systems — no concurrency, no race. The concept doesn’t fire even if code has shared mutable state.
  • Commutative operations on shared state — if both paths perform operations whose outcome doesn’t depend on order (e.g., set union, max), the race exists but is not observable as a bug.
  • Eventually-consistent semantics by design — some systems explicitly accept races as part of their consistency model (CRDTs, gossip protocols); naming “race” then is a label for behavior the design chose, not a bug.
  • Sequential causality — operations that have happened-before relationships established by the program don’t race; the ordering is enforced by causation, not just timing.

Structure

Internal structure of race-condition: a table of its component slots and the concepts that fill them.

Relationships

Relationship neighborhood of race-condition: a graph of the concepts it connects to and the concepts it is a part of.
  • seam — races live at concurrency-context seams; the boundary between assumed-atomic and actually-atomic.
  • container — the shared container is one of race-condition’s constitutive primitives.
  • make-wrong-unrepresentable — the strongest race fix is structural absence of the dangerous interleaving (immutable data, single-writer rules, transactional boundaries).
  • asymmetric-gate — synchronization primitives are asymmetric gates: cheap to acquire-then-release in the common case, expensive when contention is high.
  • load-bearing — diagnostic: is the race load-bearing (must be fixed) or benign (rare + low-stakes)?

Examples

Lost-update counters · computer-science

the canonical concurrent-increment bug; resolved with atomic operations.

Check-then-act (TOCTOU) · computer-science

“file exists” check followed by “open file” can race with another process deleting it; the time-of-check-vs-time-of-use gap.
two user actions trigger async updates to the same store; whichever response returns later “wins.”
populated-cache-then-update-source: readers between the events see stale data; not a bug per se but indistinguishable structurally.
Coffman, Elphick, and Shoshani’s 1971 Computing Surveys paper “System Deadlocks” gave the canonical characterization of deadlock as the simultaneous holding of four conditions: mutual exclusion (resources held in exclusive mode), hold-and-wait (a process holds a resource while waiting for another), no preemption (a held resource cannot be forcibly taken), and circular wait (a cycle in the wait-for graph). Deadlock occurs when and only when all four conditions are present; removing any one condition prevents the failure mode.The structural insight is that the failure is not a property of any single thread’s behavior but of the combined state of concurrent threads and the resources they contend for. A correctly-coded thread in isolation can participate in a deadlock — the bug emerges from the topology of contention, not from any local mistake. This is the same diagnostic shape as race conditions more broadly: the failure is at the seam between independent paths, not inside either path.Inference: To analyse or eliminate concurrency bugs, ask which of the four conditions can be structurally prevented in the system at hand. Lock-free designs attack mutual exclusion; resource-ordering protocols attack circular wait; timeouts and preemption attack no-preemption; two-phase locking with all-up-front acquisition attacks hold-and-wait. The condition framework converts “fix the race” into a precise question of which structural property of the contention topology to break.
race-adjacent: both paths hold a resource the other needs and won’t release; ordering matters but neither path can proceed.
Leslie Lamport’s 1978 Communications of the ACM paper observed that in a distributed system without a shared clock, two events on different nodes cannot be ordered just by reading wall-clock timestamps — relativistic intuition applies: if event A on one node cannot have caused event B on another (no message passed between them in the available time), then their order is not just unknown, it is not a fact about the system. Lamport defined the happens-before relation: A happens-before B if A is earlier on the same node, or if A is the send of a message that B receives, or by transitive closure of these. Events that are not happens-before-ordered are concurrent — and any algorithm that depends on their order has a race condition. The paper then introduced logical clocks that respect the happens-before relation, and showed they suffice for state-machine replication, the foundation of essentially every modern consensus protocol.Inference: Single-machine race-condition intuition (two threads contending for a counter) generalizes precisely to distributed systems, but only once “shared timing” is replaced by “happens-before-causality.” The diagnostic question for any multi-node operation is: is the ordering being relied on actually established by a happens-before chain, or is it being inferred from wall-clock timestamps that the system has no right to trust? The latter is a distributed race. The remedy is the same in spirit as for single-machine races — make the ordering structurally enforced rather than coincidental — but the mechanism is different: logical clocks, vector clocks, consensus, or causal-consistency protocols, rather than mutexes.
Race condition is a foundational primitive in concurrency, with a deep literature in both operating-systems and distributed-systems. Coffman, Elphick, and Shoshani’s 1971 paper on system deadlocks formalized the conditions under which concurrent processes contending for shared resources can stall each other indefinitely (mutual exclusion, hold-and-wait, no preemption, circular wait) — a structural taxonomy of one race-condition failure mode. Lamport’s 1978 paper “Time, Clocks, and the Ordering of Events in a Distributed System” lifted the same ordering problem from single-machine concurrency to distributed settings, introducing the happens-before relation as the diagnostic for when two events genuinely race versus when causation has already sequenced them.The conceptual machinery built on this lineage — TOCTOU (time-of-check to time-of-use) bugs, atomicity, serializability, linearizability, memory models — is now field-standard across systems engineering. The persistence of the same structural shape across mechanisms (mutexes, atomics, transactions, vector clocks, consensus) is what makes race condition portable as a primitive: the contended container changes, the coordination primitive changes, but “two paths converge on shared state without coordination” is the diagnostic that recurs.
The two standard operating-systems textbooks give the race condition its canonical undergraduate definition. Silberschatz’s Operating System Concepts defines a race condition as a situation where several processes access and manipulate shared data concurrently and the outcome depends on the particular order in which the accesses interleave; it motivates the idea with a shared counter variable whose machine-level load/increment/store steps can interleave between processes so that an increment is silently lost. Tanenbaum’s Modern Operating Systems gives the same shape through the print-spooler example: two processes each read the next free slot in the spooler directory, both see the same slot, and both write — so one job’s filename overwrites the other’s. In both texts the diagnosis is identical: the failure is not in either process’s code but in the unguarded interleaving of two paths through shared state.The textbooks then promote the diagnosis into the critical-section problem: the segment of each process that touches the shared resource is its critical section, and a correct solution must enforce mutual exclusion (only one process in its critical section at a time), progress, and bounded waiting. This is the standard pedagogical move from “here is a timing bug” to “here is the structural property that prevents the whole class of timing bugs.”Inference: The two-paths-plus-shared-container shape is exactly the textbook framing, and the textbook fix surface is the one to reach for: identify the critical section (the span where both paths touch the contended resource), then enforce mutual exclusion over it — via mutex, semaphore, monitor, or, better, by restructuring so the contended state has a single owner and the dangerous interleaving cannot be represented at all.
multi-node race where one node observes pre-update state while another already has post-update.