computer-science linguistics performing-arts psychology

Chunking

Description

Chunking is the active grouping of items into larger units to compress representation. Miller’s classic finding, that human working memory holds roughly 7±2 chunks, established that the chunk, not the atomic item, is the cognitive unit of capacity. A skilled chess player remembers board states as chunks (familiar configurations) rather than as up to 32 individual piece positions; a skilled reader processes words rather than individual letters. The concept generalizes far beyond cognition: any system that pays a fixed per-unit overhead can win by chunking. Examples include networking (packet headers amortize over the payload), databases (batch inserts amortize transaction cost), operating systems (paged memory amortizes mapping and bookkeeping cost across all the bytes in a page), and ML training (mini-batch gradients amortize per-step overhead). The chunking criterion (what goes together) is the design choice; chunk size is the dial. Chunking is distinct from grain: grain is the chosen level of resolution (the “we’re operating at word-level, not letter-level” decision), while chunking is the active move that produces operation at that level. Grain is the property; chunking is the action.

Triggers

  • User-initiated: User describes grouping items together for efficiency, or asks about batch size / granularity tradeoffs. Vocabulary cues: “chunk,” “batch,” “group,” “7±2,” “working memory,” “page,” “packet,” “batch size.”
  • Agent-initiated: Agent notices a per-unit overhead cost in a system processing many small items, and considers whether chunking would amortize the overhead. Candidate inference: “what’s the per-unit cost; would chunking N items together amortize it; what’s the right chunk size?”
  • Situation-shape signals: per-unit overhead costs that scale linearly with item count; working-memory or cache-capacity constraints; streaming vs. batching design questions; pagination, packetization, and batch-size decisions.
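The candidate inference (per-unit cost, amortization over N items, chunk size) can be sketched as a small cost model. This is a minimal illustration, assuming a fixed overhead paid once per chunk plus a constant marginal cost per item; the function name and the numbers are hypothetical, not taken from any particular system.

```python
def per_item_cost(fixed_overhead: float, marginal_cost: float, chunk_size: int) -> float:
    """Average cost per item when chunk_size items share one fixed overhead."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return fixed_overhead / chunk_size + marginal_cost


# Hypothetical numbers: 10 cost units of overhead per step, 1 unit per item.
for n in (1, 10, 100, 1000):
    print(n, per_item_cost(10.0, 1.0, n))
```

Under this model the per-item cost approaches the marginal cost as the chunk grows, so each further increase in chunk size buys less. When the fixed overhead is zero, the result does not depend on chunk size at all, which matches the first exclusion.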

Exclusions

  • Per-item processing has zero overhead — when there’s no fixed cost per processing step, chunking doesn’t amortize anything and adds boundary-handling complexity for no gain.
  • Items must be handled with low latency individually — real-time control, high-frequency trading, single-key-press feedback; chunking adds latency that violates the requirement.
  • Chunk boundaries cut natural units awkwardly — chunking by arbitrary rule rather than meaningful criterion produces fragmentation costs that exceed amortization gains.
  • Chunk size already at optimum — over-chunking past the diminishing-returns point (e.g., batch size already at memory capacity) doesn’t help and can hurt.

Structure

Internal structure of chunking: a table of its component slots and the concepts that fill them.

Relationships

Relationship neighborhood of chunking: a graph of the concepts it connects to and the concepts it is a part of.
  • grain — chunking produces a grain-shift; the chunk size is the new grain. Co-occurs in any discussion of “what level are we operating at.”
  • uniformity-dividend — uniform chunk size enables downstream uniformity; chunked-but-variable-size loses the dividend. The MTU / page-size / batch-size standardization is the uniformity-dividend lens on chunking.
  • flow — chunking discretizes a flow; chunked flows have different throughput, latency, and back-pressure characteristics than continuous flows.
  • cadence — chunking interacts with cadence; the chunk-arrival rhythm is a temporal shape that downstream processes must accommodate.
  • seam — chunk boundaries are seams; the question “what goes together vs. apart” is the seam-placement question.

Examples

Phone-number memorization · psychology

555-867-5309 is three chunks (area code, prefix, line number); the dashes mark the chunk boundaries, and grouping by them is what lets the 10 digits fit in working memory.

Network packet structure · computer-science

A payload is chunked into MTU-sized packets; the packet is the transmission unit, and the per-packet header cost is amortized over the payload it carries.
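The header amortization can be made concrete with a short sketch. It assumes a simplified model in which every packet carries one fixed-size header and as much payload as fits under the MTU; `packetize` and `header_overhead` are illustrative names, and the 1500-byte MTU and 40-byte header (IPv4 plus TCP without options) are example values.

```python
def packetize(payload: bytes, mtu: int, header_size: int) -> list[bytes]:
    """Split payload into chunks so that header + chunk fits within mtu."""
    room = mtu - header_size
    if room <= 0:
        raise ValueError("mtu must be larger than header_size")
    return [payload[i:i + room] for i in range(0, len(payload), room)]


def header_overhead(payload_len: int, mtu: int, header_size: int) -> float:
    """Fraction of transmitted bytes that are headers rather than payload."""
    if payload_len == 0:
        return 0.0
    room = mtu - header_size
    if room <= 0:
        raise ValueError("mtu must be larger than header_size")
    packets = -(-payload_len // room)  # ceiling division
    header_bytes = packets * header_size
    return header_bytes / (header_bytes + payload_len)


chunks = packetize(b"x" * 14600, mtu=1500, header_size=40)
print(len(chunks), header_overhead(14600, 1500, 40))
```

In this model a smaller MTU means more packets for the same payload, so a larger share of the transmitted bytes is header.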
Chase and Simon’s 1973 chess-perception studies are the classic empirical demonstration that chunking, not raw memory capacity, underlies expertise. They showed master-level chess players board positions for five seconds, then asked them to reconstruct the position from memory. On real positions drawn from actual games, masters reconstructed most of the roughly 24 pieces after a single five-second exposure, far beyond what the 7±2 limit from Miller’s classic working-memory work would predict for individual items. On random positions (pieces placed at random on the board, preserving only the piece count and color distribution), the same masters performed barely better than novices, reconstructing only a few pieces.

The differential is the diagnostic. Masters were not seeing 24 individual pieces on real positions; they were seeing 4-7 chunks (a castled kingside formation, a typical Sicilian Najdorf pawn structure, an isolated queen’s pawn position with characteristic minor-piece placements), each chunk addressable as a single unit. On random positions, the pattern-matching machinery had nothing to attach to, and the master’s effective capacity collapsed back to the raw-item limit. The study established that chess expertise is the acquisition of a large vocabulary of recognizable chunks (modern estimates: 50,000-100,000 for grandmaster-level play) rather than greater raw working-memory capacity.

Inference: The pattern exports broadly. Reading speed gains in adults reflect chunking from letter-by-letter to word-by-word to phrase-by-phrase recognition; programmer expertise reflects chunking from token-by-token reading to idiom recognition; clinical-diagnostic expertise reflects chunking from symptom-by-symptom reasoning to syndrome recognition. In every case, the apparent “more capacity” is actually a richer vocabulary of chunks. The implication for skill acquisition is that deliberate practice should target chunk-vocabulary growth, not generic working-memory training.
Chess board states are chunked into familiar formations rather than 32 individual piece positions; this is the classic Chase & Simon (1973) result.
Nelson Cowan’s 2001 Behavioral and Brain Sciences paper revised Miller’s classic 7±2 working-memory capacity estimate downward to roughly 4 chunks, based on tasks that controlled for rehearsal, grouping, and long-term-memory contributions. Miller’s original 1956 estimate included sub-vocal rehearsal effects that allowed participants to refresh items before they decayed; Cowan’s “focus of attention” methodology used tasks (running-memory-span procedures, irrelevant-speech paradigms, the change-detection task introduced for working-memory measurement) that prevented these strategies and produced a more conservative capacity estimate.

Cowan’s contribution to the chunking primitive is the tighter bound plus the conceptual reframing: working memory is not a passive store of size N but the limited focus of attention at any moment, and what fits in that focus is a small number of currently active chunks. The chunk is the unit, and the chunking criterion (how items are grouped into chunks) is what determines whether a task fits within the bound. Tasks involving 7-9 raw items are not impossible; they require deeper chunking, and chunking takes time, attention, and prior structure to construct.

Inference: The downward revision sharpens design implications wherever working-memory load matters: UI design (number of concurrent options), pedagogy (number of concepts introduced in one lesson), API design (number of arguments before a function becomes cognitively expensive), and code review (number of changes a reviewer can hold in mind). The conservative 4-chunk bound is the diagnostic threshold; designs that require more chunks in active attention should either reduce the count or provide explicit external scaffolding (notes, diagrams, persistent UI state) to offload some of the load.
Batching N inserts into one transaction amortizes the commit cost across N rows; chunk size trades latency against per-row cost.
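A minimal sketch of batched inserts, using Python’s standard-library `sqlite3` with an in-memory database. The `items` table, the `batched` helper, and `insert_in_batches` are hypothetical names chosen for illustration; the point is only that one commit covers a whole chunk of rows.

```python
import sqlite3
from itertools import islice


def batched(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk


def insert_in_batches(conn, rows, batch_size):
    """Insert rows, committing once per batch rather than once per row."""
    commits = 0
    for chunk in batched(rows, batch_size):
        with conn:  # one transaction per chunk; commits if no exception is raised
            conn.executemany("INSERT INTO items (id, name) VALUES (?, ?)", chunk)
        commits += 1
    return commits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
rows = [(i, f"item-{i}") for i in range(1000)]
commits = insert_in_batches(conn, rows, batch_size=100)
print(commits)
```

A larger `batch_size` means fewer commits, but the first row of each batch waits longer before it becomes durable, which is the latency side of the tradeoff.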
Phonemes group into syllables, syllables into words, and words into phrases; each level is a chunking step that lets working memory hold more meaningful content at higher levels.
Miller’s 1956 paper in Psychological Review synthesized evidence from absolute-judgment tasks (how many discriminable values can a participant identify along a single sensory dimension) and immediate-memory-span tasks (how many items can be reliably recalled after one presentation), and noticed that the two converged on the same 7±2 range despite measuring superficially different things. The cognitive system gets around this ceiling by chunking: a sequence of low-information items (individual digits, individual letters) is reorganized into a smaller number of higher-information units (a familiar phone number, a known word), each of which counts as a single item against the seven-item ceiling. The classic illustration: as nine separate letters, the sequence “B M W A B C I B M” sits at the very top of the 7±2 range and is hard to recall; the same letters regrouped as “BMW ABC IBM” fit comfortably as three chunks because each group is a familiar pattern in long-term memory that can be addressed as a single unit.

This is the canonical articulation of chunking as a cognitive primitive: the ceiling is a property of the channel (working memory), and chunking is the structural maneuver that fits more content into the same number of slots. A novice memorizing 12 random digits exceeds capacity and fails; an expert recognizing the same digits as four known three-digit telephone area codes uses four slots and succeeds. Later refinements (Cowan’s 2001 argument for a tighter bound of ~4) sharpened the number but preserved the structural shape Miller named: working memory operates on chunks, not on raw items, and what counts as a chunk depends on what the perceiver already knows.

Inference: the load-bearing maneuver for performance in any working-memory-bounded task is restructuring the input so each chunk packs more: chess masters chunk board positions into known patterns; programmers chunk code into named abstractions; musicians chunk notes into phrases. Chunking reuses prior knowledge to compress current input, which is why expertise looks like increased capacity but is actually increased per-chunk content.
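The “BMW ABC IBM” regrouping can be mimicked with a toy program. This is a sketch only: it assumes a greedy longest-match rule against a set of known patterns, which is an illustrative stand-in for long-term memory and not a claim about how human chunking actually works. The function name `chunk_letters` is hypothetical.

```python
def chunk_letters(letters: str, known: set[str]) -> list[str]:
    """Greedily group letters into the longest known patterns; unknown letters stay single."""
    max_len = max((len(k) for k in known), default=1)
    chunks, i = [], 0
    while i < len(letters):
        # Try the longest candidate first; a single letter always succeeds.
        for size in range(min(max_len, len(letters) - i), 0, -1):
            candidate = letters[i:i + size]
            if size == 1 or candidate in known:
                chunks.append(candidate)
                i += size
                break
    return chunks


novice = chunk_letters("BMWABCIBM", known=set())
expert = chunk_letters("BMWABCIBM", known={"BMW", "ABC", "IBM"})
print(len(novice), len(expert))
```

The same nine letters occupy nine slots without prior knowledge and three slots with it; the only thing that changed is the vocabulary of known patterns.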
Gradients are computed over a mini-batch rather than per example (pure SGD) or once per epoch over the whole dataset (full-batch GD); the batch size is the hyperparameter that interpolates between the two.
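A minimal sketch of the batch-size dial, in plain Python with no ML library. It fits a one-parameter model y ≈ w·x on synthetic, noise-free data; the function name, learning rate, and data are assumptions made for illustration. Setting `batch_size=1` gives per-example updates, and setting it to the dataset size gives full-batch gradient descent.

```python
import random


def minibatch_sgd(xs, ys, batch_size, lr=0.001, epochs=200, seed=0):
    """Fit y ≈ w * x by averaging the squared-error gradient over each mini-batch."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of mean((w*x - y)^2) with respect to w over this batch.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad  # one parameter update per chunk of examples
    return w


xs = [float(i) for i in range(1, 11)]
ys = [3.0 * x for x in xs]  # synthetic data with true slope 3.0
for batch_size in (1, 5, 10):
    print(batch_size, minibatch_sgd(xs, ys, batch_size))
```

Each epoch performs `len(xs) / batch_size` updates (rounded up), so the batch size sets how many examples share one update step.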
Notes group into motifs, motifs into phrases, and phrases into sections; the listener processes phrases, not notes.
Bytes are grouped into pages (4 KB typical); the page is the OS’s unit of allocation, swap, and protection.
Tanenbaum and Bos’s Modern Operating Systems (§3.5.3, “Page Size”) presents paged virtual memory as a designed act of chunking. Rather than tracking memory byte-by-byte, the OS groups address space into fixed-size pages and treats each page as a single indivisible unit for mapping, swapping, and protection. The page size is the chunk size, and the textbook lays out the tradeoff that choosing it involves: small pages waste less memory to internal fragmentation (the leftover in the last partly filled page) but require larger page tables and give the TLB less “reach”; large pages shrink the page table and improve TLB hit rates but waste more on fragmentation. Tanenbaum even gives the classic optimization, page size ≈ √(2·s·e) for average process size s and page-table-entry size e.

This is chunking in its computational form: an active grouping of items into larger units that produces a unit-shift in how downstream processes treat the content. Once memory is paged, the hardware and OS reason about pages, not bytes: the TLB caches page translations, the pager swaps whole pages, and permissions are set per page. The grouping changes the granularity of every operation above it. And the page-size tradeoff is the general chunking tradeoff made quantitative: bigger chunks compress the bookkeeping (fewer units to track) but coarsen the resolution (more waste, less precise handling), so the right chunk size is the one that balances overhead against fragmentation for the expected workload.
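The √(2·s·e) rule follows from a simple overhead model: a process of size s needs about s/p page-table entries of e bytes each, and wastes on average half a page (p/2) to internal fragmentation, so overhead(p) = s·e/p + p/2. Setting the derivative to zero gives p = √(2·s·e). The sketch below evaluates that model; the 1 MiB process size and 8-byte entry size are example values, and the function names are illustrative.

```python
import math


def page_overhead(page_size: float, process_size: float, entry_size: float) -> float:
    """Per-process overhead in bytes: page-table space plus expected internal fragmentation."""
    return process_size * entry_size / page_size + page_size / 2


def optimal_page_size(process_size: float, entry_size: float) -> float:
    """Minimizer of page_overhead under this model: p = sqrt(2 * s * e)."""
    return math.sqrt(2 * process_size * entry_size)


s = 1 << 20  # assumed average process size: 1 MiB
e = 8        # assumed page-table entry size: 8 bytes
p = optimal_page_size(s, e)
for candidate in (2048, p, 8192):
    print(candidate, page_overhead(candidate, s, e))
```

With these example values the model’s optimum is 4096 bytes, and halving or doubling the page size raises the modeled overhead by the same amount. The model ignores TLB reach and disk transfer costs, which also influence real page-size choices.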