Rewind Time: The 60-Second Buffer for the Supervisor Era
Callipso records the last 60 seconds of its own runtime so you can hand any bug to an agent. The product, the engineering, and the cognitive science behind it.
There is a button in the corner of Callipso. When you press it, the last 60 seconds of everything the overlay just did — encoded video, DOM events, console logs, every IPC call, the full state tree, every user click with coordinates and target descriptor — freezes into a scrubbable bundle. You scrub backwards. You stop on the exact frame the bug appeared. You narrate over it: see this — here is what I expected, here is what happened. You hand the bundle to a Claude Code agent.
The agent does not ask you to reproduce. The bundle is the reproduction. It selectively reads what it needs — the DOM mutation that broke the layout, the console error two seconds earlier, the IPC call that timed out — from a stream that was already running before you knew you would need it.
That is the product. The rest is consequence.
The product in one diagram
The audit that triggered this post
A 60-second buffer is only as good as the events it captures. Until this week our click capture relied entirely on rrweb's MouseInteraction event stream. We audited a real snapshot. In a 60-second window with five user clicks, rrweb captured one. The other four were on native <video controls> shadow DOM (intercepted before bubble), on scrollbars (no DOM event fires), and across separate BrowserWindows (the recording window, plan popup, and mini recorder are independent renderers).
The fix is a parallel click stream captured at document capture phase, recording {ts, x, y, button, target} for every pointerdown. Capture-phase fires before any element handler can call stopPropagation, before native video controls intercept, and before shadow DOM swallows the event. We shipped it through the pre-existing userflowSteps slot on self:snapshot-seal, so no backend schema change was needed. The player now uses this stream as the authoritative click source and renders each click with coordinates: CLICK @ 234,567 · button#sst-pb-next — Next.
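A minimal sketch of what such a capture-phase listener could look like. This is an illustration, not the shipped Callipso code: the `ListenerHost` and `PointerLike` interfaces, the `describeTarget` helper, and the `tag#id` descriptor format are assumptions introduced so the example runs outside a browser. The player's trailing label is left out.

```typescript
// Minimal structural types (assumed) so the sketch runs outside a browser.
interface PointerLike {
  clientX: number;
  clientY: number;
  button: number;
  target: unknown;
}

interface ListenerHost {
  addEventListener(
    type: string,
    listener: (ev: PointerLike) => void,
    options: { capture: boolean },
  ): void;
}

interface ElementLike {
  tagName: string;
  id?: string;
}

// Record shape named in the post: {ts, x, y, button, target}.
interface ClickRecord {
  ts: number;
  x: number;
  y: number;
  button: number;
  target: string;
}

// Hypothetical descriptor: lowercased tag name, plus "#id" when an id exists.
function describeTarget(target: unknown): string {
  const el = target as Partial<ElementLike> | null;
  if (!el || typeof el.tagName !== "string") return "unknown";
  const tag = el.tagName.toLowerCase();
  return el.id ? `${tag}#${el.id}` : tag;
}

// Registers one capture-phase pointerdown listener and appends a record per event.
function installClickCapture(
  host: ListenerHost,
  sink: ClickRecord[],
  now: () => number = Date.now,
): void {
  host.addEventListener(
    "pointerdown",
    (ev) => {
      sink.push({
        ts: now(),
        x: ev.clientX,
        y: ev.clientY,
        button: ev.button,
        target: describeTarget(ev.target),
      });
    },
    { capture: true }, // capture phase: runs before handlers on descendant elements
  );
}

// Player-style rendering of one record (coordinates and descriptor only).
function formatClick(c: ClickRecord): string {
  return `CLICK @ ${c.x},${c.y} · ${c.target}`;
}
```

In a renderer the host would be `document`, cast to the interface as needed. Injecting the clock through `now` keeps the sketch testable with fixed timestamps.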
The engineering took an afternoon. The interesting part is what the buffer actually does for the human using it.
What it dissolves
A trilogy is forming. Each era ends when its bottleneck becomes uneconomic to optimize:
| Year | Era | Who coined it |
|---|---|---|
| 2023 | Prompt engineering | Industry-wide |
| 2025 | Context engineering | Karpathy, Lütke |
| 2026 | Attention engineering | We are coining it |
Prompt engineering becomes accurate by default. Your narration was written with the bug paused on screen, not from memory. "It does this thing sometimes" becomes "at +12.4s the rrweb pane stopped receiving mutations — here is the frame." Prompt engineering existed because models were brittle to phrasing. They are not anymore, and the buffer removes the last reason to phrase defensively.
Context engineering becomes free for the ephemeral slice. The agent does not have to be fed the right slice of state — the slice was captured by the ring buffer, and the agent picks. Karpathy's framing of "LLM is the CPU, context window is RAM, you are the OS" applies precisely here, except now the OS has a dmesg it did not have to remember to run. Durable context (codebase knowledge, conventions) is still hard. Ephemeral context (what just happened in the running app) is solved.
Attention engineering is the next lever. This is the new term. Designing the substrate so a human can hand off, walk away, and re-enter without paying the residue tax. The 60-second buffer is one primitive in this class. So is the validation gate. So is the before/after video. None of them are about making the model smarter. All of them are about making the human's supervisory loop cheaper.
The verification gap
Coding LLMs have collapsed generation to near-instant. They have done very little to address verification. A person still has to stare at results and discriminate if they are good.
— Andrej Karpathy, June 2025, riffing on Balaji Srinivasan
He is right. Generation is free. Discrimination is the bottleneck. The before/after video is the verification primitive — discrimination through your eyes instead of through a diff. The buffer is what makes the before half exist. The validation gate is what makes the after half non-fakeable.
Naming what the buffer actually does
Almost every mechanism the buffer attacks has been studied in cognitive psychology, human-factors engineering, or supervisory-control literatures going back fifty years. Using the right vocabulary changes what you can claim — and what reviewers will accept.
Resumption lag, not switch cost
The most common framing for the cost of context-switching is switch cost — Stephen Monsell's 2003 Trends in Cognitive Sciences paper is the foundational citation. But switch cost is symmetric: it covers both the act of leaving a task and the act of returning to one. The buffer attacks the return side specifically, so the more precise term is resumption lag, from Altmann and Trafton's 2002 Memory for Goals model.
Resumption lag is the time between deciding to return to a previously-interrupted task and being productive again on it. It scales with how much state the human has to reconstruct mentally. The buffer reduces that reconstruction to zero for the most recent task: you do not reconstruct, you replay.
| Term | Source | What it names |
|---|---|---|
| Switch cost | Monsell 2003 | Generic latency penalty on any task switch (symmetric: leave + return) |
| Resumption lag | Altmann & Trafton 2002 | Time between deciding to return and being productive again |
| Restart cost | Allport, Styles & Hsieh 1994 | Cost specific to resuming after a pause |
| Reconfiguration time | Rogers & Monsell 1995 | Mental reset required between task-sets |
| Task-set inertia | Allport et al. 1994 | Prior task's mental configuration persists and interferes |
| Goal-activation decay | Altmann & Trafton 2002 | Why "what was I doing?" gets harder over time |
Calling the buffer a resumption-lag reducer is more honest than calling it a switch-cost reducer. We do not reduce the cost of leaving a task — that lives in the validation gate, which is a separate primitive. We reduce the cost of returning to one.
Attention residue and the Zeigarnik effect
Sophie Leroy's 2009 paper in Organizational Behavior and Human Decision Processes coined attention residue: the cognitive carryover that lingers when you switch away from an unfinished task. The central empirical finding, replicated repeatedly, is that incompleteness is the strongest predictor of residue. A closed task releases its hold on working memory. An open task continues to leak.
The mechanism is much older. Bluma Zeigarnik reported in 1927 that people recall interrupted tasks better than completed ones, a line of work traditionally traced to the observation that waiters remembered unpaid orders far better than paid ones. This is the Zeigarnik effect. Unfinished tasks occupy memory disproportionately; completion releases the load.
This is why the validation gate matters as much as the buffer itself. The gate forces real closure: the agent must reproduce the bug from the captured session, fix it in the codebase, replay the captured flow past the failure point, and record a fresh after-video on the live app. A green CI run is not enough. The gate triggers the Zeigarnik release. The buffer reduces resumption lag when you come back to the next task. Together they cut switch cost on both ends.
The composite label is a closure gate — a tool that compels completion via objective replay rather than self-reported "done."
Context reinstatement and the diving experiment
Why does watching ten seconds of "what I was doing when I left" reload the task in your head better than reading a written summary?
Tulving and Thomson's 1973 encoding specificity principle says recall is best when retrieval cues match encoding cues. The most famous demonstration is Godden and Baddeley's 1975 scuba diver experiment: subjects who learned word lists underwater recalled them better underwater than on land, and vice versa. The encoding context — pressure, temperature, peripheral vision — became part of the memory trace.
The before/after video works on the same principle. The cues you encoded while debugging — the exact frame, the cursor position, the running log message at +12.4s, the IPC call that timed out — are reinstated when you replay. Episodic memory cues fire. The task reloads through your eyes, not from working memory.
The right term is cued context reinstatement. This is also the cleanest argument against summary-based handoffs. A textual summary loses the cues. The video keeps them.
Cognitive offloading and the extended mind
Risko and Gilbert's 2016 Trends in Cognitive Sciences paper formalized cognitive offloading: externalizing memory or computation to tools to reduce the cognitive load you would otherwise carry. Notebooks are cognitive offloads. Calendars are cognitive offloads. Our 60-second buffer is a cognitive offload for episodic memory.
The philosophical version is older: Clark and Chalmers' 1998 extended mind paper argued that a notebook reliably carried and consulted is functionally part of the cognitive system. The buffer is part of the system in exactly that sense — once you have internalized that the last 60 seconds is always retrievable, you stop trying to remember it. Working memory is freed for the thing you are actually doing.
Edwin Hutchins' 1995 Cognition in the Wild generalized this into distributed cognition: the cognitive unit is the human plus tools plus workspace, not the human alone. The bug you are debugging is not solved by your brain alone; it is solved by your brain plus the buffer plus the agent pattern-matching across the captured bundle. This is also why our dev harness matters — without machine-readable access to runtime state, the agent cannot participate in the distributed cognition loop.
Supervisory control: the underused angle
Here is the literature almost nobody in the AI agent discourse cites: supervisory control. Tom Sheridan and William Verplank's 1978 framework was developed for a human overseeing automated systems, the situation of pilots with autopilots, drone operators, nuclear plant operators, and air traffic controllers. It is precisely the problem of supervising AI agents, stated nearly fifty years early.
| Term | Source | What it names |
|---|---|---|
| Situation awareness (SA) | Endsley 1995 | Three levels: perception, comprehension, projection. The buffer rebuilds Level 1+2 cheaply. |
| Out-of-the-loop performance problem | Endsley & Kiris 1995 | Supervisors degrade at intervening because they have lost SA. The classic parallel-agents failure mode. |
| Automation complacency | Parasuraman & Manzey 2010 | Operators stop verifying when automation seems reliable. Karpathy's verification gap, named earlier. |
| Levels of automation | Sheridan 1992 | Ten-level taxonomy from manual to fully autonomous. The framework AI agent supervision lives inside. |
When parallel agents touch the same files, the out-of-the-loop problem is exactly what bites. By the time one agent needs you, you have lost SA on it. The buffer is an OOTL counter-measure.
Metacognitive calibration
METR's 2025 randomized controlled trial of experienced developers using AI tools found a 39-point perception gap: subjects were 19% slower than the control, but believed they were 20% faster. This is a metacognitive calibration failure in the sense of Lichtenstein, Fischhoff and Phillips' 1982 work on confidence calibration. The believed-vs-actual gap is the central metric.
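The 39-point figure is the difference between two signed percentages. A tiny sketch of that arithmetic, where the `perceptionGap` name and the sign convention (positive means faster) are ours, introduced only for illustration:

```typescript
// Signed speed change in percent: positive = faster, negative = slower.
function perceptionGap(believedPct: number, measuredPct: number): number {
  return believedPct - measuredPct;
}

// Believed 20% faster, measured 19% slower.
const gap = perceptionGap(20, -19);
console.log(`${gap}-point perception gap`); // prints "39-point perception gap"
```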
The buffer replays ground truth. It is, in Lichtenstein's vocabulary, a metacognitive calibration scaffold. You can no longer tell yourself "that worked" if the playback shows it did not. You can no longer tell yourself "I was productive" if the captured session shows three false starts and a self-interrupt. The data breaks the illusion.
Prospective memory offload
Einstein and McDaniel's 1990 work on prospective memory — remembering to do something in the future — identifies a specific failure mode in interrupted work: the intention to return to a task is itself a memory burden, and that burden grows with the number of open intentions. When the validation gate enforces closure on a snapshot, "remember to come back to that bug" stops being a prospective-memory burden. The intention is materialized as an artifact on disk. This is prospective memory offload.
Parallel is the wrong word
The dev world says "I am running five agents in parallel." Marketing language; the science says otherwise.
True parallel attention means executing two tasks in the same instant — the supertasker driving while doing mental math. Only 2.5% of humans can actually do it (Watson & Strayer 2010). The rest of us are doing sequential task switching: one tab at a time, paying a switch cost on every jump.
And the unit is not "an agent." A feature is not one prompt — it is 10 to 20 small turns inside one tab, because every context-rot study lands on the same number: every extra 500 tokens costs roughly 5 percentage points of accuracy, and even 200k-context models degrade past 2 to 3k of relevant content. Big prompts hallucinate. Small prompts compose. So multi-tab work is many stacks of small turns, with switch cost living between the stacks.
Callipso attacks both sides of every switch, with a different mechanism on each:
| Side | Mechanism | What it does |
|---|---|---|
| Leave | Validation gating | The agent does not get to say "done." It must reproduce the bug from the captured session, fix it, replay the captured flow past the failure point, and record a fresh after-video. That gate cannot be faked with a green CI run. |
| Return | Before/after video | You watch 10 seconds of "what I was doing," then 10 seconds of what it does now on the live app. The context reloads through your eyes. Re-entry collapses from 23 minutes to 23 seconds. |
Leave easy. Come back easy. The switch cost shrinks at both ends.
The trap nobody names
Five "open" tabs do not add up to five times the focus. They add up to partial attention on each of five things. Linda Stone named this Continuous Partial Attention more than twenty years ago. The AI era industrialized it: each agent looks "almost done," so you check on it, so you self-interrupt, so you switch.
| Number | What it measures | Source |
|---|---|---|
| 19% slower | Experienced devs using AI tools — but believed they were 20% faster (39-point perception gap). | METR 2025 RCT |
| 23 minutes | Time to recover full focus after a single interruption. | Mark, Gudith & Klocke 2008 (CHI) |
| 2.5% | Of humans are true multitaskers. You are almost certainly not one. | Watson & Strayer 2010 |
The lever is not "open more sessions." It is close more sessions, faster. The buffer makes closure cheap. The gate makes closure trustworthy.
Two modes — your call
After five months of watching our own sessions, two distinct rhythms emerged:
| Mode | When | Why the buffer helps |
|---|---|---|
| Fan-out mode | Bug-fixing day. Many small tasks, dozens of commits, four tabs cycling. | Switch cost dominates. Rewind shines here: cheap closure, cheap return, you stay productive across the fan. |
| Deep mode | Building a new system. One tab, no parallelism. | You sit with the LLM's long reasoning rather than switching away from it. The buffer is mostly idle but the validation gate still enforces honest completion. |
Both are legitimate. Which one wins depends on what you are building, not on the tool. Callipso does not nudge you toward either. It gives you the substrate so the choice is yours, and lowers the friction so neither mode gets punished.
The product in one line: reduce attention residue through hard validation gates and a before/after video, so the user controls the arbitrage.
The composite framing
The strongest single-sentence claim:
The 60-second buffer is a supervisory-control primitive that reduces resumption lag through cued context reinstatement, releases attention residue via a closure gate, externalizes episodic memory through cognitive offloading, and serves as a metacognitive calibration scaffold against automation-induced out-of-the-loop performance decay.
Each clause cites a distinct tradition. Each is independently defensible. None of them is novel as a concept: every underlying idea comes from the published literature, most of it from before LLMs existed. What is new is the composition: putting all of these mechanisms inside a single product surface, sized for handoff to an LLM agent.
FAQ
What does the rewind buffer actually capture?
The last 60 seconds of: encoded H.264 video of the overlay window (8 Mbps, up to 4K, hardware-accelerated via VideoToolbox), every DOM event via rrweb, every console log, every IPC call across our 480 channels, the full state tree snapshot at seal time, and every user click with coordinates plus target descriptor. Retention is capped at 5 snapshots per boot, so disk use stays under ~50MB.
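To make the rolling window concrete, here is a hedged sketch of a time-windowed event buffer with a per-boot snapshot cap. It is not the production implementation: `RollingBuffer`, `BufferedEvent`, and the plain-array eviction (rather than a fixed-size circular array) are assumptions, and it covers only the event streams, not the video.

```typescript
interface BufferedEvent {
  ts: number; // milliseconds
  stream: string; // e.g. "console", "ipc", "click"
  payload: unknown;
}

class RollingBuffer {
  private events: BufferedEvent[] = [];
  private sealedCount = 0;
  private readonly windowMs: number;
  private readonly maxSnapshots: number;

  constructor(windowMs = 60_000, maxSnapshots = 5) {
    this.windowMs = windowMs;
    this.maxSnapshots = maxSnapshots;
  }

  // Append an event, then drop everything older than the window.
  // Assumes events arrive in timestamp order.
  push(ev: BufferedEvent): void {
    this.events.push(ev);
    const cutoff = ev.ts - this.windowMs;
    const firstKept = this.events.findIndex((e) => e.ts >= cutoff);
    if (firstKept > 0) this.events.splice(0, firstKept);
  }

  // Return a shallowly frozen copy of the current window,
  // or null once the per-boot snapshot cap has been reached.
  seal(now: number): readonly BufferedEvent[] | null {
    if (this.sealedCount >= this.maxSnapshots) return null;
    this.sealedCount += 1;
    const cutoff = now - this.windowMs;
    return Object.freeze(this.events.filter((e) => e.ts >= cutoff));
  }
}
```

Sealing copies the window instead of pausing capture, so the stream keeps running while a snapshot is being scrubbed.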
Why a fixed 60-second window instead of unlimited recall?
A 60-second cap forces curation. Rewind.ai and Microsoft Recall index everything you ever do, which solves a different problem (search across months of activity). Our problem is debugging the last minute of overlay behavior to hand to an agent. Capping at 60 seconds keeps the bundle small enough to be a reasonable LLM context, keeps disk cost negligible, and matches the cognitive-science evidence on how long episodic detail survives in working memory before consolidation reshapes it.
Why is the click capture separate from rrweb?
rrweb attaches MouseInteraction listeners on document in capture phase, which should see every click. In practice it misses clicks on native <video controls> shadow DOM, on scrollbar drags, and across separate BrowserWindows. Our parallel click capture also listens at document capture phase — same dispatch point — but is independent of rrweb's internal filtering and runs even when rrweb is paused. In our audit we caught 5x more clicks with the parallel capture.
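An audit like this can be approximated by counting click events in both streams. The sketch below is an assumption-heavy illustration: the three numeric constants reflect our reading of rrweb's `EventType`, `IncrementalSource`, and `MouseInteractions` enums and should be checked against the rrweb version in use, and `auditClicks` is a hypothetical helper.

```typescript
// Assumed numeric values of rrweb's EventType.IncrementalSnapshot,
// IncrementalSource.MouseInteraction and MouseInteractions.Click.
// Verify against the rrweb version you run.
const INCREMENTAL_SNAPSHOT = 3;
const SOURCE_MOUSE_INTERACTION = 2;
const INTERACTION_CLICK = 2;

interface RrwebLikeEvent {
  type: number;
  timestamp: number;
  data?: { source?: number; type?: number };
}

// Counts events that look like rrweb click interactions.
function countRrwebClicks(events: readonly RrwebLikeEvent[]): number {
  return events.filter(
    (e) =>
      e.type === INCREMENTAL_SNAPSHOT &&
      e.data?.source === SOURCE_MOUSE_INTERACTION &&
      e.data?.type === INTERACTION_CLICK,
  ).length;
}

interface AuditResult {
  rrweb: number;
  parallel: number;
  ratio: number | null; // null when rrweb saw no clicks at all
}

// Compares the rrweb click count with the parallel pointerdown stream.
function auditClicks(
  rrwebEvents: readonly RrwebLikeEvent[],
  parallelClicks: readonly { ts: number }[],
): AuditResult {
  const rrweb = countRrwebClicks(rrwebEvents);
  const parallel = parallelClicks.length;
  return { rrweb, parallel, ratio: rrweb > 0 ? parallel / rrweb : null };
}
```

With one rrweb click and five records in the parallel stream, `auditClicks` reports a ratio of 5, which is the shape of the result described above.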
How is this different from session-replay tools like Sentry or LogRocket?
Session-replay SaaS captures DOM-only for production debugging, indexed by user. Our buffer captures DOM plus video plus IPC plus state plus clicks, scoped to the developer's own running app, sized for an LLM agent rather than a human reviewer. The closest single ancestor is Replay.io for browsers — but Replay does not cover the Electron + IPC + state surface we need for parallel agent supervision.
Is "attention engineering" actually a real term?
Not yet. We are coining it. The concept is real and is the natural successor to prompt engineering and context engineering. The label is fragile and might lose to "supervision UX," "closure tooling," or "verification primitives." The 60-second buffer is evidence for the term, not the term itself.
Where We Are Ahead
Multi-stream capture sized for LLM handoff is the differentiator. The closest products either capture one stream only (Rewind.ai is screen-only, rrweb is DOM-only) or are designed for human review rather than agent debugging (Sentry, LogRocket, Replay.io). Stacking video + DOM + console + IPC + state + clicks into one scrubbable bundle is what makes the agent able to selectively read the slice it needs without us having to anticipate which slice that is.
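Selective reading can be pictured as a filter over a merged event timeline. The following is a sketch under stated assumptions, not the bundle's actual format: `BundleEvent`, `SliceQuery`, `selectSlice`, and the stream names are hypothetical.

```typescript
type Stream = "dom" | "console" | "ipc" | "state" | "click";

interface BundleEvent {
  ts: number; // ms since the start of the bundle
  stream: Stream;
  payload: unknown;
}

interface SliceQuery {
  streams: Stream[];
  fromMs: number;
  toMs: number;
}

// Returns only the events from the requested streams inside the time range,
// ordered by timestamp. The input bundle is not modified.
function selectSlice(bundle: readonly BundleEvent[], q: SliceQuery): BundleEvent[] {
  const wanted = new Set<Stream>(q.streams);
  return bundle
    .filter((e) => wanted.has(e.stream) && e.ts >= q.fromMs && e.ts <= q.toMs)
    .sort((a, b) => a.ts - b.ts);
}
```

An agent looking at a failure at +12.4s might ask for `selectSlice(bundle, { streams: ["console", "ipc"], fromMs: 10_400, toMs: 12_400 })` and never load the video or the DOM stream at all.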
Naming each mechanism in the original cognitive-science vocabulary is a positioning advantage. The AI agent discourse is currently rediscovering supervisory control and situation awareness without citation. Anchoring our product description in Sheridan, Endsley, Leroy, Tulving, and Altmann & Trafton gives the claim a defensible foundation that surface-level marketing copy cannot match.
Where We Are Behind
The buffer is overlay-only. Clicks on the recording window, the plan popup, and the mini recorder are still missed because each is a separate Chromium process and we have not yet injected the click capture into them. The fix is mechanical (inject the same pointerdown listener into every BrowserWindow's preload) but not yet done.
Clicks outside Callipso itself are invisible. If the bug involves Slack, Finder, or system dialogs, the buffer misses them. Peekaboo v3 gives us a path to OS-level click capture via CGEvent observation but we have not integrated it.
The validation gate is described in the framing but not yet shipped as a hard enforcement layer. Today the agent is asked to "reproduce and verify"; tomorrow the system should refuse the "done" signal until reproduction + after-video are on disk.
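Since the gate is not shipped yet, the following is only a sketch of what hard enforcement could look like. Every name here (`GateEvidence`, `evaluateDoneSignal`, the field names) is hypothetical, and the file check is injected so the example stays self-contained.

```typescript
// Evidence an agent would have to present before "done" is accepted (hypothetical shape).
interface GateEvidence {
  reproducedFromSnapshot: boolean;
  fixCommitted: boolean;
  replayedPastFailure: boolean;
  afterVideoPath: string | null;
}

interface GateResult {
  accepted: boolean;
  missing: string[];
}

// Refuses the "done" signal unless every step has evidence
// and the after-video actually exists on disk.
function evaluateDoneSignal(
  evidence: GateEvidence,
  fileExists: (path: string) => boolean,
): GateResult {
  const missing: string[] = [];
  if (!evidence.reproducedFromSnapshot) missing.push("reproduction from captured session");
  if (!evidence.fixCommitted) missing.push("fix in codebase");
  if (!evidence.replayedPastFailure) missing.push("replay past failure point");
  if (evidence.afterVideoPath === null || !fileExists(evidence.afterVideoPath)) {
    missing.push("after-video on disk");
  }
  return { accepted: missing.length === 0, missing };
}
```

In Node, `fileExists` could be `fs.existsSync`. Returning the list of missing items, rather than a bare boolean, gives the agent something concrete to go back and produce.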
What We Learned
Three observations that surprised us:
First, rrweb's MouseInteraction stream is dramatically lossier than the documentation suggests in any UI that uses native shadow-DOM controls. Anyone building session replay on top of rrweb for a Chromium-based UI should expect to layer a parallel capture for clicks.
Second, naming matters more than we expected. Saying "the buffer reduces switch cost" felt true but was too vague to defend. Switching to "the buffer reduces resumption lag specifically, while the validation gate releases attention residue specifically" sharpened the product story and exposed which features were doing which job.
Third, the supervisory-control literature is the most under-cited corpus in the AI agent discourse. Endsley's situation awareness model, Sheridan's levels of automation, and Parasuraman's work on automation complacency are direct predecessors of every parallel-agent UX problem the field is currently relitigating. We expect this to change as the field matures.
Software 3.0 was prompts. The supervisor era is time.
Software 1.0 was code. Software 2.0 was weights. Software 3.0 was prompts. The supervisor era is time — leaving a session and coming back to it without paying the residue tax. The 60-second buffer is one primitive in attention engineering, the class of tools that compound by making the human's supervisory loop cheaper rather than making the model smarter. The literature has been waiting. We are filling the product surface that was missing.
Try it in Callipso and tell us what we got wrong.
Sources
- Karpathy on the verification gap (June 2025)
- Karpathy — Verifiability essay
- Karpathy on context engineering (June 2025)
- METR — early-2025 AI productivity RCT (19% slower)
- Watson & Strayer (2010) — Supertaskers
- Sophie Leroy — Attention Residue (2009)
- Gloria Mark — shrinking attention spans (APA podcast)
- Context length hurts LLM performance — arXiv 2510.05381
- HBR — When using AI leads to "brain fry"
- Best practices for Claude Code — Anthropic