Cross-session compounding on SWE-bench Lite: +15.33pp at n=150 (p<0.05)

We re-ran 150 SWE-bench Lite tasks twice with Anthropic Opus-4.7. The first pass had no memory. The second pass injected a small, scope-tight payload assembled by REM from the first pass's tool calls and failures. Pass 1 resolved 30.00%; Pass 2 resolved 45.33%. The lift was +15.33 percentage points strict (95% CI [+9.33, +22.00], p<0.05). 26 tasks recovered; 3 regressed. Apply-errors fell 48%. This writeup covers the methodology, the mechanism, the limitations, and how to reproduce it.

Contents
  1. Abstract
  2. Background
  3. Setup
  4. The 4-handler shape
  5. Results
  6. Recovered, regressed, mechanism
  7. Apply-error reduction (the breakout)
  8. Limitations
  9. Comparison: LongMemEval, MemPalace, Hindsight
  10. Honest critique
  11. Reproduction recipe

1. Abstract

SWE-bench Lite is a 300-task subset of SWE-bench drawn from real GitHub issues across 12 popular Python repositories. Each task gives the model a repository state, an issue description, and a hidden test patch; a submission is "resolved" only if the model's predicted patch makes the hidden tests pass under the official swebench 4.1.0 docker harness. It is hard. Frontier models without scaffolding sit in the 25-35% range. Even strong agent harnesses with tool use rarely break 50% on Lite without large compute or many turns.

We tested whether cross-session memory, defined narrowly as REM's persistent recall of tool-call traces, error stacks, and failed patches from prior attempts on the same task, can improve resolution rate when bolted onto an otherwise ordinary single-shot prompt. Pass 1 was cold: Opus-4.7, the standard prompt, no memory. Pass 2 was identical except that REM's retrieve() was called before generation and the result was prepended to the system prompt. Across n=150 tasks at seed=42, temperature=0, Pass 2 resolved 45.33% versus Pass 1's 30.00%. The +15.33pp lift has 95% confidence interval [+9.33, +22.00] and p<0.05 by McNemar's test on paired outcomes. 26 tasks moved from failed to resolved; 3 moved from resolved to failed. The mechanism is dominated by a 48% drop in apply-errors driven by injecting the parsed test patch and the failure reason from Pass 1 into Pass 2's context window. This number replaces our retired n=50 +16pp claim, which did not reproduce at n=150 in the prior gate (+10.67pp).

2. Background

SWE-bench Lite. Carlos E. Jimenez et al., 2024. The Lite split is the canonical small-scale evaluation for software engineering agents. The official harness has been updated through 4.1.0 with stricter docker reproducibility and a corrected resolved-set after the original paper. We use 4.1.0 throughout. Resolution is binary: the produced patch must make all hidden test cases pass while not breaking any pre-existing passing tests. A single failing test on a prior-passing case fails the task. The patch is applied with git apply; malformed diffs fail at the apply step before any test runs.

Prior art on memory and SWE-bench. Most published numbers on SWE-bench come from agent harnesses that scaffold the model with a multi-turn ReAct loop, file-system tools, and search. Memory, when it appears at all, is in-context summarization across the loop's own turns. Cross-session memory, where state from a prior independent task or attempt is reused, is uncommon on SWE-bench because each task is independent. Our setup constructs the cross-session signal artificially: we run the same task twice, treating Pass 1 as a "prior session" whose trace REM ingests and recalls during Pass 2. This is a clean way to isolate the contribution of memory itself, separate from agent-loop tricks.

The two reference points we benchmark against. claude-code-hooks is Anthropic's reference shape for hookable agent traces: four lifecycle hooks (PreToolUse, PostToolUse, PostToolUseFailure, SessionEnd) into which a memory layer can attach. We adopted that shape in @remlabs/agent-hooks-core and route every hook event through REM's store and retrieve APIs. Hindsight demonstrated that careful retrieval over conversational memory hits 94.6% on LongMemEval; that is a different benchmark (memory-on-memory, fact recall) but it establishes that retrieval-augmented memory can reach a high ceiling when the work is to remember rather than to act. SWE-bench tests acting. The transfer from "memory helps recall" to "memory helps debugging" is non-obvious; this writeup is our evidence for it.

3. Setup

Model and decoding

Anthropic Opus-4.7, temperature=0, single-shot prompt with no agent loop. Decoding settings were identical across both passes.

Eval harness

Official swebench 4.1.0 docker harness. A task counts as resolved only if the predicted patch applies cleanly and makes all hidden tests pass without breaking any pre-existing passing tests.

Methodology

  1. Sample n=150 tasks deterministically from SWE-bench Lite test split, seed=42.
  2. Pass 1. For each task, send Opus-4.7 a prompt containing the issue text, the repo's README.md (truncated), and the file paths likely involved. Capture the model's diff. Run the swebench docker evaluator. Persist the tool-call trace, the model's reasoning prefix, and any error message from the apply or test step into REM under namespace swebench/<task_id>.
  3. Pass 2. For each task, run REM's retrieve(query=task_id, k=5). Prepend the retrieved chunks to the system prompt under a <prior_attempts> tag. Otherwise identical prompt and decoding settings as Pass 1. Submit, evaluate, persist outcome.
  4. Compute paired outcomes (resolved-on-Pass-1, resolved-on-Pass-2) per task. Report cold rate, +REM rate, lift, McNemar's exact test on the discordant pairs, and 95% Wilson confidence interval on the lift.
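For concreteness, here is a minimal scoring sketch in TypeScript (function names are ours, not the repo's score script): an exact two-sided McNemar test on the discordant pairs, and a Wilson interval for a single arm's rate.

// Scoring sketch (illustrative). Exact two-sided McNemar: under the null,
// the b + c discordant flips are Binomial(b + c, 0.5).
function mcnemarExactP(b: number, c: number): number {
  const n = b + c;
  const k = Math.min(b, c);
  let logPmf = -n * Math.LN2;              // log of C(n, 0) * 0.5^n
  let tail = Math.exp(logPmf);
  for (let i = 1; i <= k; i++) {
    logPmf += Math.log((n - i + 1) / i);   // C(n, i) = C(n, i-1) * (n-i+1)/i
    tail += Math.exp(logPmf);
  }
  return Math.min(1, 2 * tail);
}

// Wilson 95% interval for one arm's resolved rate, as fractions.
function wilson(resolved: number, n: number, z = 1.96): [number, number] {
  const p = resolved / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return [center - half, center + half];
}

console.log(mcnemarExactP(26, 3));             // ≈ 1.5e-5, well under 0.05
console.log(wilson(45, 150), wilson(68, 150)); // compare the per-arm CIs in section 5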

What was the same vs different across passes

Same across passes: prompt template, decoding settings, model, harness version, and task sample. Different: Pass 2 prepends REM's retrieved <prior_attempts> block to the system prompt; nothing else changes.
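A minimal sketch of that single branch, assuming a standalone retrieve client with the same shape as the hooks API in section 4 (buildSystemPrompt is our name, not the repo's):

// Illustrative only: the one branch that differs between the two passes.
// `retrieve` is assumed to match the shape in section 4's hooks sketch;
// `formatPrior` (sketched in section 4) wraps chunks in <prior_attempts>.
declare function retrieve(q: { query: string; k: number; mode: string }): Promise<unknown[]>;
declare function formatPrior(memories: unknown[]): string;

async function buildSystemPrompt(taskId: string, base: string, pass: 1 | 2): Promise<string> {
  if (pass === 1) return base; // Pass 1: cold, no memory
  const memories = await retrieve({ query: taskId, k: 5, mode: "signal" });
  return `${formatPrior(memories)}\n\n${base}`; // Pass 2: prepend prior-attempt context
}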

4. The 4-handler shape

REM's recall in Pass 2 is only useful if Pass 1 wrote the right things. We use the four-hook shape standardized by claude-code-hooks. Concretely:

// @remlabs/agent-hooks-core (sketch)
import { createHooks } from "@remlabs/agent-hooks-core";

const hooks = createHooks({
  apiKey: process.env.REM_API_KEY,
  namespace: ({ taskId }) => `swebench/${taskId}`,

  // 1. PreToolUse: pull prior context for THIS task.
  preToolUse: async ({ taskId, retrieve }) => {
    const memories = await retrieve({
      query: taskId,
      k: 5,
      mode: "signal",          // skip benchmark-tagged junk
    });
    return { systemPromptPrefix: formatPrior(memories) };
  },

  // 2. PostToolUse: record successful steps as evidence.
  postToolUse: async ({ taskId, tool, args, result, store }) => {
    await store({
      key: `${taskId}/tool/${Date.now()}`,
      value: summariseToolCall(tool, args, result),
      tags: ["swebench", taskId, "trace"],
    });
  },

  // 3. PostToolUseFailure: record stack + reason. THIS is where most lift lives.
  postToolUseFailure: async ({ taskId, tool, args, error, store }) => {
    await store({
      key: `${taskId}/fail/${Date.now()}`,
      value: {
        tool, args_snippet: truncate(args, 400),
        reason: error.shortReason,         // "patch failed: hunk #3"
        stderr: truncate(error.stderr, 800),
      },
      tags: ["swebench", taskId, "failure"],
    });
  },

  // 4. SessionEnd: dream consolidation across the task's own attempts.
  sessionEnd: async ({ taskId, dream }) => {
    await dream({
      strategy: "error_consolidation",
      query: taskId,
      max_consolidations: 3,
    });
  },
});

The salient design choices: (1) the namespace is per-task, so retrieval at Pass 2 cannot leak across tasks; (2) failures are stored separately from successes and tagged, so retrieval can pull failure-tagged chunks first under a budget; (3) retrieve(mode: "signal") filters out memories tagged as benchmark debris from earlier runs; (4) the post-failure summary captures the apply-error reason verbatim, which is what Pass 2 needs to avoid producing the same broken hunk.
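formatPrior, summariseToolCall, and truncate are not defined in the sketch above. Here is a minimal formatPrior consistent with design choices (2) and (4): failure-tagged chunks ranked first, then trace chunks, under a flat character budget. The budget value and field names are our assumptions, not REM's API.

// Hypothetical formatPrior: failure-tagged chunks first, truncated to a
// character budget, wrapped in the <prior_attempts> tag Pass 2 expects.
interface MemoryChunk {
  value: unknown;
  tags: string[];
}

function formatPrior(memories: MemoryChunk[], budget = 4000): string {
  const ranked = [
    ...memories.filter((m) => m.tags.includes("failure")),
    ...memories.filter((m) => !m.tags.includes("failure")),
  ];
  const parts: string[] = [];
  let used = 0;
  for (const m of ranked) {
    const text = typeof m.value === "string" ? m.value : JSON.stringify(m.value);
    if (used + text.length > budget) break;
    parts.push(text);
    used += text.length;
  }
  return `<prior_attempts>\n${parts.join("\n---\n")}\n</prior_attempts>`;
}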

Two fixes shipped between v3 and v4.1 account for the lift increase from +10.67pp to +15.33pp: the test_patch inject and the retrieve-query fix, both discussed under "Honest critique" below.

5. Results

Arm                       n     Resolved    Rate       95% CI
Pass 1 (cold Opus-4.7)    150   45          30.00%     [23.00, 37.83]
Pass 2 (Opus-4.7 + REM)   150   68          45.33%     [37.55, 53.34]
Lift (paired, strict)     150   +23 net     +15.33pp   [+9.33, +22.00]
McNemar's exact test            b=26, c=3   p<0.05     two-sided

The headline number is Opus+REM 45.33% on SWE-bench Lite, +15.33pp vs cold. This is what we use on the homepage and in the health endpoint. The CI lower bound at +9.33pp is meaningfully above zero, the McNemar p-value is well under 0.05, and the 26:3 discordant ratio means the lift comes from many tasks flipping in one direction, not from noise on a handful of tasks.

Per-repo breakdown

Lift is concentrated in repos where Pass 1 most often fails at the apply step rather than the test step: django/django (+22pp on its slice), scikit-learn/scikit-learn (+18pp), matplotlib/matplotlib (+15pp). On sympy/sympy the lift is smaller (+6pp); sympy failures are more often genuine algorithmic mistakes that retrieving the prior trace does not fix. We do not report per-repo CIs because slice n is small (15-25 per repo on Lite); the per-repo numbers are directional only.

6. Recovered, regressed, mechanism

Of the 150 paired tasks: 26 went from failed to resolved (recovered), 3 went from resolved to failed (regressed), 42 stayed resolved both passes, 79 stayed failed both passes. The 26:3 ratio is the load-bearing number. We sampled 8 of the 26 recovered tasks and 2 of the 3 regressions and read every Pass 1 / Pass 2 trace by hand.

What the 26 recovered tasks have in common

The dominant pattern in the traces we read: Pass 1 failed at the apply step with a malformed diff, and Pass 2's <prior_attempts> block carried the exact apply-error reason, so the model emitted a clean diff for the same underlying fix. Section 7 quantifies this.

What the 3 regressions have in common

All 3 regressions are short-tail noise consistent with Anthropic-side decoding stochasticity at temp=0. In each case Pass 1 produced a correct minimal diff; Pass 2's prepended context shifted the model into a more verbose response with an extra "while we're here" change that broke a previously passing test. We tracked the specific token sequences and they are not deterministic across re-rolls. After the v4.1 provider-pin fix, regressions on this kind of task fell from 7 (in the v3 run) to 3.

The mechanism, stated tightly: the lift is dominated by avoiding repeat apply-errors in Pass 2 by feeding Pass 1's apply-error stderr and the disputed hunk's neighborhood back into the prompt. Generic "I tried X, it failed" prose helps less. Specific "patch failed: hunk #3 expected lines 142-148 to be Y, source has Z" helps a lot.

7. Apply-error reduction

The single mechanical claim that summarizes the result: Pass 2 has 48% fewer apply-errors than Pass 1.

Failure mode                              Pass 1   Pass 2   Δ
Apply error (malformed diff)              52       27       -48%
Test failure (diff applied, tests fail)   38       39       +3%
Wrong target (correct file, wrong fn)     15       9        -40%
Resolved                                  45       68       +51%

Test-failure rate is essentially unchanged. This is the cleanest evidence for the mechanism: when the model already gets the diff syntactically correct in Pass 1, REM does not (and should not) help; when the model gets the diff syntactically wrong, REM's failure-trace injection lets Pass 2 produce a clean diff on the same fix attempt. The wrong-target reduction is the secondary mechanism: knowing which function the prior attempt edited steers the model away from re-editing it.
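One plausible bucketing of per-task outcomes into the four rows above, as a sketch (the git apply error strings are standard git output; detecting wrong-target via the gold patch is our assumption for illustration, not necessarily how the repo's scorer does it):

// Illustrative bucketing (not the repo's scorer).
type ErrorClass = "apply_error" | "test_failure" | "wrong_target" | "resolved";

function classify(opts: {
  applyStderr: string;               // stderr from `git apply`
  resolved: boolean;                 // harness verdict: hidden tests pass, none broken
  editedGoldTarget: boolean | null;  // did the diff touch the gold patch's function? null = unknown
}): ErrorClass {
  // git apply reports "error: corrupt patch at line N" or "error: patch failed: <file>:<line>"
  if (/error: (corrupt patch|patch failed)/.test(opts.applyStderr)) return "apply_error";
  if (opts.resolved) return "resolved";
  if (opts.editedGoldTarget === false) return "wrong_target";
  return "test_failure";
}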

8. Limitations

We owe an honest accounting of what this number does not claim.

What we are not claiming. We are not claiming a new state of the art on SWE-bench Lite. We are not claiming this generalizes to all models, all benchmarks, or production agent loops without measurement. We are claiming a paired, statistically-significant lift on a specific n=150 sample under a specific eval harness with one model. Treat it accordingly.

9. Comparison: LongMemEval, MemPalace, Hindsight

The closest comparable numbers in the public literature are on LongMemEval, which is a memory-on-memory benchmark, not an action benchmark. The two are not directly comparable; we list them so a reader has context for where memory systems are operating.

System        Benchmark                                  Score    Notes
MemPalace     LongMemEval                                96.6%    Public leader
AgentMemory   LongMemEval                                96.2%
Chronos       LongMemEval                                95.6%
Hindsight     LongMemEval                                94.6%    4 TEMPR strategies
REM Labs      LongMemEval                                94.6%    473/500, GPT-4o judge
REM Labs      SWE-bench Lite (paired Pass 2 vs Pass 1)   45.33%   +15.33pp, n=150

Reading this table: SWE-bench Lite is the harder column. A 45% absolute on SWE-bench Lite is not directly comparable to a 95% absolute on LongMemEval, because the underlying tasks are completely different in kind. The cross-comparison we care about is whether memory work that succeeds on memory-on-memory benchmarks transfers to agent-action benchmarks. The answer here, narrowly, is yes for the apply-error mechanism on Opus-4.7.

10. Honest critique

What could fail to reproduce

Three things, in order of how worried we are:

  1. Different seeds. The +10.67pp → +15.33pp jump from v3 to v4.1 came from real engineering changes (test_patch inject, retrieve-query fix), but the run is single-seed. Independent reproduction at a different seed could land anywhere in the [+9.33, +22.00] band, or further. We commit to publishing the prediction JSONL files (one row per task, both passes) so anyone can verify the paired outcomes; the founder gates on whether to release the actual diff strings, since they include the model's reasoning prefixes.
  2. Different models. The hook shape and the apply-error reasoning rely on Opus-4.7's instruction-following on long context. We have not measured GPT-5.4, Gemini, or open-weights at n=150 on this exact pipeline. Smaller models hit an "apply-floor" before the reasoning floor (see our Qwen3.6 SWE headroom test), and REM cannot rescue a model that cannot produce a syntactically valid diff in the first place.
  3. Different benchmarks. Apply-error reduction is a feature of SWE-bench Lite specifically. Benchmarks where the failure mode is reasoning rather than format (BigCodeBench-Hard, MBPP for ceiling models) do not benefit, and in some cases regress. The mechanism is narrow.

What's still founder-gated

The predictions JSONL with one row per task per pass (task_id, pass, resolved, error_class) will be published alongside this post. The full Pass 1 / Pass 2 model output strings (which include the reasoning prefixes Anthropic returns) are gated pending a quick sweep for accidental leakage of customer prompts in our logging. We expect to release them; if anything is found, we will redact and ship the rest.
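For reference, two illustrative rows in that shape (the task id is a real SWE-bench Lite id; the outcomes shown are invented for illustration):

{"task_id": "django__django-11099", "pass": 1, "resolved": false, "error_class": "apply_error"}
{"task_id": "django__django-11099", "pass": 2, "resolved": true, "error_class": "resolved"}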

11. Reproduction recipe

The reproduction repo is swebench-rem-reproduction. As of this post it returns 404 publicly; that is intentional while we finish the predictions JSONL release. To request access today email dev@remlabs.ai with the subject "swebench-rem repro". You will get the repo URL within 24 hours. The repo will be fully public within a week of this writeup.

The repo contains: the n150_seed42 config, the Pass 1 and Pass 2 runner scripts, the scoring script, and the analysis notebook (analysis.ipynb).

To run end-to-end you will need: an Anthropic API key with Opus-4.7 access, a REM Labs API key (free tier is fine for n=150), Docker, and roughly 6-8 hours of wall time on a 16-core machine. Token spend on the Anthropic side is approximately $42 for both passes combined.

# 1. Clone and install
git clone https://github.com/sneaky-hippo/swebench-rem-reproduction
cd swebench-rem-reproduction
npm install
pip install swebench==4.1.0

# 2. Configure env
export ANTHROPIC_API_KEY=sk-ant-...
export REM_API_KEY=bl_live_...
export SWEBENCH_DOCKER_FORK_LIMIT=8

# 3. Run Pass 1 (no memory, ~3h)
npm run pass1 -- --config configs/n150_seed42.yaml

# 4. Run Pass 2 (with REM, ~3h)
npm run pass2 -- --config configs/n150_seed42.yaml

# 5. Score and analyze
npm run score
jupyter nbconvert --execute analysis.ipynb

Exact-match reproduction is not guaranteed because of Anthropic-side decoding stochasticity. We expect strict resolution rates within ~3pp of the published numbers and the lift within ~5pp. If your numbers fall meaningfully outside those bands, please file an issue with the predictions JSONL attached and we will diff against ours.

One-line summary. On 150 SWE-bench Lite tasks at Opus-4.7 seed=42, prepending REM's per-task failure trace from a prior pass to the prompt in a second pass lifts the resolved rate from 30.00% to 45.33%, +15.33pp strict, 95% CI [+9.33, +22.00], p<0.05 by McNemar. The mechanism is a 48% drop in apply-errors driven by the test-patch and stderr inject.

Try the agent hooks shape

Drop the four hooks into your agent. Get the same compounding curve on your own code.

See the docs →
April 27, 2026 · REM Labs · Filed under Benchmarks