Cross-session compounding on SWE-bench Lite: +15.33pp at n=150 (p<0.05)
We re-ran 150 SWE-bench Lite tasks twice with Anthropic Opus-4.7. The first pass had no memory. The second pass injected a small, scope-tight payload assembled by REM from the first pass's tool calls and failures. Pass 1 resolved 30.00%; Pass 2 resolved 45.33%. The lift was +15.33 percentage points strict (95% CI [+9.33, +22.00], p<0.05). 26 tasks recovered; 3 regressed. Apply-errors fell 48%. This writeup is the methodology, the mechanism, the limitations, and how to reproduce it.
1. Abstract
SWE-bench Lite is a 300-task subset of SWE-bench drawn from real GitHub issues across 12 popular Python repositories. Each task gives the model a repository state, an issue description, and a hidden test patch; a submission is "resolved" only if the model's predicted patch makes the hidden tests pass under the official swebench 4.1.0 docker harness. It is hard. Frontier models without scaffolding sit in the 25-35% range. Even strong agent harnesses with tool use rarely break 50% on Lite without large compute or many turns.
We tested whether cross-session memory, defined narrowly as REM's persistent recall of tool-call traces, error stacks, and failed patches from prior attempts on the same task, can improve resolution rate when bolted onto an otherwise ordinary single-shot prompt. Pass 1 was cold: Opus-4.7, the standard prompt, no memory. Pass 2 was identical except that REM's retrieve() was called before generation and the result was prepended to the system prompt. Across n=150 tasks at seed=42, temperature=0, Pass 2 resolved 45.33% versus Pass 1's 30.00%. The +15.33pp lift has 95% confidence interval [+9.33, +22.00] and p<0.05 by McNemar's test on paired outcomes. 26 tasks moved from failed to resolved; 3 moved from resolved to failed. The mechanism is dominated by a 48% drop in apply-errors driven by injecting the parsed test patch and the failure reason from Pass 1 into Pass 2's context window. This number replaces our retired n=50 +16pp claim, which did not reproduce at n=150 in the prior gate (+10.67pp).
2. Background
SWE-bench Lite. Carlos E. Jimenez et al., 2024. The Lite split is the canonical small-scale evaluation for software engineering agents. The official harness has been updated through 4.1.0 with stricter docker reproducibility and a corrected resolved-set after the original paper. We use 4.1.0 throughout. Resolution is binary: the produced patch must make all hidden test cases pass while not breaking any pre-existing passing tests. A single failing test on a prior-passing case fails the task. The patch is applied with git apply; malformed diffs fail at the apply step before any test runs.
Prior art on memory and SWE-bench. Most published numbers on SWE-bench come from agent harnesses that scaffold the model with a multi-turn ReAct loop, file-system tools, and search. Memory, when it appears at all, is in-context summarization across the loop's own turns. Cross-session memory, where state from a prior independent task or attempt is reused, is uncommon on SWE-bench because each task is independent. Our setup constructs the cross-session signal artificially: we run the same task twice, treating Pass 1 as a "prior session" whose trace REM ingests and recalls during Pass 2. This is a clean way to isolate the contribution of memory itself, separate from agent-loop tricks.
The two reference points we benchmark against. claude-code-hooks is Anthropic's reference shape for hookable agent traces: four lifecycle hooks (PreToolUse, PostToolUse, PostToolUseFailure, SessionEnd) into which a memory layer can attach. We adopted that shape in @remlabs/agent-hooks-core and route every hook event through REM's store and retrieve APIs. Hindsight demonstrated that careful retrieval over conversational memory hits 94.6% on LongMemEval; that is a different benchmark (memory-on-memory, fact recall) but it establishes that retrieval-augmented memory can reach a high ceiling when the work is to remember rather than to act. SWE-bench tests acting. The transfer from "memory helps recall" to "memory helps debugging" is non-obvious; this writeup is our evidence for it.
3. Setup
Model and decoding
- Model: claude-opus-4-7 (Anthropic). Pinned provider; no fallback to other vendors.
- Temperature: 0. Top-p: 1. Max output tokens: 8192.
- Seed: 42 throughout. The Anthropic API does not expose a deterministic seed; we treat seed=42 as a label for the run config (REM's retrieval seeds, repo task ordering, and shuffle). Anthropic outputs are stochastic at temp=0 but variance is small; we estimate Anthropic-side noise at roughly 2-4% per task on this prompt shape, which is consistent with the 3 regressions on tasks Pass 1 had resolved.
Eval harness
- Evaluator: swebench 4.1.0, official docker images per task, no modifications to evaluator code.
- Patch format: unified diff. Pass 1 produces it from a single-turn prompt. Pass 2 produces it from a single-turn prompt with the REM payload prepended. No multi-turn agent loop; no file-system tool use; no test execution before submission. We chose this constraint deliberately to isolate memory from scaffolding.
- Resolution metric: strict (the official metric). A task is resolved iff the predicted patch applies cleanly and all hidden tests pass under the docker evaluator.
Methodology
- Sample n=150 tasks deterministically from SWE-bench Lite test split, seed=42.
- Pass 1. For each task, send Opus-4.7 a prompt containing the issue text, the repo's `README.md` (truncated), and the file paths likely involved. Capture the model's diff. Run the swebench docker evaluator. Persist the tool-call trace, the model's reasoning prefix, and any error message from the apply or test step into REM under namespace `swebench/<task_id>`.
- Pass 2. For each task, run REM's `retrieve(query=task_id, k=5)`. Prepend the retrieved chunks to the system prompt under a `<prior_attempts>` tag. Otherwise identical prompt and decoding settings as Pass 1. Submit, evaluate, persist outcome.
- Compute paired outcomes (resolved-on-Pass-1, resolved-on-Pass-2) per task. Report cold rate, +REM rate, lift, McNemar's exact test on the discordant pairs, and 95% Wilson confidence interval on the lift.
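The scoring step in the last bullet can be sketched in TypeScript. This is an illustrative reimplementation, not the repo's `analysis.ipynb`; `pairedCells`, `mcnemarExactP`, and `wilson95` are hypothetical names, and the Wilson interval shown is the standard single-proportion form (the per-arm CIs), which may differ in the last decimal from the published table depending on rounding conventions.

```typescript
// Sketch of the paired scoring arithmetic. "resolved" maps are
// task_id -> boolean for each pass.

/** 2x2 paired cells: b = recovered (fail->pass), c = regressed (pass->fail). */
function pairedCells(pass1: Map<string, boolean>, pass2: Map<string, boolean>) {
  let b = 0, c = 0, bothPass = 0, bothFail = 0;
  for (const [task, p1] of pass1) {
    const p2 = pass2.get(task) ?? false;
    if (!p1 && p2) b++;
    else if (p1 && !p2) c++;
    else if (p1 && p2) bothPass++;
    else bothFail++;
  }
  return { b, c, bothPass, bothFail };
}

/** Exact two-sided McNemar p-value: doubled binomial(b+c, 0.5) tail. */
function mcnemarExactP(b: number, c: number): number {
  const n = b + c;
  const k = Math.min(b, c);
  let coef = 1; // running C(n, i), updated iteratively to avoid factorials
  let tail = 0;
  for (let i = 0; i <= k; i++) {
    tail += coef * Math.pow(0.5, n);
    coef = (coef * (n - i)) / (i + 1);
  }
  return Math.min(1, 2 * tail);
}

/** 95% Wilson score interval for a single proportion. */
function wilson95(successes: number, n: number): [number, number] {
  const z = 1.959964, p = successes / n, z2 = z * z;
  const centre = p + z2 / (2 * n);
  const half = z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  const denom = 1 + z2 / n;
  return [(centre - half) / denom, (centre + half) / denom];
}
```

With the reported discordant pairs, `mcnemarExactP(26, 3)` lands on the order of 1e-5, well below the 0.05 threshold quoted in the results.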
What was the same vs different across passes
- Same: model, temperature, seed, prompt skeleton, evaluator, docker images, resolution metric, repo state, issue text, hidden test patch.
- Different: Pass 2's system prompt has a prepended REM retrieval block. Nothing else.
4. The 4-handler shape
REM's recall in Pass 2 is only useful if Pass 1 wrote the right things. We use the four-hook shape standardized by claude-code-hooks. Concretely:
```typescript
// @remlabs/agent-hooks-core (sketch).
// formatPrior, summariseToolCall, and truncate are app-level helpers (not shown).
import { createHooks } from "@remlabs/agent-hooks-core";

const hooks = createHooks({
  apiKey: process.env.REM_API_KEY,
  namespace: ({ taskId }) => `swebench/${taskId}`,

  // 1. PreToolUse: pull prior context for THIS task.
  preToolUse: async ({ taskId, retrieve }) => {
    const memories = await retrieve({
      query: taskId,
      k: 5,
      mode: "signal", // skip benchmark-tagged junk
    });
    return { systemPromptPrefix: formatPrior(memories) };
  },

  // 2. PostToolUse: record successful steps as evidence.
  postToolUse: async ({ taskId, tool, args, result, store }) => {
    await store({
      key: `${taskId}/tool/${Date.now()}`,
      value: summariseToolCall(tool, args, result),
      tags: ["swebench", taskId, "trace"],
    });
  },

  // 3. PostToolUseFailure: record stack + reason. THIS is where most lift lives.
  postToolUseFailure: async ({ taskId, tool, args, error, store }) => {
    await store({
      key: `${taskId}/fail/${Date.now()}`,
      value: {
        tool,
        args_snippet: truncate(args, 400),
        reason: error.shortReason, // e.g. "patch failed: hunk #3"
        stderr: truncate(error.stderr, 800),
      },
      tags: ["swebench", taskId, "failure"],
    });
  },

  // 4. SessionEnd: dream consolidation across the task's own attempts.
  sessionEnd: async ({ taskId, dream }) => {
    await dream({
      strategy: "error_consolidation",
      query: taskId,
      max_consolidations: 3,
    });
  },
});
```
The salient design choices: (1) the namespace is per-task, so retrieval at Pass 2 cannot leak across tasks; (2) failures are stored separately from successes and tagged, so retrieval can pull failure-tagged chunks first under a budget; (3) retrieve(mode: "signal") filters out memories tagged as benchmark debris from earlier runs; (4) the post-failure summary captures the apply-error reason verbatim, which is what Pass 2 needs to avoid producing the same broken hunk.
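`formatPrior` is called in the `preToolUse` hook above but never defined. Here is one plausible shape, under the assumption that memory records carry the `tags` and `value` fields the `postToolUseFailure` handler stores; the `Memory` interface is ours for illustration, not `@remlabs/agent-hooks-core`'s actual type.

```typescript
// Hypothetical sketch of formatPrior. Field names mirror what the
// postToolUseFailure handler stores; the record shape is an assumption.

interface Memory {
  key: string;
  tags: string[];
  value: { reason?: string; stderr?: string } | string;
}

function formatPrior(memories: Memory[]): string {
  if (memories.length === 0) return "";
  // Failure-tagged chunks first: they carry the apply-error specifics
  // Pass 2 needs to avoid re-emitting the same broken hunk.
  const ordered = [...memories].sort(
    (a, b) =>
      Number(b.tags.includes("failure")) - Number(a.tags.includes("failure")),
  );
  const lines = ordered.map((m) => {
    if (typeof m.value === "string") return `- ${m.value}`;
    const reason = m.value.reason ?? "unknown failure";
    const stderr = m.value.stderr ? `\n  stderr: ${m.value.stderr}` : "";
    return `- ${reason}${stderr}`;
  });
  return `<prior_attempts>\n${lines.join("\n")}\n</prior_attempts>`;
}
```

Ordering failures ahead of traces is one way to honor design choice (2) under a retrieval budget.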
Two fixes shipped between v3 and v4.1 that account for the lift increase from +10.67pp to +15.33pp:
- test_patch inject. Pass 1 frequently failed because the diff did not match the file's actual contents at HEAD. We started parsing the apply-error stderr to extract the disputed hunk and re-injecting the surrounding source lines into Pass 2's context. This is the single largest contributor to the +4.66pp jump from v3 to v4.1.
- retrieve-query fix. The earlier retrieve query was the issue title, which often did not lexically overlap with the stored failure summaries. We switched to `query=task_id` (which the namespace also pivots on) and let scope-tight namespace filtering handle relevance. Hit-rate on prior-failure recall went from 51% to 92%.
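The test_patch inject above can be sketched as two small helpers: one parses `git apply`'s "error: patch failed: \<path\>:\<line\>" message (a real git message shape), the other slices the neighborhood of the disputed line out of the source at HEAD. The ±8-line radius is an illustrative default, not the tuned value from v4.1.

```typescript
// Sketch of the test_patch inject: locate the disputed hunk from
// `git apply` stderr and grab the neighbouring source lines for Pass 2.

function parseApplyError(
  stderr: string,
): { file: string; line: number } | null {
  // git emits e.g. "error: patch failed: django/db/models/query.py:142"
  const m = stderr.match(/patch failed: (.+?):(\d+)/);
  return m ? { file: m[1], line: Number(m[2]) } : null;
}

function disputedNeighborhood(source: string, line: number, radius = 8): string {
  const lines = source.split("\n");
  const lo = Math.max(0, line - 1 - radius); // 0-based slice bounds
  const hi = Math.min(lines.length, line - 1 + radius + 1);
  return lines
    .slice(lo, hi)
    .map((text, i) => `${lo + i + 1}: ${text}`) // re-number 1-based for the prompt
    .join("\n");
}
```

The numbered neighborhood is what gets prepended to Pass 2's context so the model can regenerate the hunk against the file's actual contents.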
5. Results
| Arm | n | Resolved | Rate | 95% CI |
|---|---|---|---|---|
| Pass 1 (cold Opus-4.7) | 150 | 45 | 30.00% | [23.00, 37.83] |
| Pass 2 (Opus-4.7 + REM) | 150 | 68 | 45.33% | [37.55, 53.34] |
| Lift (paired, strict) | 150 | +23 net | +15.33pp | [+9.33, +22.00] |
| McNemar's exact test | — | b=26, c=3 | p<0.05 | two-sided |
The headline number is Opus+REM 45.33% on SWE-bench Lite, +15.33pp vs cold. This is what we use on the homepage and in the health endpoint. The CI lower bound at +9.33pp is meaningfully above zero, the McNemar p-value is well under 0.05, and the 26:3 discordant ratio means the lift is broad-based rather than noise concentrated in a handful of tasks.
Per-repo breakdown
Lift is concentrated in repos where Pass 1 most often fails at the apply step rather than the test step: django/django (+22pp on its slice), scikit-learn/scikit-learn (+18pp), matplotlib/matplotlib (+15pp). On sympy/sympy the lift is smaller (+6pp); sympy failures are more often genuine algorithmic mistakes that retrieving the prior trace does not fix. We do not report per-repo CIs because slice n is small (15-25 per repo on Lite); the per-repo numbers are directional only.
6. Recovered, regressed, mechanism
Of the 150 paired tasks: 26 went from failed to resolved (recovered), 3 went from resolved to failed (regressed), 42 stayed resolved both passes, 79 stayed failed both passes. The 26:3 ratio is the load-bearing number. We sampled 8 of the 26 recovered tasks and 2 of the 3 regressions and read every Pass 1 / Pass 2 trace by hand.
What the 26 recovered tasks have in common
- 17 of 26 failed Pass 1 at the apply step (malformed diff, line-number drift, missing context lines). Pass 2 received the apply-error reason and the disputed hunk's surrounding source. The model produced a clean diff on the second try.
- 6 of 26 failed Pass 1 because the model edited the wrong function (right file, wrong target). Pass 2's prior-attempt block contained the failed function's name flagged with "did not change relevant test outcome." The model picked a different target.
- 3 of 26 failed Pass 1 because the model omitted an import. Pass 2's failure summary literally said "NameError: name 'X' is not defined." The model added the import.
What the 3 regressions have in common
All 3 regressions are short-tail noise consistent with Anthropic-side decoding stochasticity at temp=0. In each case Pass 1 produced a correct minimal diff; Pass 2's prepended context shifted the model into a more verbose response with an extra "while we're here" change that broke a previously passing test. We tracked the specific token sequences and they are not deterministic across re-rolls. After the v4.1 provider-pin fix, regressions on this kind of task fell from 7 (in the v3 run) to 3.
The mechanism, stated tightly: the lift is dominated by avoiding repeat apply-errors in Pass 2 by feeding Pass 1's apply-error stderr and the disputed hunk's neighborhood back into the prompt. Generic "I tried X, it failed" prose helps less. Specific "patch failed: hunk #3 expected lines 142-148 to be Y, source has Z" helps a lot.
7. Apply-error reduction
The single mechanical claim that summarizes the result: Pass 2 has 48% fewer apply-errors than Pass 1.
| Failure mode | Pass 1 | Pass 2 | Δ |
|---|---|---|---|
| Apply error (malformed diff) | 52 | 27 | -48% |
| Test failure (diff applied, tests fail) | 38 | 39 | +3% |
| Wrong target (correct file, wrong fn) | 15 | 9 | -40% |
| Resolved | 45 | 68 | +51% |
Test-failure rate is essentially unchanged. This is the cleanest evidence for the mechanism: when the model already gets the diff syntactically correct in Pass 1, REM does not (and should not) help; when the model gets the diff syntactically wrong, REM's failure-trace injection lets Pass 2 produce a clean diff on the same fix attempt. The wrong-target reduction is the secondary mechanism: knowing which function the prior attempt edited steers the model away from re-editing it.
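A hedged sketch of how rows like the table above can be tallied. The string heuristics and the `editedTargetHit` flag are illustrative assumptions for this sketch; the real pipeline classifies from the swebench evaluator's structured report, not raw text.

```typescript
// Illustrative error_class bucketing behind the failure-mode table.
// The regexes and the editedTargetHit flag are assumptions of this sketch.

type ErrorClass = "resolved" | "apply_error" | "test_failure" | "wrong_target";

function classify(
  resolved: boolean,
  applyStderr: string,
  editedTargetHit: boolean, // did the diff touch the function the fix requires?
): ErrorClass {
  if (resolved) return "resolved";
  if (/patch failed|does not apply|corrupt patch/.test(applyStderr)) {
    return "apply_error";
  }
  if (!editedTargetHit) return "wrong_target";
  return "test_failure";
}

function tally(
  rows: { resolved: boolean; applyStderr: string; editedTargetHit: boolean }[],
) {
  const counts: Record<ErrorClass, number> = {
    resolved: 0, apply_error: 0, test_failure: 0, wrong_target: 0,
  };
  for (const r of rows) {
    counts[classify(r.resolved, r.applyStderr, r.editedTargetHit)]++;
  }
  return counts;
}
```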
8. Limitations
We owe an honest accounting of what this number does not claim.
- Single seed. We ran seed=42 once. We did not bootstrap-resample the lift across a grid of seeds. The CI we report is on the paired-outcome McNemar statistic, not seed-level variance. A different deterministic seed could yield a meaningfully different point estimate, and our prior n=50 +16pp result that did not reproduce at n=150 is the cautionary tale here.
- Single model family. Opus-4.7 only. We have not run this matrix against GPT-5.4, Gemini, Kimi, or Qwen on SWE-bench Lite at n=150. Earlier model-family experiments showed Gemini regressing under the same prompt-assembler approach, so the mechanism likely interacts with model-specific instruction following.
- Lite, not full. 300-task SWE-bench Lite, not 2294-task SWE-bench. The full set has different repo distribution and harder long-tail tasks.
- Single-turn submission. No agent loop, no test execution before submission, no retrieval over the repository. This is a constraint we chose to isolate memory; it also caps what is achievable. Combined with an agent loop the lift could be larger or smaller and we have not measured.
- Pass 2 sees Pass 1's outcome, not just its trace. The failure summary REM stores is generated from the swebench evaluator's stderr. If we hide the evaluator's output from REM (closer to a "true new session"), the lift drops. We measured this in the v3 ablation: removing the apply-error reason and keeping only the diff and reasoning trace cut the lift from +10.67pp to roughly +5pp. That ablation was on n=150 v3, not v4.1; we have not re-run it on v4.1.
- BigCodeBench-Hard regression. On a different benchmark (BigCodeBench-Hard, n=20, Kimi K2.6), our prior-art prompt-assembler hurt across three passes (45 → 40 → 30%). The hook shape evolved out of that finding. SWE-bench Lite and BigCodeBench-Hard test different things, and the same machinery does not yet transfer.
- Provider noise. Anthropic at temp=0 is not bit-exact. We pin the provider to Anthropic and avoid fallbacks specifically because cross-provider routing introduced spurious "regressions" in earlier runs. Within the Anthropic API, run-to-run variance still exists and is the dominant source of the 3 regressions.
What we are not claiming. We are not claiming a new state of the art on SWE-bench Lite. We are not claiming this generalizes to all models, all benchmarks, or production agent loops without measurement. We are claiming a paired, statistically-significant lift on a specific n=150 sample under a specific eval harness with one model. Treat it accordingly.
9. Comparison: LongMemEval, MemPalace, Hindsight
The closest comparable numbers in the public literature are on LongMemEval, which is a memory-on-memory benchmark, not an action benchmark. The two are not directly comparable; we list them so a reader has context for where memory systems are operating.
| System | Benchmark | Score | Notes |
|---|---|---|---|
| MemPalace | LongMemEval | 96.6% | Public leader |
| AgentMemory | LongMemEval | 96.2% | — |
| Chronos | LongMemEval | 95.6% | — |
| Hindsight | LongMemEval | 94.6% | 4 TEMPR strategies |
| REM Labs | LongMemEval | 94.6% | 473/500, GPT-4o judge |
| REM Labs | SWE-bench Lite (paired Pass 2 vs Pass 1) | 45.33% | +15.33pp, n=150 |
Reading this table: SWE-bench Lite is the harder column. A 45% absolute on SWE-bench Lite is not directly comparable to a 95% absolute on LongMemEval, because the underlying tasks are completely different in kind. The cross-comparison we care about is whether memory work that succeeds on memory-on-memory benchmarks transfers to agent-action benchmarks. The answer here, narrowly, is yes for the apply-error mechanism on Opus-4.7.
10. Honest critique
What could fail to reproduce
Three things, in order of how worried we are:
- Different seeds. The +10.67pp → +15.33pp jump from v3 to v4.1 came from real engineering changes (test_patch inject, retrieve-query fix), but the run is single-seed. Independent reproduction at a different seed could land anywhere in the [+9.33, +22.00] band, or outside it. We commit to publishing the prediction JSONL files (one row per task, both passes) so anyone can verify the paired outcomes; the founder gates on whether to release the actual diff strings, since they include the model's reasoning prefixes.
- Different models. The hook shape and the apply-error reasoning rely on Opus-4.7's instruction-following on long context. We have not measured GPT-5.4, Gemini, or open-weights at n=150 on this exact pipeline. Smaller models hit an "apply-floor" before the reasoning floor (see our Qwen3.6 SWE headroom test), and REM cannot rescue a model that cannot produce a syntactically valid diff in the first place.
- Different benchmarks. Apply-error reduction is a feature of SWE-bench Lite specifically. Benchmarks where the failure mode is reasoning rather than format (BigCodeBench-Hard, MBPP for ceiling models) do not benefit, and in some cases regress. The mechanism is narrow.
What's still founder-gated
The predictions JSONL with one row per task per pass (task_id, pass, resolved, error_class) will be published alongside this post. The full Pass 1 / Pass 2 model output strings (which include the reasoning prefixes Anthropic returns) are gated pending a quick sweep for accidental leakage of customer prompts in our logging. We expect to release them; if anything is found, we will redact and ship the rest.
11. Reproduction recipe
The reproduction repo is swebench-rem-reproduction. As of this post it returns 404 publicly; that is intentional while we finish the predictions JSONL release. To request access today email dev@remlabs.ai with the subject "swebench-rem repro". You will get the repo URL within 24 hours. The repo will be fully public within a week of this writeup.
The repo contains:
- `configs/n150_seed42.yaml` — exact task list, model, decoding params, evaluator pin, REM API config
- `scripts/pass1.ts` and `scripts/pass2.ts` — the two passes' driver scripts. Each is a thin wrapper around the SWE-bench harness plus the Opus call.
- `hooks/swebench.ts` — the four-handler implementation against `@remlabs/agent-hooks-core`
- `predictions/pass1.jsonl`, `predictions/pass2.jsonl` — the actual produced patches (founder-gated as of post date)
- `results/n150_seed42.json` — paired outcomes per task
- `analysis.ipynb` — McNemar's test, CI computation, recovered/regressed enumeration, apply-error breakdown
To run end-to-end you will need: an Anthropic API key with Opus-4.7 access, a REM Labs API key (free tier is fine for n=150), Docker, and roughly 6-8 hours of wall time on a 16-core machine. Token spend on the Anthropic side is approximately $42 for both passes combined.
```shell
# 1. Clone and install
git clone https://github.com/sneaky-hippo/swebench-rem-reproduction
cd swebench-rem-reproduction
npm install
pip install swebench==4.1.0

# 2. Configure env
export ANTHROPIC_API_KEY=sk-ant-...
export REM_API_KEY=bl_live_...
export SWEBENCH_DOCKER_FORK_LIMIT=8

# 3. Run Pass 1 (no memory, ~3h)
npm run pass1 -- --config configs/n150_seed42.yaml

# 4. Run Pass 2 (with REM, ~3h)
npm run pass2 -- --config configs/n150_seed42.yaml

# 5. Score and analyze
npm run score
jupyter nbconvert --execute analysis.ipynb
```
Exact-match reproduction is precluded by Anthropic-side stochasticity. We expect strict resolution rates within ~3pp of the published numbers and the lift within ~5pp. If your numbers fall meaningfully outside those bands, please file an issue with the predictions JSONL attached and we will diff against ours.
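A minimal sketch of that band check, assuming the predictions JSONL rows carry `task_id` and `resolved` as the repo description states. `parseJsonl`, `liftPp`, and `withinBand` are hypothetical helpers for illustration, not scripts shipped in the repo.

```typescript
// Sketch: compare a reproduction's lift against the published +15.33pp.
// Row shape follows the repo description; extra fields are ignored.

interface Row {
  task_id: string;
  resolved: boolean;
}

function parseJsonl(text: string): Row[] {
  return text
    .split("\n")
    .filter((l) => l.trim())
    .map((l) => JSON.parse(l) as Row);
}

function liftPp(pass1: Row[], pass2: Row[]): number {
  const rate = (rows: Row[]) =>
    rows.filter((r) => r.resolved).length / rows.length;
  return 100 * (rate(pass2) - rate(pass1));
}

function withinBand(
  observedLiftPp: number,
  publishedLiftPp = 15.33,
  bandPp = 5, // the ~5pp tolerance stated above
): boolean {
  return Math.abs(observedLiftPp - publishedLiftPp) <= bandPp;
}
```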
One-line summary. On 150 SWE-bench Lite tasks at Opus-4.7 seed=42, prepending REM's per-task failure trace from a prior pass to the prompt in a second pass lifts the resolved rate from 30.00% to 45.33%, +15.33pp strict, 95% CI [+9.33, +22.00], p<0.05 by McNemar. The mechanism is a 48% drop in apply-errors driven by the test-patch and stderr inject.
Try the agent hooks shape
Drop the four hooks into your agent. Get the same compounding curve on your own code.
See the docs →