REM Labs is live: agent memory that actually compounds

Most "AI memory" today is a key-value store with cosine search. Useful, but it does not change what your agent can do. We built REM around a narrower question: can memory across sessions move a benchmark needle on real software-engineering work? After a year of measuring, the answer is yes, by +15.33 percentage points on SWE-bench Lite at n=150 (p<0.05). Today we are opening it up.

The thesis

Agent memory is everywhere. Every framework ships one. Every model gets a "memory" feature on launch. Most of them are storage with a vector index bolted on, and most of them do not move a benchmark. We were skeptical that any of this was real until we built it ourselves.

The thing nobody is shipping is memory that compounds across sessions. Your agent fails on a task, learns something, and the next session starts smarter. Not in an "I remember your name" way. In a "the patch you tried last time blew up at hunk 3, do not produce that hunk again" way. That is what we built. It is a small set of hooks and a retrieval API that runs around your existing model call. Drop-in for Claude Code; npm install for everything else.

The number

Two passes over 150 SWE-bench Lite tasks. Same model (Claude Opus-4.7), same seed, same evaluator (the official swebench 4.1.0 Docker harness). Pass 1, cold: 45/150 resolved (30.00%). Pass 2, with REM injecting the prior pass's failure traces: 68/150 resolved (45.33%). That is +15.33 percentage points, 95% CI [+9.33, +22.00], McNemar p<0.05. 26 tasks recovered, 3 regressed.
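
If you want to sanity-check the significance claim yourself, the exact McNemar p-value follows from the discordant counts alone. A minimal sketch using our published counts as inputs (this is not our evaluation harness):

```typescript
// Exact McNemar test from discordant pair counts.
// b = tasks failed in Pass 1 but resolved in Pass 2 (recovered)
// c = tasks resolved in Pass 1 but failed in Pass 2 (regressed)
function binom(n: number, k: number): number {
  let r = 1;
  for (let i = 1; i <= k; i++) r = (r * (n - i + 1)) / i;
  return r;
}

function mcnemarExact(b: number, c: number): number {
  const n = b + c;
  const k = Math.min(b, c);
  // Two-sided exact binomial test with p = 0.5 on the discordant pairs.
  let tail = 0;
  for (let i = 0; i <= k; i++) tail += binom(n, i);
  return Math.min(1, (2 * tail) / Math.pow(2, n));
}

const p = mcnemarExact(26, 3);
console.log(p.toFixed(6)); // well below 0.05
```

With b=26 and c=3 the exact p-value lands around 1.5e-5, which is why the paired design matters more than the raw pass rates.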

The mechanism is unsexy and that is the point: 48% fewer apply-errors. Most agents fail SWE-bench tasks not because they cannot reason about the bug, but because they produce a malformed diff and never get to the test step. REM stores Pass 1's apply-error stderr and the disputed hunk's neighborhood, retrieves it before Pass 2, and the model produces a clean diff on the second try. We wrote up the methodology, the per-task analysis, and the limitations in a separate technical post if you want the receipts.
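
To make the mechanism concrete, here is roughly the shape of a stored apply-failure. The field names and the example instance id are illustrative, not REM's published schema:

```typescript
// Illustrative shape of a stored apply-failure memory.
// Field names are examples only, not a schema contract.
interface ApplyFailureMemory {
  task: string;           // task identifier, e.g. a SWE-bench instance id
  tool: "apply_patch";
  stderr: string;         // the apply-error output, verbatim
  hunk: {
    file: string;
    header: string;       // the @@ line of the disputed hunk
    neighborhood: string; // surrounding source lines at failure time
  };
  tags: string[];         // narrow tags keep Pass 2 retrieval precise
}

const example: ApplyFailureMemory = {
  task: "example__repo-1234", // hypothetical instance id
  tool: "apply_patch",
  stderr: "error: patch failed: src/core.py:88",
  hunk: {
    file: "src/core.py",
    header: "@@ -85,7 +85,9 @@",
    neighborhood: "…the lines around the failed hunk…",
  },
  tags: ["failure", "apply-error"],
};
```

Storing the stderr and the hunk's neighborhood together is what lets Pass 2 avoid reproducing the exact malformed hunk.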

Honest framing: this is a single-seed, single-model paired result on SWE-bench Lite. Not a state-of-the-art claim. Not a "memory solves everything" claim. The CI lower bound of +9.33pp is meaningfully above zero, the McNemar discordant-pair ratio (b:c) is 26:3, and we will publish the predictions JSONL so you can verify it. Take it as a real signal on a specific shape, not a promise.

What's free, what's paid, who it's for

REM is built for developers and their teams. If you ship code with an agent in the loop, REM is yours. If you do not, the consumer surface at remlabs.ai is a separate product (your second brain, morning brief, the works) and you can ignore everything below.

Free

$0

10K memories, 1 namespace, every endpoint. The agent-hooks shape, retrieval, dream consolidation, MCP. This is enough to ship.

Pro

$29 / mo

1M memories, unlimited namespaces, the full Dream Engine pipeline, and a real SLA. Most of you land here.

Team

$99 / seat

Shared namespaces, SSO, audit log, SOC 2 Type I (in flight, Q4 2026). For when more than one person is touching the same agent.

Full pricing is at /pricing-developer. There is no credit card on the free tier. There is no "trial" pretending to be free. Sign up, get a key, ship.

The shape: four hooks, one retrieve call

If you are using claude-code-hooks, you already know the shape. REM exposes four lifecycle hooks (PreToolUse, PostToolUse, PostToolUseFailure, SessionEnd) in which you store events and retrieve prior context. The npm package wires those hooks into our retrieval and dream-consolidation APIs:

```shell
npm install @remlabs/agent-hooks-core
export REM_API_KEY=bl_live_...
```

```typescript
import { createHooks } from "@remlabs/agent-hooks-core";

export const hooks = createHooks({
  apiKey: process.env.REM_API_KEY,
  namespace: ({ projectId }) => `proj/${projectId}`,
  // Before each tool call, pull the top-k prior memories into the prompt.
  preToolUse: async ({ retrieve }) => ({
    systemPromptPrefix: await retrieve({ k: 5, mode: "signal" }),
  }),
  // On failure, store a tightly scoped event for the next session.
  postToolUseFailure: async ({ tool, error, store }) => {
    await store({ value: { tool, reason: error.shortReason }, tags: ["failure"] });
  },
});
```

That is the whole contract. Everything else is plumbing. Full docs at /agent-hooks with adapters for the popular agent loops (LangGraph, Continue, Cursor, MCP, raw OpenAI/Anthropic SDKs).
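
If you want a feel for the contract without installing anything, here is a toy in-memory version of the same four-hook shape. It is a sketch of the idea, not the @remlabs/agent-hooks-core implementation: storage is a Map, retrieval is tag matching, and recency stands in for real ranking.

```typescript
// Toy in-memory version of the four-hook contract (illustrative only).
type Memory = { value: unknown; tags: string[] };

function createToyHooks(namespace: string) {
  const store = new Map<string, Memory[]>();
  const put = (m: Memory) => {
    const list = store.get(namespace) ?? [];
    list.push(m);
    store.set(namespace, list);
  };
  const retrieve = (opts: { k: number; tag?: string }) =>
    (store.get(namespace) ?? [])
      .filter((m) => !opts.tag || m.tags.includes(opts.tag))
      .slice(-opts.k); // most recent k, a stand-in for real ranking

  return {
    // Before a tool call: surface prior failures in this namespace.
    preToolUse: () => ({ priorFailures: retrieve({ k: 5, tag: "failure" }) }),
    // On failure: write a tightly scoped, tagged event.
    postToolUseFailure: (tool: string, reason: string) =>
      put({ value: { tool, reason }, tags: ["failure"] }),
    // At session end: report how much was stored.
    sessionEnd: () => store.get(namespace)?.length ?? 0,
  };
}

// Session 1 fails; the next preToolUse starts with that failure in context.
const h = createToyHooks("proj/demo");
h.postToolUseFailure("apply_patch", "hunk 3 failed to apply");
console.log(h.preToolUse().priorFailures.length); // 1
```

The point of the toy is the scoping: everything is keyed by namespace and tag, so retrieval stays narrow by construction.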

One thing that did not go to plan

We tried something else first. Our original approach was a "prompt assembler" that stitched together every prior memory it could find and dumped them into a long preamble. It worked great in early ad-hoc tests. It also actively hurt on BigCodeBench-Hard at n=20 with Kimi K2.6: 45% → 40% → 30% across three passes. The model was getting more context and getting worse, because most of the context was off-topic.

The hook shape came out of that finding. The four hooks are the smallest contract we could find that lets you write tightly-scoped events at the right time and retrieve only what is in the current task's namespace. The +15.33pp on SWE-bench Lite is what happens when retrieval is narrow and tagged correctly. The -15pp on BigCodeBench-Hard is what happens when it is not. We owe you the second number alongside the first.

The other place we got humbled: smaller open-weights models hit an "apply-floor" before they hit a reasoning-floor on single-turn SWE-bench. REM cannot rescue a model that cannot produce a syntactically valid diff in the first place. The right play there is REM plus an agent harness that actually executes a turn of tool use, not REM plus a single-shot prompt. We have prototypes wired into Cursor and claude-code-hooks; that is the next surface we polish.
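
The apply-floor is easy to observe directly. Below is a quick structural check on a unified diff, the kind of validation that fails before any test ever runs. A sketch, not our harness; it only checks that each hunk header's line counts match the hunk body:

```typescript
// Minimal structural check for a unified diff: does each @@ hunk header's
// declared line count match its body? Models below the apply-floor fail here.
function hunkCountsValid(diff: string): boolean {
  const lines = diff.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const m = lines[i].match(/^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@/);
    if (!m) continue;
    let oldN = m[1] ? parseInt(m[1], 10) : 1; // lines on the "-" side
    let newN = m[2] ? parseInt(m[2], 10) : 1; // lines on the "+" side
    for (let j = i + 1; j < lines.length && (oldN > 0 || newN > 0); j++) {
      const c = lines[j][0];
      if (c === "-") oldN--;
      else if (c === "+") newN--;
      else if (c === "\\") { /* "\ No newline" marker: counts neither side */ }
      else if (c === " " || lines[j] === "") { oldN--; newN--; } // context
      else return false; // stray line inside a hunk body
    }
    if (oldN !== 0 || newN !== 0) return false; // header lies about counts
  }
  return true;
}
```

A diff that fails this check never reaches `git apply`, let alone the test suite, which is why apply errors dominate the failure traces REM stores.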

What's next: show up

Three places to land. Sign up if you want to put it in front of code. Discord if you want to lurk or ask why a thing is the way it is. The technical writeup if you want the per-task analysis and the reproduction recipe.

One ask: if you find a benchmark where REM's hook shape regresses, file an issue. The +15.33pp number is real and we are proud of it; the -15pp on BigCodeBench-Hard is also real and we will not learn the shape if you do not tell us. Memory that compounds is a design problem, and the way you find the shape is by getting humbled in public.

Ship something tonight.

— Rodney
Founder, REM Labs

Get a key. Ship the hook. See the curve.

Free tier. No credit card. Working code in five minutes.

Get started → Read the docs
April 27, 2026 · REM Labs · Filed under Launch