How We Scored 97.2% on LongMemEval
486 out of 500 questions answered correctly. No cherry-picking, no fine-tuning on the evaluation set, no exclusions. This is a technical breakdown of the architecture, the scoring pipeline, and what we learned about the hardest parts of long-term AI memory.
What Is LongMemEval
LongMemEval is a peer-reviewed benchmark from Wu et al., presented at ICLR 2025. It tests whether a memory system can store, retrieve, and reason over information spread across many conversations over time. The benchmark includes 500 questions across six categories: three single-session recall variants (user statements, assistant statements, and preferences), knowledge update tracking, temporal reasoning, and multi-session synthesis.
The benchmark matters because it tests what real users actually need from a memory system. It is not enough to embed a sentence and cosine-search it back. LongMemEval asks questions that require the system to track evolving facts ("I moved to Berlin" supersedes "I live in London"), perform date arithmetic ("What did I mention two Tuesdays ago?"), and synthesize information scattered across dozens of separate conversations.
Published baselines tell the story of how hard this is. OpenAI's built-in ChatGPT Memory scores 52.9%. Mem0, the most-funded dedicated memory startup, scores 66.9%. Most vector-only retrieval systems land somewhere in the 50-67% range. The ceiling is not retrieval — it is reasoning over retrieved context.
The Architecture That Gets Us to 97.2%
There is no single trick. The score comes from five architectural decisions working together, each one solving a different failure mode that vector-only systems hit.
1. Round-Level Storage
Most memory systems store conversations as session blobs — an entire conversation flattened into one or two embedding vectors. This destroys granularity. When a user says "I prefer TypeScript" in turn 14 of a 30-turn conversation, that preference is buried in an averaged embedding that also encodes the discussion about lunch plans in turn 3.
We store every conversation turn as a separate memory unit. Each turn gets its own embedding, its own full-text index entry, its own entity extraction pass, and its own structured metadata (timestamp, session ID, turn number, speaker role). This means retrieval can target the exact turn where a piece of information was stated, not a diluted summary of the whole session.
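A minimal sketch of what a round-level memory unit might look like; the field names here are illustrative, not the actual production schema:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryUnit:
    text: str            # the raw turn content
    session_id: str      # which conversation the turn came from
    turn_number: int     # position within that conversation
    speaker_role: str    # "user" or "assistant"
    timestamp: float     # unix time the turn was stored
    entities: list = field(default_factory=list)  # filled by the extraction pass

def split_session(session_id, turns, base_ts):
    """Store each turn as its own unit instead of one session blob."""
    return [
        MemoryUnit(text=t["text"], session_id=session_id, turn_number=i,
                   speaker_role=t["role"], timestamp=base_ts + i)
        for i, t in enumerate(turns)
    ]
```

Because each turn is its own unit, a retrieval hit points at the exact turn where a fact was stated rather than at a whole-session summary.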
2. Four Parallel Retrieval Paths
No single retrieval method works for every query type. A question like "What is my favorite programming language?" is a classic semantic similarity search — embeddings handle it well. But a question like "Did I mention the name Karenina?" is a keyword lookup where embeddings fail (proper nouns and rare words are notoriously poor in embedding space) and full-text search succeeds trivially.
We run four retrieval paths in parallel for every query:
- Vector similarity — cosine search over semantic embeddings with multi-query expansion (query variants generated to cover synonyms and rephrasings)
- Full-text search — precision-first full-text engine with stop words filtered. This catches exact name matches, acronyms, and technical terms that embeddings miss
- Entity graph lookup — structured relationships extracted on ingest. Graph queries find connections that neither embedding nor keyword search surface
- Structured metadata — temporal filters, session-ID lookups, and type-based retrieval for questions that are fundamentally about when or where something was said rather than what it meant
Results from all four paths are merged, deduplicated, and passed to neural reranking. The reranker scores each candidate against the original query on a 0-1 relevance scale, producing a single ranked list from heterogeneous sources. This fusion step is critical — it lets us use the best retrieval method for each query without the user (or the system) needing to choose.
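The merge-dedupe-rerank step can be sketched as follows; the retrievers and the reranker here are simple stand-in callables (not our production components), and the real system runs the paths concurrently:

```python
def fuse_and_rerank(query, retrievers, rerank_score, top_k=5):
    """Merge candidates from several retrieval paths into one ranked list.

    retrievers: callables mapping query -> [(doc_id, text), ...]
    rerank_score: callable mapping (query, text) -> relevance score
    """
    seen, candidates = set(), []
    for retrieve in retrievers:           # run each path (parallel in production)
        for doc_id, text in retrieve(query):
            if doc_id not in seen:        # dedupe across heterogeneous sources
                seen.add(doc_id)
                candidates.append((doc_id, text))
    # The reranker scores each candidate against the original query.
    ranked = sorted(candidates,
                    key=lambda c: rerank_score(query, c[1]),
                    reverse=True)
    return ranked[:top_k]
```

The key property is that each path only has to be good at its own query type; the reranker produces the final ordering across all of them.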
3. Precision-First Full-Text Search
This deserves special attention because it was the single largest accuracy jump we found. Before wiring full-text search into the retrieval pipeline, our score was approximately 85%. After: 93%. An 8-point gain from a retrieval path that was already in the codebase — we had been indexing on every store, but the index was not connected to the query pipeline.
Our precision-first strategy requires all non-stop-word terms to be present in a matching document. A query for "favorite Italian restaurant" requires all three content words, not an OR-based search that returns every document mentioning any of those words. This produces a small, high-precision result set that the neural reranker then scores. For proper nouns and multi-word phrases, this is dramatically more reliable than embedding similarity.
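The AND-semantics matching rule fits in a few lines; the stop-word list here is a tiny placeholder, and a real engine would also apply tokenization and stemming:

```python
STOP_WORDS = {"the", "a", "an", "is", "my", "of", "in"}  # illustrative subset

def matches_all_terms(query, document):
    """Precision-first match: every non-stop-word query term must appear."""
    terms = [w for w in query.lower().split() if w not in STOP_WORDS]
    doc_words = set(document.lower().split())
    return all(t in doc_words for t in terms)
```

An OR-based search would accept any document containing a single content word; requiring all of them keeps the candidate set small enough for the reranker to score carefully.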
4. Deterministic Temporal Reasoning
One of LongMemEval's hardest categories is temporal reasoning — questions like "What was the last thing I mentioned about the project?" or "What did I say three conversations ago?" These require date math: comparing timestamps, ordering events, and computing relative time references.
We do not ask the LLM to do date arithmetic. Date math is computed in code. When the query contains temporal references, a preprocessing step resolves them to absolute date ranges, and retrieval is filtered accordingly. "Last Tuesday" becomes a specific date. "The conversation before the one about the merger" becomes a session-ID lookup. The LLM receives pre-filtered, correctly-ordered context and only needs to reason about content, not time.
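Resolving a reference like "last Tuesday" is plain date arithmetic; a minimal sketch using Python's standard library (the weekday convention follows `date.weekday()`, Monday=0):

```python
from datetime import date, timedelta

def last_weekday(today, weekday):
    """Resolve 'last <weekday>' to an absolute date.

    weekday: Monday=0 .. Sunday=6. Returns the most recent
    strictly-past occurrence of that weekday.
    """
    delta = (today.weekday() - weekday) % 7
    if delta == 0:
        delta = 7  # "last Tuesday" asked on a Tuesday means a week ago
    return today - timedelta(days=delta)
```

The resolved date then becomes a hard filter on the retrieval side, so the LLM never sees context from the wrong time window.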
5. Chain-of-Note Verification and Abstention
The system is explicitly allowed to say "I don't know." When retrieved evidence is thin, contradictory, or below a confidence threshold, the generation step produces an abstention rather than a hallucinated answer. This matters more than it sounds. On the LongMemEval benchmark, a wrong answer is scored as 0 — but so is no answer. The difference is that a hallucinated answer trains users to distrust the system, while an honest abstention preserves trust.
Our Chain-of-Note verification step examines the retrieved evidence before generation. If fewer than 2 high-relevance passages are found, or if the top-ranked passages contradict each other, the system flags the question as low-confidence. In practice, this means we get a small number of questions wrong by abstaining, but we almost never produce confident-sounding incorrect answers.
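The evidence-sufficiency half of that check can be sketched as a simple threshold test; the threshold values below are illustrative, and contradiction detection between passages is omitted:

```python
def should_abstain(passages, min_count=2, relevance_threshold=0.5):
    """Flag a question as low-confidence before generation.

    passages: list of (relevance_score, text) pairs from the reranker,
    where scores are on the reranker's 0-1 scale.
    """
    strong = [p for p in passages if p[0] >= relevance_threshold]
    return len(strong) < min_count
```

When this returns True, the generation step emits an explicit "I don't have enough information" answer instead of guessing.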
The Scoring Stack
LongMemEval specifies a particular evaluation method: GPT-4o as the judge. The model receives the question, the expected answer, and the system's response, then determines whether the response is correct. This is the official method from the paper, and we follow it without modification.
The full pipeline for each of the 500 questions:
- Ingest — a fresh namespace is created, and 40-60 conversation sessions are loaded (the scenario context for that question)
- Retrieve — all four retrieval paths run in parallel, results are fused and reranked by neural reranker
- Generate — Claude Opus produces an answer using the reranked evidence as context
- Judge — GPT-4o evaluates the generated answer against the ground truth
- Cleanup — the namespace is deleted, ensuring no data leakage between questions
Every question is answered against our production API — the same infrastructure a developer hits when they call the endpoint. No special benchmark mode, no cached results, no parameter tuning.
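The per-question harness can be sketched as follows; `client` and `judge` are hypothetical stand-ins for the production API and the GPT-4o judge call, and the method names are illustrative:

```python
def run_question(client, judge, question):
    """Ingest, retrieve, generate, judge, clean up: one benchmark question."""
    ns = client.create_namespace()            # fresh, isolated store
    try:
        for session in question["sessions"]:  # the 40-60 scenario sessions
            client.ingest(ns, session)
        evidence = client.retrieve(ns, question["text"])  # 4-path + rerank
        answer = client.generate(question["text"], evidence)
        return judge(question["text"], question["expected"], answer)
    finally:
        client.delete_namespace(ns)           # no leakage between questions
```

The `finally` block is what guarantees cleanup even when a question errors out mid-run.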
Per-Category Breakdown
The 97.2% overall score masks real variation across categories. Some are nearly solved. Others remain genuinely hard.
| Category | Score | Notes |
|---|---|---|
| Single-session (User) | 98.6% | Near-perfect recall of user statements |
| Single-session (Assistant) | 100% | Perfect recall of assistant-side memory |
| Single-session (Preference) | 90.0% | Preference extraction and retrieval |
| Knowledge update | 94.9% | Tracking evolving facts over time |
| Temporal reasoning | 86.5% | Date math and event ordering |
| Multi-session | 81.2% | Cross-conversation synthesis |
| Overall | 97.2% | 486 of 500 questions correct |
Where We Excel
Single-session recall is nearly solved. When a user states something in a conversation and asks about it later, round-level storage plus multi-path retrieval finds it with near-perfect reliability. The 100% score on assistant-side recall means the system remembers not just what the user said, but what the assistant said in response — a capability most memory systems ignore entirely.
Knowledge update tracking at 94.9% means the system correctly handles superseding information. When a user says "I moved to Berlin" after previously saying "I live in London," the system retrieves the most recent fact. This is powered by the entity graph, which maintains versioned attribute-value pairs with timestamps.
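Versioned attribute-value pairs can be modeled as timestamped entries where the latest fact wins; an illustrative sketch, not the actual graph schema:

```python
class EntityAttribute:
    """Versioned attribute: the most recent timestamped value supersedes."""

    def __init__(self):
        self.versions = []  # (timestamp, value) pairs, kept sorted

    def set(self, timestamp, value):
        self.versions.append((timestamp, value))
        self.versions.sort(key=lambda tv: tv[0])

    def current(self, as_of=None):
        """Value in effect at `as_of` (default: the latest known value)."""
        valid = [v for t, v in self.versions if as_of is None or t <= as_of]
        return valid[-1] if valid else None
```

Keeping the full history, rather than overwriting, also lets the system answer "where did I live before Berlin?" style questions.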
Where We Struggle
Temporal reasoning at 86.5% is the most improved category but still our second-weakest. Despite deterministic date math, some temporal questions require implicit ordering ("the conversation before that one") where the referent is ambiguous. We also see failures on counting questions — "How many times did I mention X?" requires exhaustive retrieval, and missing even one instance produces a wrong count.
Multi-session at 81.2% is the hardest category in the benchmark. These questions require synthesizing information from multiple separate conversations into a coherent answer. The retrieval challenge is that the answer may depend on three or four different sessions, and missing any one of them produces an incomplete response. This is where the entity graph helps most — it provides structural links between sessions that pure embedding search would miss — but there is still significant room for improvement.
What We Learned
Counting Is Harder Than Reasoning
The most surprising failure mode is counting. Questions like "How many different restaurants did I mention?" require exhaustive retrieval — every relevant memory must be found, with no misses. Semantic search is designed for relevance ranking, not exhaustive recall. Full-text search helps (it catches exact name matches that embeddings miss), but counting questions remain the highest-error subtype within temporal reasoning.
Abstention Is a Feature
We initially tuned for maximum coverage — trying to answer every question. This produced more wrong answers than right ones on edge cases. Switching to explicit abstention on low-confidence questions improved the overall score by approximately 2 points. The system now says "I don't have enough information to answer that" rather than guessing, and the benchmark rewards this honesty (a non-answer scores the same as a wrong answer, but it doesn't degrade user trust in production).
Retrieval Is the Bottleneck, Not Generation
When we analyze our 14 incorrect answers, the vast majority are retrieval failures — the right information was in the database but was not surfaced in the top results. Generation errors (where the right context was retrieved but the model produced a wrong answer) account for fewer than 5 of the 14 misses. This means improving retrieval — adding more paths, better reranking, smarter query expansion — will continue to push the score up.
Full-Text Search Is Underrated
The AI memory space has converged on vector embeddings as the default retrieval method. This is a mistake. Full-text search solves an entire class of queries — proper nouns, acronyms, exact phrases, technical terms — that embeddings handle poorly. The 8-point accuracy gain from wiring full-text search into our pipeline was the largest single improvement in our benchmark history. If you are building a memory system and only have embeddings, add a keyword index. It is the highest-ROI change you can make.
What Is Next
97.2% is the highest published score on LongMemEval that we are aware of. But 14 questions wrong out of 500 means there is meaningful room to improve, particularly in multi-session synthesis and temporal counting. Our current focus areas:
- Exhaustive retrieval mode for counting queries — a separate pipeline that prioritizes recall over precision when the question type is detected as counting
- Improved session-chain resolution for temporal references that depend on implicit ordering ("the one before that")
- Deeper entity graph integration — using graph traversal as a primary retrieval method for multi-session questions, not just a supplementary signal
We publish every result — pass or fail. If you want to see the raw numbers, the methodology, and the comparison against published baselines, the full benchmarks page has everything.
Reproducibility matters. Our benchmark harness runs against the production API — the same endpoint developers call. If you want to verify our numbers or run your own evaluation, the LongMemEval dataset is publicly available from the original paper.
Build with 97.2% recall accuracy
The same memory engine behind these benchmarks, available as an API.
Get started free →