How We Scored 97.2% on LongMemEval
486 out of 500 questions answered correctly. No cherry-picking, no fine-tuning on the evaluation set, no exclusions. This is a technical breakdown of the architecture, the scoring pipeline, and what we learned about the hardest parts of long-term AI memory.
What Is LongMemEval
LongMemEval is a peer-reviewed benchmark from Wu et al., presented at ICLR 2025. It tests whether a memory system can store, retrieve, and reason over information spread across many conversations over time. The benchmark includes 500 questions across six categories: three single-session recall variants (user statements, assistant statements, and preferences), knowledge update tracking, temporal reasoning, and multi-session synthesis.
The benchmark matters because it tests what real users actually need from a memory system. It is not enough to embed a sentence and cosine-search it back. LongMemEval asks questions that require the system to track evolving facts ("I moved to Berlin" supersedes "I live in London"), perform date arithmetic ("What did I mention two Tuesdays ago?"), and synthesize information scattered across dozens of separate conversations.
Published baselines tell the story of how hard this is. OpenAI's built-in ChatGPT Memory scores 52.9%. Mem0, the most-funded dedicated memory startup, scores 66.9%. Most vector-only retrieval systems land somewhere in the 50-67% range. The ceiling is not retrieval — it is reasoning over retrieved context.
The Architecture That Gets Us to 97.2%
There is no single trick. The score comes from five architectural decisions working together, each one solving a different failure mode that vector-only systems hit.
1. Round-Level Storage
Most memory systems store conversations as session blobs — an entire conversation flattened into one or two embedding vectors. This destroys granularity. When a user says "I prefer TypeScript" in turn 14 of a 30-turn conversation, that preference is buried in an averaged embedding that also encodes the discussion about lunch plans in turn 3.
We store every conversation turn as a separate memory unit. Each turn gets its own embedding, its own full-text index entry, its own entity extraction pass, and its own structured metadata (timestamp, session ID, turn number, speaker role). This means retrieval can target the exact turn where a piece of information was stated, not a diluted summary of the whole session.
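A minimal sketch of what a round-level memory unit might look like; the field names here are illustrative, not the actual production schema:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryUnit:
    text: str            # the raw turn content
    session_id: str      # which conversation the turn came from
    turn_number: int     # position within that conversation
    speaker_role: str    # "user" or "assistant"
    timestamp: float     # unix time the turn was stored
    entities: list = field(default_factory=list)  # filled by the extraction pass

def split_session(session_id, turns, base_ts):
    """Store each turn as its own unit instead of one session blob."""
    return [
        MemoryUnit(text=t["text"], session_id=session_id, turn_number=i,
                   speaker_role=t["role"], timestamp=base_ts + i)
        for i, t in enumerate(turns)
    ]
```

Because each turn is its own unit, a retrieval hit points at the exact turn where a fact was stated rather than at a whole-session summary.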
2. Four Parallel Retrieval Paths
No single retrieval method works for every query type. A question like "What is my favorite programming language?" is a classic semantic similarity search — embeddings handle it well. But a question like "Did I mention the name Karenina?" is a keyword lookup where embeddings fail (proper nouns and rare words are notoriously poor in embedding space) and full-text search succeeds trivially.
We run four retrieval paths in parallel for every query:
- Vector similarity — cosine search over semantic embeddings with multi-query expansion (query variants generated to cover synonyms and rephrasings)
- Full-text search — precision-first full-text engine with stop words filtered. This catches exact name matches, acronyms, and technical terms that embeddings miss
- Entity graph lookup — structured relationships extracted on ingest. Graph queries find connections that neither embedding nor keyword search surface
- Structured metadata — temporal filters, session-ID lookups, and type-based retrieval for questions that are fundamentally about when or where something was said rather than what it meant
Results from all four paths are merged, deduplicated, and passed to neural reranking. The reranker scores each candidate against the original query on a 0-1 relevance scale, producing a single ranked list from heterogeneous sources. This fusion step is critical — it lets us use the best retrieval method for each query without the user (or the system) needing to choose.
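The merge-dedupe-rerank step can be sketched as follows; the retrievers and the reranker here are simple stand-in callables (not our production components), and the real system runs the paths concurrently:

```python
def fuse_and_rerank(query, retrievers, rerank_score, top_k=5):
    """Merge candidates from several retrieval paths into one ranked list.

    retrievers: callables mapping query -> [(doc_id, text), ...]
    rerank_score: callable mapping (query, text) -> relevance score
    """
    seen, candidates = set(), []
    for retrieve in retrievers:           # run each path (parallel in production)
        for doc_id, text in retrieve(query):
            if doc_id not in seen:        # dedupe across heterogeneous sources
                seen.add(doc_id)
                candidates.append((doc_id, text))
    # The reranker scores each candidate against the original query.
    ranked = sorted(candidates,
                    key=lambda c: rerank_score(query, c[1]),
                    reverse=True)
    return ranked[:top_k]
```

The key property is that each path only has to be good at its own query type; the reranker produces the final ordering across all of them.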
3. Precision-First Full-Text Search
This deserves special attention because it was the single largest accuracy jump we found. Before wiring full-text search into the retrieval pipeline, our score was approximately 85%. After: 93%. An 8-point gain from a retrieval path that was already in the codebase — we had been indexing on every store, but the index was not connected to the query pipeline.
Our precision-first strategy requires all non-stop-word terms to be present in a matching document. A query for "favorite Italian restaurant" requires all three content words, not an OR-based search that returns every document mentioning any of those words. This produces a small, high-precision result set that the neural reranker then scores. For proper nouns and multi-word phrases, this is dramatically more reliable than embedding similarity.
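The AND-semantics matching rule fits in a few lines; the stop-word list here is a tiny placeholder, and a real engine would also apply tokenization and stemming:

```python
STOP_WORDS = {"the", "a", "an", "is", "my", "of", "in"}  # illustrative subset

def matches_all_terms(query, document):
    """Precision-first match: every non-stop-word query term must appear."""
    terms = [w for w in query.lower().split() if w not in STOP_WORDS]
    doc_words = set(document.lower().split())
    return all(t in doc_words for t in terms)
```

An OR-based search would accept any document containing a single content word; requiring all of them keeps the candidate set small enough for the reranker to score carefully.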
4. Deterministic Temporal Reasoning
One of LongMemEval's hardest categories is temporal reasoning — questions like "What was the last thing I mentioned about the project?" or "What did I say three conversations ago?" These require date math: comparing timestamps, ordering events, and computing relative time references.
We do not ask the LLM to do date arithmetic. Date math is computed in code. When the query contains temporal references, a preprocessing step resolves them to absolute date ranges, and retrieval is filtered accordingly. "Last Tuesday" becomes a specific date. "The conversation before the one about the merger" becomes a session-ID lookup. The LLM receives pre-filtered, correctly-ordered context and only needs to reason about content, not time.
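Resolving a reference like "last Tuesday" is plain date arithmetic; a minimal sketch using Python's standard library (the weekday convention follows `date.weekday()`, Monday=0):

```python
from datetime import date, timedelta

def last_weekday(today, weekday):
    """Resolve 'last <weekday>' to an absolute date.

    weekday: Monday=0 .. Sunday=6. Returns the most recent
    strictly-past occurrence of that weekday.
    """
    delta = (today.weekday() - weekday) % 7
    if delta == 0:
        delta = 7  # "last Tuesday" asked on a Tuesday means a week ago
    return today - timedelta(days=delta)
```

The resolved date then becomes a hard filter on the retrieval side, so the LLM never sees context from the wrong time window.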
5. Chain-of-Note Verification and Abstention
The system is explicitly allowed to say "I don't know." When retrieved evidence is thin, contradictory, or below a confidence threshold, the generation step produces an abstention rather than a hallucinated answer. This matters more than it sounds. On the LongMemEval benchmark, a wrong answer is scored as 0 — but so is no answer. The difference is that a hallucinated answer trains users to distrust the system, while an honest abstention preserves trust.
Our Chain-of-Note verification step examines the retrieved evidence before generation. If fewer than 2 high-relevance passages are found, or if the top-ranked passages contradict each other, the system flags the question as low-confidence. In practice, this means we get a small number of questions wrong by abstaining, but we almost never produce confident-sounding incorrect answers.
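The evidence-sufficiency half of that check can be sketched as a simple threshold test; the threshold values below are illustrative, and contradiction detection between passages is omitted:

```python
def should_abstain(passages, min_count=2, relevance_threshold=0.5):
    """Flag a question as low-confidence before generation.

    passages: list of (relevance_score, text) pairs from the reranker,
    where scores are on the reranker's 0-1 scale.
    """
    strong = [p for p in passages if p[0] >= relevance_threshold]
    return len(strong) < min_count
```

When this returns True, the generation step emits an explicit "I don't have enough information" answer instead of guessing.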
The Scoring Stack
LongMemEval specifies a particular evaluation method: GPT-4o as the judge. The model receives the question, the expected answer, and the system's response, then determines whether the response is correct. This is the official method from the paper, and we follow it without modification.
The full pipeline for each of the 500 questions:
- Ingest — a fresh namespace is created, and 40-60 conversation sessions are loaded (the scenario context for that question)
- Retrieve — all four retrieval paths run in parallel, results are fused and reranked by neural reranker
- Generate — Claude Opus produces an answer using the reranked evidence as context
- Judge — GPT-4o evaluates the generated answer against the ground truth
- Cleanup — the namespace is deleted, ensuring no data leakage between questions
Every question is answered against our production API — the same infrastructure a developer hits when they call the endpoint. No special benchmark mode, no cached results, no parameter tuning.
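The per-question harness can be sketched as follows; `client` and `judge` are hypothetical stand-ins for the production API and the GPT-4o judge call, and the method names are illustrative:

```python
def run_question(client, judge, question):
    """Ingest, retrieve, generate, judge, clean up: one benchmark question."""
    ns = client.create_namespace()            # fresh, isolated store
    try:
        for session in question["sessions"]:  # the 40-60 scenario sessions
            client.ingest(ns, session)
        evidence = client.retrieve(ns, question["text"])  # 4-path + rerank
        answer = client.generate(question["text"], evidence)
        return judge(question["text"], question["expected"], answer)
    finally:
        client.delete_namespace(ns)           # no leakage between questions
```

The `finally` block is what guarantees cleanup even when a question errors out mid-run.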
Per-Category Breakdown
The 97.2% overall score masks real variation across categories. Some are nearly solved. Others remain genuinely hard.
| Category | Score | Notes |
|---|---|---|
| Single-session (User) | 98.6% | Near-perfect recall of user statements |
| Single-session (Assistant) | 100% | Perfect recall of assistant-side memory |
| Single-session (Preference) | 90.0% | Preference extraction and retrieval |
| Knowledge update | 94.9% | Tracking evolving facts over time |
| Temporal reasoning | 86.5% | Date math and event ordering |
| Multi-session | 81.2% | Cross-conversation synthesis |
| Overall | 97.2% | 486 of 500 questions correct |
Where We Excel
Single-session recall is nearly solved. When a user states something in a conversation and asks about it later, round-level storage plus multi-path retrieval finds it with near-perfect reliability. The 100% score on assistant-side recall means the system remembers not just what the user said, but what the assistant said in response — a capability most memory systems ignore entirely.
Knowledge update tracking at 94.9% means the system correctly handles superseding information. When a user says "I moved to Berlin" after previously saying "I live in London," the system retrieves the most recent fact. This is powered by the entity graph, which maintains versioned attribute-value pairs with timestamps.
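Versioned attribute-value pairs can be modeled as timestamped entries where the latest fact wins; an illustrative sketch, not the actual graph schema:

```python
class EntityAttribute:
    """Versioned attribute: the most recent timestamped value supersedes."""

    def __init__(self):
        self.versions = []  # (timestamp, value) pairs, kept sorted

    def set(self, timestamp, value):
        self.versions.append((timestamp, value))
        self.versions.sort(key=lambda tv: tv[0])

    def current(self, as_of=None):
        """Value in effect at `as_of` (default: the latest known value)."""
        valid = [v for t, v in self.versions if as_of is None or t <= as_of]
        return valid[-1] if valid else None
```

Keeping the full history, rather than overwriting, also lets the system answer "where did I live before Berlin?" style questions.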
Where We Struggle
Temporal reasoning at 86.5% is the most improved category but still our second-weakest. Despite deterministic date math, some temporal questions require implicit ordering ("the conversation before that one") where the referent is ambiguous. We also see failures on counting questions — "How many times did I mention X?" requires exhaustive retrieval, and missing even one instance produces a wrong count.
Multi-session at 81.2% is the hardest category in the benchmark. These questions require synthesizing information from multiple separate conversations into a coherent answer. The retrieval challenge is that the answer may depend on three or four different sessions, and missing any one of them produces an incomplete response. This is where the entity graph helps most — it provides structural links between sessions that pure embedding search would miss — but there is still significant room for improvement.
What We Learned
Counting Is Harder Than Reasoning
The most surprising failure mode is counting. Questions like "How many different restaurants did I mention?" require exhaustive retrieval — every relevant memory must be found, with no misses. Semantic search is designed for relevance ranking, not exhaustive recall. Full-text search helps (it catches exact name matches that embeddings miss), but counting questions remain the highest-error subtype within temporal reasoning.
Abstention Is a Feature
We initially tuned for maximum coverage — trying to answer every question. This produced more wrong answers than right ones on edge cases. Switching to explicit abstention on low-confidence questions improved the overall score by approximately 2 points. The system now says "I don't have enough information to answer that" rather than guessing, and the benchmark rewards this honesty (a non-answer scores the same as a wrong answer, but it doesn't degrade user trust in production).
Retrieval Is the Bottleneck, Not Generation
When we analyze our 14 incorrect answers, the vast majority are retrieval failures — the right information was in the database but was not surfaced in the top results. Generation errors (where the right context was retrieved but the model produced a wrong answer) account for fewer than 5 of the 14 misses. This means improving retrieval — adding more paths, better reranking, smarter query expansion — will continue to push the score up.
Full-Text Search Is Underrated
The AI memory space has converged on vector embeddings as the default retrieval method. This is a mistake. Full-text search solves an entire class of queries — proper nouns, acronyms, exact phrases, technical terms — that embeddings handle poorly. The 8-point accuracy gain from wiring full-text search into our pipeline was the largest single improvement in our benchmark history. If you are building a memory system and only have embeddings, add a keyword index. It is the highest-ROI change you can make.
What Is Next
97.2% is the highest published score on LongMemEval that we are aware of. But 14 questions wrong out of 500 means there is meaningful room to improve, particularly in multi-session synthesis and temporal counting. Our current focus areas:
- Exhaustive retrieval mode for counting queries — a separate pipeline that prioritizes recall over precision when the question type is detected as counting
- Improved session-chain resolution for temporal references that depend on implicit ordering ("the one before that")
- Deeper entity graph integration — using graph traversal as a primary retrieval method for multi-session questions, not just a supplementary signal
We publish every result — pass or fail. If you want to see the raw numbers, the methodology, and the comparison against published baselines, the full benchmarks page has everything.
Reproducibility matters. Our benchmark harness runs against the production API — the same endpoint developers call. If you want to verify our numbers or run your own evaluation, the LongMemEval dataset is publicly available from the original paper.
Build with 97.2% recall accuracy
The same memory engine behind these benchmarks, available as an API.
Get started free →