Five LongMemEval questions running against our production API right now. Not a cached number. Not a screenshot. Real API calls, real scores.
This page runs a subset of the LongMemEval benchmark (ICLR 2025) against our live production API.
Each test follows the same protocol: store a fact via /v1/remember, then query it back via /v1/recall.
A question passes if the recalled answer contains the expected key information.
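The store/recall/score loop can be sketched as below. Only the endpoint paths `/v1/remember` and `/v1/recall` come from this page; the request and response field names (`namespace`, `content`, `query`, `answer`), the bearer-token auth, and the injectable `post` helper are illustrative assumptions, not the documented API.

```python
import json
import urllib.request
from typing import Callable

# (path, JSON payload) -> decoded JSON response
JsonPost = Callable[[str, dict], dict]

def http_post_client(base_url: str, api_key: str) -> JsonPost:
    """Build a JSON-over-HTTP POST helper (bearer auth is an assumption)."""
    def post(path: str, payload: dict) -> dict:
        req = urllib.request.Request(
            base_url + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            },
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    return post

def run_question(post: JsonPost, namespace: str,
                 fact: str, question: str, expected: str) -> bool:
    """Store a fact, recall it, and score by exact substring match."""
    post("/v1/remember", {"namespace": namespace, "content": fact})
    recalled = post("/v1/recall", {"namespace": namespace, "query": question})
    # Pass iff the expected key information appears verbatim in the answer.
    return expected in recalled["answer"]
```

Passing the `post` callable in, rather than hard-coding the transport, keeps the scoring logic testable without touching the live API.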
The five questions below are sampled from different LongMemEval categories: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and preference tracking. The full benchmark (500 questions, 97.2% accuracy) is documented on our benchmarks page.
All API calls use a temporary benchmark namespace that is cleaned up afterward. No test data persists in production. Scoring uses exact substring matching (same as the published LongMemEval protocol).