Five LongMemEval questions running against our production API right now. Not a cached number. Not a screenshot. Real API calls, real scores.
This page runs a subset of the LongMemEval benchmark (ICLR 2025) against our live production API.
Each test follows the same protocol: store a fact via /v1/remember, then query it back via /v1/recall.
A question passes if the recalled answer contains the expected key information.
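The store/recall/score loop can be sketched as below. Only the endpoint paths `/v1/remember` and `/v1/recall` come from this page; the request and response field names (`namespace`, `content`, `query`, `answer`), the bearer-token auth, and the injectable `post` helper are illustrative assumptions, not the documented API.

```python
import json
import urllib.request
from typing import Callable

# (path, JSON payload) -> decoded JSON response
JsonPost = Callable[[str, dict], dict]

def http_post_client(base_url: str, api_key: str) -> JsonPost:
    """Build a JSON-over-HTTP POST helper (bearer auth is an assumption)."""
    def post(path: str, payload: dict) -> dict:
        req = urllib.request.Request(
            base_url + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            },
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    return post

def run_question(post: JsonPost, namespace: str,
                 fact: str, question: str, expected: str) -> bool:
    """Store a fact, recall it, and score by exact substring match."""
    post("/v1/remember", {"namespace": namespace, "content": fact})
    recalled = post("/v1/recall", {"namespace": namespace, "query": question})
    # Pass iff the expected key information appears verbatim in the answer.
    return expected in recalled["answer"]
```

Passing the `post` callable in, rather than hard-coding the transport, keeps the scoring logic testable without touching the live API.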
The five questions below are sampled from different LongMemEval categories: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and preference tracking. The full benchmark (500 questions, 97.2% accuracy) is documented on our benchmarks page.
All API calls use a temporary benchmark namespace that is cleaned up afterward. No test data persists in production. Scoring uses exact substring matching (same as the published LongMemEval protocol).