Integrations
Tutorial
April 13, 2026
Ultra-Fast AI Memory with Groq + REM Labs
Groq's LPU inference engine delivers tokens faster than any other provider. But speed without memory means your agent forgets everything between requests. This guide shows how to pair Groq's blazing-fast inference with REM Labs' persistent memory -- so your agent responds in milliseconds and remembers across sessions.
Why Groq + Persistent Memory
Groq is built for speed. Their custom LPU hardware delivers hundreds of tokens per second on models like Llama 3 and Mixtral. Developers choose Groq when latency matters -- real-time chat, voice interfaces, live coding assistants. But the Groq API is stateless, just like every other LLM API. Fast responses without context are just fast guesses.
Adding REM Labs to a Groq agent means you get sub-100ms memory retrieval on top of Groq's sub-second inference. The total round trip -- recall memories, generate response, store new context -- typically stays under 1.5 seconds. That is fast enough for real-time conversational AI that actually remembers.
Step 1: Get Your API Keys
Get a Groq API key from console.groq.com and a REM Labs key from remlabs.ai/console (or run npx @remlabs/memory). Both have free tiers.
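The snippets below hardcode keys for brevity; in practice, read them from the environment instead. A minimal sketch, assuming the variable names `GROQ_API_KEY` and `REM_API_KEY` are this guide's convention (the official Groq SDKs also pick up `GROQ_API_KEY` on their own when no key is passed):

```python
import os

def require_key(name: str) -> str:
    """Read an API key from the environment, failing loudly if it is absent."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set {name} before running the examples below.")
    return value

# Usage in the snippets that follow:
# groq_key = require_key("GROQ_API_KEY")
# rem_key = require_key("REM_API_KEY")
```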
Step 2: Build a Groq Agent with Memory
from groq import Groq
import requests

groq = Groq(api_key="gsk_...")
REM_KEY = "sk-slop-..."
REM_BASE = "https://api.remlabs.ai"

def chat(user_msg, namespace="groq-agent"):
    # Recall relevant memories
    search = requests.post(
        f"{REM_BASE}/v1/memory/search",
        json={"query": user_msg, "namespace": namespace, "limit": 5},
        headers={"Authorization": f"Bearer {REM_KEY}"},
    )
    memories = search.json().get("results", [])
    context = "\n".join(m["value"] for m in memories)

    # Build messages with memory context
    messages = []
    if context:
        messages.append({
            "role": "system",
            "content": f"Relevant context from prior conversations:\n{context}",
        })
    messages.append({"role": "user", "content": user_msg})

    # Fast inference via Groq
    resp = groq.chat.completions.create(
        model="llama-3.1-70b-versatile",
        messages=messages,
    )
    reply = resp.choices[0].message.content

    # Store the new interaction
    requests.post(
        f"{REM_BASE}/v1/memory-set",
        json={
            "key": "groq-chat",
            "value": f"User: {user_msg}\nAssistant: {reply}",
            "namespace": namespace,
            "tags": ["conversation"],
        },
        headers={"Authorization": f"Bearer {REM_KEY}"},
    )
    return reply
# Session 1
print(chat("I'm building a voice assistant for my smart home. I use Home Assistant."))
# Session 2
print(chat("What platform am I using for my smart home?"))
# "You mentioned you use Home Assistant for your smart home setup."
Step 3: Node.js with the Groq SDK
import Groq from "groq-sdk";

const groq = new Groq({ apiKey: "gsk_..." });
const REM_BASE = "https://api.remlabs.ai";
const REM_KEY = "sk-slop-...";

async function chat(userMsg, namespace = "groq-agent") {
  // Recall relevant memories
  const search = await fetch(`${REM_BASE}/v1/memory/search`, {
    method: "POST",
    headers: { "Authorization": `Bearer ${REM_KEY}`, "Content-Type": "application/json" },
    body: JSON.stringify({ query: userMsg, namespace, limit: 5 })
  });
  const { results = [] } = await search.json();
  const context = results.map(r => r.value).join("\n");

  // Fast inference via Groq
  const resp = await groq.chat.completions.create({
    model: "llama-3.1-70b-versatile",
    messages: [
      ...(context ? [{ role: "system", content: `Memories:\n${context}` }] : []),
      { role: "user", content: userMsg }
    ]
  });
  const reply = resp.choices[0].message.content;

  // Store the new interaction
  await fetch(`${REM_BASE}/v1/memory-set`, {
    method: "POST",
    headers: { "Authorization": `Bearer ${REM_KEY}`, "Content-Type": "application/json" },
    body: JSON.stringify({ key: "groq-chat", value: `User: ${userMsg}\nAssistant: ${reply}`, namespace })
  });
  return reply;
}
Latency Breakdown
Here is what the typical round trip looks like with Groq + REM Labs:
- Memory recall (REM Labs search): ~30-50ms
- LLM inference (Groq LPU): ~200-800ms depending on output length
- Memory store (REM Labs write, async): ~20-40ms
The memory store can be fired asynchronously so it does not block the response. Total perceived latency is Groq inference plus memory recall -- typically well under one second for most responses.
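One way to make the store step non-blocking in the Python example is a small thread pool. `store_async` below is a hypothetical helper, not part of any REM Labs SDK; it uses the stdlib `urllib` in place of `requests`, and the injectable `post` parameter exists only so the network call is easy to stub out:

```python
from concurrent.futures import ThreadPoolExecutor
import json
import urllib.request

REM_BASE = "https://api.remlabs.ai"  # base URL assumed from the examples above
REM_KEY = "sk-slop-..."

_executor = ThreadPoolExecutor(max_workers=2)

def _post_memory(payload: dict) -> None:
    """Stdlib POST equivalent to the requests.post store call in Step 2."""
    req = urllib.request.Request(
        f"{REM_BASE}/v1/memory-set",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {REM_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    urllib.request.urlopen(req)

def store_async(key, value, namespace, post=_post_memory):
    """Fire-and-forget memory write: returns a Future immediately,
    so the caller never waits on the ~20-40ms store."""
    return _executor.submit(post, {"key": key, "value": value, "namespace": namespace})
```

In the `chat` function from Step 2, replacing the final `requests.post(...)` with `store_async(...)` returns the reply to the user without waiting on the write; errors can be inspected later via the returned Future if needed.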
Groq uses OpenAI-compatible endpoints: If you are already using the OpenAI SDK pointed at Groq, the same memory pattern from our OpenAI integration guide works with zero changes.
Give your Groq agent a memory
Free tier. No credit card. Fast memory for fast inference.
Get started free →