Agent memory is an attack surface

First, the scope: this is about the memory an AI agent keeps (the store an LLM writes facts to and reads back on a later turn), not memory-safety, and not RAM. That market is optimising the wrong thing. It is a recall contest (LongMemEval, LoCoMo, BEAM), measuring how much an LLM can remember from a conversation. For an agent that acts on the world through tools, that is the second question. The first is: what does memory look like when you assume it will be poisoned. Get that wrong and your recall score is the speed at which an attacker's text becomes a remembered fact.

Collapse "what is true" and "what was fed in" into one store and memory-injection is a feature. Keep them separate and it's a control.

The benchmarks are a product decision, not a measurement

First, clear the table. The eye-catching numbers (91 to 96% on LongMemEval/LoCoMo) are vendor self-reported, on the vendor's own harness, with the vendor's own LLM actor and prompts. Every vendor page ranks its own tool first. The one figure with independent provenance is Zep 63.8% vs Mem0 49.0% on temporal retrieval (arXiv 2501.13956), and the signal there is the level: when a neutral harness runs these systems, scores drop 20 to 30 points below the blog numbers. A confidently ranked memory leaderboard is marketing. Use the architectures; ignore the scoreboard.

A field guide: what each memory system actually does

Strip the marketing and the field is four families: managed APIs (Mem0, Supermemory), temporal knowledge graphs (Zep / Graphiti), OS-style self-editing (Letta / MemGPT), and local-first / sovereign (Cognee, Memori, MemPalace, OMEGA). Below, what each does in essence (the actual mechanism, kept as a reference) and the one idea from each worth taking. (Framework building blocks, such as LangMem, LlamaIndex Memory, and Kernel Memory, are toolkits, not products; Pinecone/Qdrant/Milvus are storage backends. Half the "memory systems" in circulation are one of those plus a thin extraction layer.)

System	In essence (the mechanism)	Worth stealing
Mem0	An LLM reads each exchange, extracts discrete facts, and dedupes/merges them into a flat store: memory as distilled facts, not transcript. Drop-in API.	Consolidation: dedupe + merge on write, so the store doesn't bloat with near-duplicate facts.
Zep / Graphiti	Facts live in a graph of entities and relations; every edge is stamped with valid-time (true in the world) and transaction-time (when the agent learned it). Costs a Neo4j dependency.	Bi-temporal facts: "what did we believe X was, and when", a different shape than a flat truth table.
Letta / MemGPT	Two tiers (core memory in-context, archival out of it) and the agent pages between them with explicit read/write tools. Context is RAM, archive is disk.	Mutation as tool calls: every write is named, audited, gateable, not a side effect of chatting.
Honcho	Every participant (user, agent, sub-agent) is a "peer" with its own scoped representation, instead of fixed `user_id/session_id/agent_id`. Plus "Dreaming": background reasoning over stored memory.	Peers over roles. (But reject Dreaming as shipped: see below.)
Hindsight	Memory typed by origin ("what is true" vs "what I observed" vs "what I was told"), kept structurally distinct.	Fact / belief / input separation. The single highest-value idea in the field.
Cognee	An extract→cognify→load pipeline: a structuring pass lifts raw data into a knowledge graph + vectors. Local.	A "cognify" pass that promotes raw rows into typed entities/relations.
Memori	Memory in plain SQL with two ingestion modes, "conscious" (a curated working set) and "auto" (everything), promoting the important items.	Dual ingestion: a curated working set kept distinct from the firehose.
OMEGA / MemPalace	SQLite + local embeddings + MCP tools, encrypted at rest, no external services.	Sovereignty as architecture, not a checkbox.

The one to reject: unsupervised background reasoning over un-vetted content (Honcho's "Dreaming," Mem0's auto-extraction in raw form). Memory written without provenance and reasoned over without a gate turns one poisoned tool result into a memory-wide blast radius, asynchronously. Async reflection is fine, but only over provenance-verified content, writing through the same gated tools.

Where Zoya stands now

Zoya is not shopping for a memory layer. Under the hood it isn't one store but several distinct kinds of memory, and the honest cut is by what's actually read at retrieval, not what's merely written:

Memory	What it is	Status
Enriched `memories`	Distilled, importance-scored facts and exchanges; keyword (FTS5/BM25) recall, trust-filtered, and on the live path reranked by BM25 + importance + recency.	Live, was dark until the bm25 fix below.
`conversation_log`	Full, untruncated exchanges, FTS5-searchable.	Live
`message_ledger`	Per-message log with direction, delivery status and platform id; powers gap / timeline surfacing on resume.	Live
Typed registries	Structured records over a generic registry table: contacts, events, decisions, places, reading list, and goals (active goals injected into context each turn).	Live
Semantic (local ternary embeddings)	Char-trigram vector similarity, computed locally, no embedding API.	Tool-only: written every store, read only via `memory_search`; a measured-weak signal, not yet in the live path (Phase 2 step 2).
Reinforcement (`access_count`)	Retrieved rows are bumped so frequently-used memory survives decay.	Live, wired in Phase 0.

And what's not there by design: Zoya carried five inert "psychological layers" that wrote rows nothing ever read. A wiring review showed each useful form already lived in a better mechanism, so they were dropped. Honesty about what runs is the same discipline as provenance: don't ship a layer you can't point at.

Against the five patterns from the field guide:

Pattern	Status in Zoya
Explicit, audited, gateable mutation tools	Have it: `memory_store`/`*_update` are tools; each call is source-classified into the audit log and can be put behind the approval gate.
Local-first sovereign deployment	Is it: SQLite + local embeddings, self-hosted pod, no external memory service.
Peers, not fixed roles	Partial: sender / channel / topic + per-sender and group profiles + sub-agents, but roles are hardcoded, not generalised.
Fact / belief / input separation	Have it now: `memories` carries a `trust` class (0 quarantined / 1 normal / 2 trusted) + a `source` label, derived at ingest from group / suspicious-input / live taint / admin-channel signals. Auto-injection retrieves `min_trust=1`, so untrusted content is kept out of the prompt (still visible to the search tool). Belief vs. raw input is a column, not a convention. (Shipped after this post, see below.)
Bi-temporal facts	Gap: rows carry `created_at`/`updated_at` only. Transaction-time, no valid-time.

The part the memory market doesn't have at all: a threat model. Zoya already ships a 16-pattern input sanitizer, taint tracking with sink gating, an audit log, output-leak detection, and HITL approval gates. Mem0, Zep, Letta, Honcho treat memory as a recall problem and ship none of this. Zoya's edge is not better recall: it is that every memory mutation is already an inspectable, classifiable, gateable event.

The move, made

Update, same day: the column is written. What was "one column away" is now in the schema and in the hot path, and getting there proved the rest of this post better than the argument did.

The strongest idea in the whole survey, separate believed-fact from observed-input, or injection-through-memory works, needed the source taint that Zoya already computes (classifySource) to land on the memory row instead of only the audit log. It does now:

memories gained a trust class (0 quarantined / 1 normal / 2 trusted) and a source label, derived at ingest from the signals already in hand: group chat, a flagged-suspicious input, live session taint (external_net / shell_output / file_data), or an admin channel.
Untrusted content is still stored (it stays retrievable through the explicit memory_search tool for inspection) but it is quarantined from auto-injection: the per-turn context builder queries with min_trust=1, so a web result or a stranger's message can no longer ride into the prompt as a remembered fact.
The gate is a test, not a vibe: store a poisoned "fact" from an untrusted source, then assert it is excluded from auto-injection yet still retrievable on demand.

And here is the part that earns the title. Wiring the trust predicate meant running the retrieval path for real, and that path was silently returning nothing. The live query joins the FTS index, ranks by bm25(), and groups by row to gather tags; SQLite forbids bm25() in a GROUP BY query and rejects it at execution time, so the result loop simply never ran.

The enriched-memory tier (the one both auto-injection and the search tool read) had been handing back an empty list. It was masked because the conversation-log and ledger tiers (no GROUP BY) still answered, so recall looked fine. The recall everyone benchmarks was dark, and nobody noticed, because nothing was measuring whether it was on.

That is the whole thesis in one bug: a confident recall score tells you nothing about whether your memory is correct, trusted, or even running. Provenance and a verification test do. (The fix: compute bm25() in a flatten-proof subquery, then aggregate.)

Bi-temporal valid-time is the next move, and only if evolving-fact tracking (SOAR-style "when did we know this indicator was malicious") becomes a real workload. Provenance came first, and it came with a reminder to verify the plumbing before admiring the architecture.

What we took, what's next to steal

Mapping the field guide onto the build, the point of the survey is what crosses into the code:

Taken. Fact/belief/input separation (Hindsight) → the trust/source columns + quarantine. Mutation-as-tool-calls (Letta) → memory_store/*_update, audited and gateable. Sovereignty (OMEGA/MemPalace) → SQLite + local embeddings, self-hosted. A BM25 + importance + recency rerank on the live path: standard IR, measured into place. And consolidation (Mem0), the safe half: ingest hygiene drops [SILENT]/empty exchanges before they become memories, and exact-content dedup refreshes a repeated exchange instead of storing it twice, noise cut at the source, no risk of conflating two facts.
Next to steal. The risky half of consolidation, near-duplicate (semantic) merge, is held until the eval can prove it won't fuse two distinct facts ("my number is …0100" vs "…0199"). Bi-temporal facts (Graphiti) and peers over roles (Honcho) are wanted but gated: build them when evolving-fact tracking, or sub-agent fan-out, becomes a real workload rather than a speculative one.
Rejected. Ungated background reasoning (Dreaming, raw auto-extraction). Reasoning over un-vetted memory is the injection vector; it doesn't become safe by being asynchronous.

And the discipline underneath all of it: every move is made against a small Zoya-specific recall eval (zig build eval) that runs the real retrieval code over labelled cases. The rerank is "done" because the number moved (0.58 to 0.75) with keyword recall and the trust quarantine held flat, not because it felt better. Steal mechanisms, not benchmarks; then prove the steal on your own workload.

The opening

The consumer memory market is solving recall on conversational corpora. That is not the hard problem for an agent that touches the world. Memory as a security-operations primitive (provenance, audit, bi-temporal facts, HITL-gated mutation, fact/belief separation) is the gap nobody is filling. Zoya is closer to it than any product on the leaderboard, because it started from a threat model instead of a benchmark. The right move is not to pick a winner. It is to keep building: the provenance column is written now; the bi-temporal one is next.

Agent memory is the store an LLM writes facts to and reads back on a later turn, not memory-safety and not RAM. The market optimises recall (LongMemEval, LoCoMo, BEAM). For an agent that acts through tools, recall is the second question. The first: what does memory look like once you assume it will be poisoned?

The benchmarks are marketing

The headline 91 to 96% scores are vendor self-reported, on the vendor's own harness, and each ranks itself first. The one independently run figure, Zep 63.8% vs Mem0 49.0% (arXiv 2501.13956), sits 20 to 30 points below. Use the architectures, ignore the scoreboard.

A confident recall score tells you nothing about whether your memory is correct, trusted, or even running.

What to steal, what to reject

Strip the labels and four families remain: managed APIs, temporal knowledge graphs, OS-style self-editing, local-first / sovereign. Worth taking:

Fact / belief / input separation (Hindsight). Type each memory by origin: true, observed, or told. The highest-value idea in the field.
Mutation as tool calls (Letta). Every write named, audited, gateable.
Bi-temporal facts (Zep / Graphiti). Stamp each fact with valid-time and transaction-time.
Consolidation (Mem0). Safe half: exact dedupe on write. Hold near-duplicate merge until an eval proves it won't fuse two facts.
Sovereignty (OMEGA / MemPalace). SQLite plus local embeddings, no external service.

Reject one thing: ungated reasoning over un-vetted memory (Honcho's "Dreaming", raw auto-extraction). That is the injection vector; async does not make it safe.

Provenance is a column, not a convention

Collapse "what is true" and "what was fed in" into one store and memory-injection is a feature: a stranger's message rides into the prompt as a remembered fact. The consumer market ships no threat model.

Zoya now tags each row with a trust class (0 quarantined / 1 normal / 2 trusted) plus a source label, set at ingest from group chat, flagged input, or live taint. Untrusted content stays searchable, but auto-injection queries min_trust 1, so it cannot ride in. The gate is a test, not a vibe: store a poisoned fact, assert it stays out of the prompt.

Verify the plumbing before admiring the architecture

Wiring the trust predicate meant running retrieval for real, and it returned nothing. The enriched query ranks by bm25() and groups by row; SQLite forbids bm25() in a GROUP BY, so the loop never ran. That tier returned an empty list, masked because other tiers answered. The recall everyone benchmarks was dark, and nothing measured whether it ran.

Every move runs against a Zoya recall eval: the rerank is "done" because the number moved (0.58 to 0.75), not because it felt better. Steal mechanisms, not benchmarks; prove it on your workload.

Sources

Zep / Graphiti temporal-retrieval benchmark, arXiv 2501.13956: the one figure with independent provenance (Zep 63.8% vs Mem0 49.0%), and the level neutral harnesses report once vendor framing is removed.
Mem0, Supermemory: managed memory APIs; the source of consolidation (dedupe and merge on write) worth taking in its safe form.
Letta / MemGPT: OS-style paging between in-context and archival memory through explicit read/write tools; the model for mutation as audited tool calls.
Hindsight: memory typed by origin (true / observed / told); the fact/belief/input separation that turns memory-injection from a feature into a control.
Honcho: peers over fixed roles, and "Dreaming" background reasoning, kept here as the pattern to reject until the content it reasons over is provenance-verified.