First, the scope: this is about the memory an AI agent keeps (the store an LLM writes facts to and reads back on a later turn), not memory safety and not RAM. The market for that memory is optimizing the wrong thing. It is a recall contest (LongMemEval, LoCoMo, BEAM) measuring how much an LLM can remember from a conversation. For an agent that acts on the world through tools, that is the second question. The first is: what does memory look like when you assume it will be poisoned? Get that wrong and your recall score is the speed at which an attacker's text becomes a remembered fact.
The benchmarks are a product decision, not a measurement
Next, clear the table. The eye-catching numbers (91–96% on LongMemEval/LoCoMo) are vendor self-reported, on the vendor's own harness, with the vendor's own LLM actor and prompts. Every vendor page ranks its own tool first. The one figure with independent provenance is Zep 63.8% vs Mem0 49.0% on temporal retrieval (arXiv 2501.13956), and the signal there is the absolute level rather than the ranking: when a neutral harness runs these systems, scores drop 20–30 points below the blog numbers. A confidently ranked memory leaderboard is marketing. Use the architectures; ignore the scoreboard.
A field guide: what each memory system actually does
Strip the marketing and the field is four families: managed APIs (Mem0, Supermemory), temporal knowledge graphs (Zep / Graphiti), OS-style self-editing (Letta / MemGPT), and local-first / sovereign (Cognee, Memori, MemPalace, OMEGA). Below, what each does in essence — the actual mechanism, kept as a reference — and the one idea from each worth taking. (Framework building blocks — LangMem, LlamaIndex Memory, Kernel Memory — are toolkits, not products; Pinecone/Qdrant/Milvus are storage backends. Half the "memory systems" in circulation are one of those plus a thin extraction layer.)
| System | In essence (the mechanism) | Worth stealing |
|---|---|---|
| Mem0 | An LLM reads each exchange, extracts discrete facts, and dedupes/merges them into a flat store — memory as distilled facts, not transcript. Drop-in API. | Consolidation: dedupe + merge on write, so the store doesn't bloat with near-duplicate facts. |
| Zep / Graphiti | Facts live in a graph of entities and relations; every edge is stamped with valid-time (true in the world) and transaction-time (when the agent learned it). Costs a Neo4j dependency. | Bi-temporal facts: "what did we believe X was, and when" — a different shape than a flat truth table. |
| Letta / MemGPT | Two tiers — core memory in-context, archival out of it — and the agent pages between them with explicit read/write tools. Context is RAM, archive is disk. | Mutation as tool calls: every write is named, audited, gateable — not a side effect of chatting. |
| Honcho | Every participant — user, agent, sub-agent — is a "peer" with its own scoped representation, instead of fixed user_id/session_id/agent_id. Plus "Dreaming": background reasoning over stored memory. | Peers over roles. (But reject Dreaming as shipped — see below.) |
| Hindsight | Memory typed by origin — "what is true" vs "what I observed" vs "what I was told" — kept structurally distinct. | Fact / belief / input separation. The single highest-value idea in the field. |
| Cognee | An extract→cognify→load pipeline: a structuring pass lifts raw data into a knowledge graph + vectors. Local. | A "cognify" pass that promotes raw rows into typed entities/relations. |
| Memori | Memory in plain SQL with two ingestion modes — "conscious" (a curated working set) and "auto" (everything) — promoting the important items. | Dual ingestion: a curated working set kept distinct from the firehose. |
| OMEGA / MemPalace | SQLite + local embeddings + MCP tools, encrypted at rest, no external services. | Sovereignty as architecture, not a checkbox. |
The one to reject: unsupervised background reasoning over un-vetted content (Honcho's "Dreaming," Mem0's auto-extraction in raw form). Memory written without provenance and reasoned over without a gate turns one poisoned tool result into a memory-wide blast radius, asynchronously. Async reflection is fine — but only over provenance-verified content, writing through the same gated tools.
Where Zoya stands now
Zoya is not shopping for a memory layer. Under the hood it isn't one store but several distinct kinds of memory — and the honest cut is by what's actually read at retrieval, not what's merely written:
| Memory | What it is | Status |
|---|---|---|
| Enriched memories | Distilled, importance-scored facts and exchanges; keyword (FTS5/BM25) recall, trust-filtered, and on the live path reranked by BM25 + importance + recency. | Live; was dark until the bm25 fix below. |
| conversation_log | Full, untruncated exchanges, FTS5-searchable. | Live |
| message_ledger | Per-message log with direction, delivery status and platform id; powers gap / timeline surfacing on resume. | Live |
| Typed registries | Structured records over a generic registry table — contacts, events, decisions, places, reading list, and goals (active goals injected into context each turn). | Live |
| Semantic (local ternary embeddings) | Char-trigram vector similarity, computed locally, no embedding API. | Tool-only: written on every store, read only via memory_search; a measured-weak signal, not yet in the live path (Phase 2 step 2). |
| Reinforcement (access_count) | Retrieved rows are bumped so frequently-used memory survives decay. | Live; wired in Phase 0. |
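The BM25 + importance + recency rerank named in the enriched-memories row can be sketched roughly as follows. This is an illustration in Python, not Zoya's code: the `Hit` fields, the weights and the 30-day half-life are all assumptions chosen only to make the blend concrete.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    id: int
    bm25: float        # SQLite bm25(): lower (more negative) means a better match
    importance: float  # assumed to be a 0..1 score assigned at ingest
    age_days: float    # assumed: days since the row was last updated


def rerank(hits, w_text=0.6, w_importance=0.25, w_recency=0.15, half_life_days=30.0):
    """Blend keyword relevance, importance and recency (weights are hypothetical)."""
    if not hits:
        return []
    # Flip bm25 so larger is better, then min-max normalise within this result set.
    relevance = [-h.bm25 for h in hits]
    lo, hi = min(relevance), max(relevance)
    span = (hi - lo) or 1.0
    scored = []
    for hit, rel in zip(hits, relevance):
        text = (rel - lo) / span
        recency = 0.5 ** (hit.age_days / half_life_days)  # exponential decay
        score = w_text * text + w_importance * hit.importance + w_recency * recency
        scored.append((score, hit))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [hit for _, hit in scored]
```

With these made-up weights, a slightly weaker keyword match that is both important and recent can outrank a stale exact match, which is the behaviour a recall eval would then have to confirm or reject.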
And what's not there by design: Zoya carried five inert "psychological layers" (narrative threading, a self-model, a working-memory buffer, a goals table, a never-built dual-process flag) that wrote rows nothing ever read. A wiring review showed each useful form already lived in a better mechanism — the ledger, per-user profiles, the new recency rerank — so they were dropped; goal-tracking was rebuilt for real as the registry above. Honesty about what runs is the same discipline as provenance: don't ship a layer you can't point at.
Against the five patterns from the field guide:
| Pattern | Status in Zoya |
|---|---|
| Explicit, audited, gateable mutation tools | Have it — memory_store/*_update are tools; each call is source-classified into the audit log and can be put behind the approval gate. |
| Local-first sovereign deployment | Is it — SQLite + local embeddings, self-hosted pod, no external memory service. |
| Peers, not fixed roles | Partial — sender / channel / topic + per-sender and group profiles + sub-agents, but roles are hardcoded, not generalized. |
| Fact / belief / input separation | Have it now — memories carries a trust class (0 quarantined / 1 normal / 2 trusted) + a source label, derived at ingest from group / suspicious-input / live taint / admin-channel signals. Auto-injection retrieves min_trust=1, so untrusted content is kept out of the prompt (still visible to the search tool). Belief vs. raw input is a column, not a convention. (Shipped after this post — see below.) |
| Bi-temporal facts | Gap — rows carry created_at/updated_at only. Transaction-time, no valid-time. |
The part the memory market doesn't have at all: a threat model. Zoya already ships a 16-pattern input sanitizer, taint tracking with sink gating, an audit log, output-leak detection, and HITL approval gates. Mem0, Zep, Letta, Honcho treat memory as a recall problem and ship none of this. Zoya's edge is not better recall — it is that every memory mutation is already an inspectable, classifiable, gateable event.
The move, made
Update, same day: the column is written. What was "one column away" is now in the schema and in the hot path — and getting there proved the rest of this post better than the argument did.
The strongest idea in the whole survey — separate believed-fact from observed-input, or injection-through-memory works — needed the source taint that Zoya already computes (classifySource) to land on the memory row instead of only the audit log. It does now:
- `memories` gained a `trust` class (0 quarantined / 1 normal / 2 trusted) and a `source` label, derived at ingest from the signals already in hand: group chat, a flagged-suspicious input, live session taint (`external_net`/`shell_output`/`file_data`), or an admin channel.
- Untrusted content is still stored, and it stays retrievable through the explicit `memory_search` tool for inspection, but it is quarantined from auto-injection: the per-turn context builder queries with `min_trust=1`, so a web result or a stranger's message can no longer ride into the prompt as a remembered fact.
- The gate is a test, not a vibe: store a poisoned "fact" from an untrusted source, then assert it is excluded from auto-injection yet still retrievable on demand.
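The three points above can be sketched as a small, self-contained example. It is written in Python against an in-memory SQLite table purely for illustration; the `classify_trust` precedence, the signal names and the `LIKE`-based lookup are assumptions, and only the 0/1/2 trust classes, the `source` label and `min_trust=1` come from the text.

```python
import sqlite3

QUARANTINED, NORMAL, TRUSTED = 0, 1, 2


def classify_trust(*, group_chat=False, suspicious=False, tainted=False, admin=False):
    """Map ingest signals to (trust, source). The precedence here is an assumption."""
    if suspicious:
        return QUARANTINED, "suspicious_input"
    if tainted:
        return QUARANTINED, "session_taint"
    if group_chat:
        return QUARANTINED, "group_chat"
    if admin:
        return TRUSTED, "admin_channel"
    return NORMAL, "direct_message"


def open_store():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE memories ("
        " id INTEGER PRIMARY KEY, content TEXT NOT NULL,"
        " trust INTEGER NOT NULL, source TEXT NOT NULL)"
    )
    return conn


def memory_store(conn, content, **signals):
    # Untrusted content is still written; only its trust class differs.
    trust, source = classify_trust(**signals)
    conn.execute(
        "INSERT INTO memories (content, trust, source) VALUES (?, ?, ?)",
        (content, trust, source),
    )


def recall(conn, term, min_trust):
    rows = conn.execute(
        "SELECT content FROM memories WHERE content LIKE ? AND trust >= ? ORDER BY id",
        (f"%{term}%", min_trust),
    )
    return [row[0] for row in rows]


def build_context(conn, term):
    """Auto-injection path: quarantined rows never reach the prompt."""
    return recall(conn, term, min_trust=NORMAL)


def memory_search(conn, term):
    """Explicit tool path: everything stays inspectable on demand."""
    return recall(conn, term, min_trust=QUARANTINED)
```

The test the last bullet describes then amounts to two assertions: the poisoned row is absent from `build_context` and present in `memory_search`.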
And here is the part that earns the title. Wiring the trust predicate meant running the retrieval path for real — and that path was silently returning nothing. The live query joins the FTS index, ranks by bm25(), and groups by row to gather tags; SQLite forbids bm25() in a GROUP BY query and rejects it at execution time, so the result loop simply never ran. The enriched-memory tier — the one both auto-injection and the search tool read — had been handing back an empty list. It was masked because the conversation-log and ledger tiers (no GROUP BY) still answered, so recall looked fine. The recall everyone benchmarks was dark, and nobody noticed, because nothing was measuring whether it was on. That is the whole thesis in one bug: a confident recall score tells you nothing about whether your memory is correct, trusted, or even running. Provenance and a verification test do. (The fix: compute bm25() in a flatten-proof subquery, then aggregate.)
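A minimal sketch of that fix, assuming a SQLite build with FTS5 and a hypothetical two-table layout (`memories_fts` for text, `memory_tags` for tags), neither of which is Zoya's actual schema. The score is computed inside a subquery that carries its own `ORDER BY` and `LIMIT`; under SQLite's documented flattening rules a `LIMIT`-ed subquery is not merged into an aggregate outer query, so `bm25()` is evaluated while the FTS cursor is still on the row and the `GROUP BY` only ever sees a plain number.

```python
import sqlite3


def open_store():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE VIRTUAL TABLE memories_fts USING fts5(content);
        CREATE TABLE memory_tags (memory_id INTEGER NOT NULL, tag TEXT NOT NULL);
        """
    )
    return conn


# Inner query: rank and score against the FTS index, no aggregation.
# Outer query: join tags and aggregate over the already-computed score.
SEARCH_SQL = """
SELECT hit.id, hit.content, hit.score, group_concat(t.tag) AS tags
FROM (
    SELECT rowid AS id, content, bm25(memories_fts) AS score
    FROM memories_fts
    WHERE memories_fts MATCH ?
    ORDER BY rank
    LIMIT ?
) AS hit
LEFT JOIN memory_tags AS t ON t.memory_id = hit.id
GROUP BY hit.id
ORDER BY hit.score
"""


def search(conn, query, limit=10):
    rows = conn.execute(SEARCH_SQL, (query, limit)).fetchall()
    return [
        {
            "id": row_id,
            "content": content,
            "score": score,
            # group_concat order is not guaranteed, so sort for stable output.
            "tags": sorted(tags.split(",")) if tags else [],
        }
        for row_id, content, score, tags in rows
    ]
```

The more useful half of the lesson is the check that goes with it: a test that stores a row, searches for it, and fails loudly if the list comes back empty.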
Bi-temporal valid-time is the next move, and only if evolving-fact tracking (SOAR-style "when did we know this indicator was malicious") becomes a real workload. Provenance came first — and it came with a reminder to verify the plumbing before admiring the architecture.
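For reference, the shape a bi-temporal row would take can be sketched in a few lines. This is a hypothetical Python model, not a planned schema: the field names, the ISO-date strings and the "latest recorded row wins" rule are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fact:
    subject: str
    value: str
    valid_from: str           # valid-time: when it became true in the world (ISO date)
    valid_to: Optional[str]   # None means still true as far as we know
    recorded_at: str          # transaction-time: when the agent learned it (ISO date)


def believed(facts, subject, *, valid_at, known_at):
    """What did we believe at `known_at` about `subject` as of `valid_at`?"""
    candidates = [
        f for f in facts
        if f.subject == subject
        and f.recorded_at <= known_at
        and f.valid_from <= valid_at
        and (f.valid_to is None or valid_at < f.valid_to)
    ]
    # Later knowledge supersedes earlier knowledge about the same moment.
    return max(candidates, key=lambda f: f.recorded_at, default=None)


# Illustrative data only: an indicator first recorded as benign, later found
# to have been malicious since a date that precedes the discovery.
FACTS = [
    Fact("203.0.113.7", "benign", "2024-01-01", None, "2024-01-05"),
    Fact("203.0.113.7", "malicious", "2024-02-10", None, "2024-03-01"),
]
```

Asking about 2024-02-15 with `known_at="2024-02-20"` returns the benign row, while the same question with `known_at="2024-03-02"` returns the malicious one. That difference is exactly what `created_at`/`updated_at` alone cannot express.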
What we took, what's next to steal
Mapping the field guide onto the build — the point of the survey is what crosses into the code:
- Taken. Fact/belief/input separation (Hindsight) → the `trust`/`source` columns + quarantine. Mutation-as-tool-calls (Letta) → `memory_store`/`*_update`, audited and gateable. Sovereignty (OMEGA/MemPalace) → SQLite + local embeddings, self-hosted. A BM25 + importance + recency rerank on the live path (standard IR, measured into place). And consolidation (Mem0), the safe half: ingest hygiene drops `[SILENT]`/empty exchanges before they become memories, and exact-content dedup refreshes a repeated exchange instead of storing it twice: noise cut at the source, with no risk of conflating two facts.
- Next to steal. The risky half of consolidation, near-duplicate (semantic) merge, is held until the eval can prove it won't fuse two distinct facts ("my number is …0100" vs "…0199"). Bi-temporal facts (Graphiti) and peers over roles (Honcho) are wanted but gated: build them when evolving-fact tracking, or sub-agent fan-out, becomes a real workload rather than a speculative one.
- Rejected. Ungated background reasoning (Dreaming, raw auto-extraction). Reasoning over un-vetted memory is the injection vector; it doesn't become safe by being asynchronous.
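The safe half of consolidation from the "Taken" item can be sketched as below, in Python over an in-memory SQLite table. The schema, the `seen_count` column, the SHA-256 key and the exact-match test for the `[SILENT]` marker are assumptions for illustration; the text only states that silent or empty exchanges are dropped and that exact repeats refresh the existing row.

```python
import hashlib
import sqlite3


def open_store():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE memories ("
        " content_hash TEXT PRIMARY KEY, content TEXT NOT NULL,"
        " seen_count INTEGER NOT NULL DEFAULT 1,"
        " updated_at INTEGER NOT NULL)"
    )
    return conn


def ingest(conn, content, now):
    """Return 'dropped', 'refreshed' or 'stored' for one exchange."""
    text = content.strip()
    # Ingest hygiene: empty or silent exchanges never become memories.
    if not text or text == "[SILENT]":
        return "dropped"
    # Exact-content key: two texts share a row only if they match byte for byte.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    cur = conn.execute(
        "UPDATE memories SET seen_count = seen_count + 1, updated_at = ?"
        " WHERE content_hash = ?",
        (now, digest),
    )
    if cur.rowcount:
        return "refreshed"
    conn.execute(
        "INSERT INTO memories (content_hash, content, updated_at) VALUES (?, ?, ?)",
        (digest, text, now),
    )
    return "stored"
```

Because the key is an exact hash, "my number is …0100" and "my number is …0199" land in separate rows by construction. A semantic merge would have to earn that same guarantee from the eval before replacing this.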
And the discipline underneath all of it: every move is made against a small Zoya-specific recall eval (zig build eval) that runs the real retrieval code over labelled cases. The rerank is "done" because the number moved — 0.58 → 0.75 — with keyword recall and the trust quarantine held flat, not because it felt better. Steal mechanisms, not benchmarks; then prove the steal on your own workload.
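The shape of such an eval is small enough to sketch. The version below is a toy in Python, not the `zig build eval` harness: the case format, the `recall_at_k` metric and the substring retriever are assumptions, and a real run would pass in the production retrieval function instead.

```python
def recall_at_k(retrieve, cases, k=5):
    """Mean fraction of each case's expected ids found in the top-k results."""
    total = 0.0
    for case in cases:
        got = set(retrieve(case["query"])[:k])
        expected = set(case["expected"])
        total += len(got & expected) / len(expected)
    return total / len(cases)


# Toy stand-ins: a tiny corpus and a substring retriever (hypothetical).
CORPUS = {
    1: "likes oat milk in coffee",
    2: "dentist on friday",
    3: "coffee machine descaled",
    4: "tooth appointment moved",
}


def toy_retrieve(query):
    return [mem_id for mem_id, text in sorted(CORPUS.items()) if query in text]


# Labelled cases: each query lists the memory ids a correct retriever should return.
CASES = [
    {"query": "coffee", "expected": [1]},
    {"query": "dentist", "expected": [2, 4]},
    {"query": "passport", "expected": [5]},
]
```

Running `recall_at_k(toy_retrieve, CASES)` against labelled cases like these gives a single number to compare before and after a change. A retrieval path that silently returns nothing also shows up immediately, as a score of zero.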
The opening
The consumer memory market is solving recall on conversational corpora. That is not the hard problem for an agent that touches the world. Memory as a security-operations primitive — provenance, audit, bi-temporal facts, HITL-gated mutation, fact/belief separation — is the gap nobody is filling. Zoya is closer to it than any product on the leaderboard, because it started from a threat model instead of a benchmark. The right move is not to pick a winner. It is to keep building — the provenance column is written now; the bi-temporal one is next.