We spent a while designing the execution plane for a security-automation fleet: many tenants, a flood of SIEM alerts, agents that triage and enrich and occasionally do something destructive like isolate a host. The argument we kept having was the loud one: which broker, which database. Every time we resolved it, we found the real decision had been sitting one step upstream the whole time, and the loud question fell out of the quiet one for free. This is that pattern, written down, because it generalizes well past security automation.

The shape of every decision here

  • The broker is downstream of where the actions land and where the data must legally live.
  • HA for the ledger is downstream of which state is correctness and which is convenience.
  • Surviving a cloud fault is downstream of where you put the commit boundary.
  • Push versus pull is downstream of the fact that a lost push is silent.
  • The thing you were arguing about is almost never the thing that decides the system.

The broker is downstream of where the actions land

We started, by inertia, assuming a managed cloud queue, because the conversation began with cloud-hosted alerts and everything downstream inherited that. That is the wrong order. The broker is not the first decision. It is what falls out of two earlier ones: where do the actions actually land, and where does the data legally have to live.

If the actions overwhelmingly target one cloud and the tenants are that cloud's accounts, stay native and lean all the way in: managed identity, per-tenant federated credentials, private endpoints. The native security model is worth more than any broker elegance, because otherwise you rebuild exactly that plumbing by hand. If the fleet is heterogeneous, many SIEMs and EDRs and OT sensors across many tenants, and you want a control plane you actually own, residency-controllable and self-hostable, the honest answer is a broker you run. We landed on NATS JetStream, and the rest of the design got cleaner because the tenant boundary became a primitive instead of something assembled from namespaces and access rules.

Answer where the actions land and where the data must live, and the broker stops being a debate. It is the consequence of those two answers, not an independent choice.

Account-per-tenant, and the hill to die on

The one decision that matters more than all the others: each tenant's actions execute under that tenant's own scoped credential, never a shared super-identity. A noisy neighbour is an SLA annoyance. A shared action identity turns a single compromised worker into a cross-client breach. So isolation has to be the default you would have to work to violate, not a discipline you remember to enforce.

NATS gives this as a primitive: one account per tenant. Accounts are the hard wall; subjects do not cross them without an explicit export on one side and a matching import on the other. Tenant A's worker structurally cannot subscribe to tenant B's subjects because they live in different accounts. Because the account already scopes everything, the tenant id disappears from the subject space, and the subjects are identical inside every account:

alerts.raw.<source>.<sev>          stream RAW         (workqueue)
alerts.normalized.<domain>.<sev>   stream NORMALIZED  (workqueue)
action.<domain>.<class>            stream ACTIONS     (workqueue)
approval.pending.<domain>          stream APPROVALS   (interest)
dlq.<stage>                        stream DLQ         (limits)

The same principle runs all the way down to the data layer: a worker scoped to a tenant's account connects to its store under a role scoped to that tenant too, so even a buggy worker cannot read another tenant's history. This is the same instinct as a human-in-the-loop approval gate, just moved to the credential layer: never let a component hold authority it does not need.

The bus is the nervous system, the ledger is the memory

A work-queue stream forgets a message the instant it is acked. That is correct, and it is exactly why the bus cannot be your source of truth. The bus moves what is happening; a durable ledger holds what happened. They do opposite jobs, which is why you need both.

The ledger exists to make at-least-once delivery safe to act on. Here is the failure it prevents: a worker calls the EDR to isolate a host, the call succeeds, and the worker crashes before it acks. The bus correctly redelivers. Without durable state you isolate the host twice, or worse if the action is not idempotent. The fix is to guard the side-effecting call with a unique constraint: the worker records the action under its idempotency key before calling out, so a redelivery collides on that key and knows the action was already attempted.

create table action_ledger (
  id              bigserial primary key,
  tenant_id       text not null,
  source_alert    text not null,
  action_type     text not null,
  target          text not null,
  idempotency_key text not null,
  decision        jsonb not null,
  approved_by     text,
  status          text not null,   -- pending|executed|verified|failed
  result          jsonb,
  evidence_uri    text,            -- pointer, not the blob
  created_at      timestamptz default now(),
  unique (idempotency_key)         -- this line is the whole mechanism
);

The ledger is the thing that converts the bus's at-least-once delivery into effectively-once side effects. The broker structurally cannot do that for you. Only durable, constrained state can.
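A minimal sketch of that guard, in Python against SQLite purely so it runs anywhere. The table is a cut-down version of the one above, and `handle_action`, `execute` and `verify` are hypothetical names introduced for illustration, not part of any library. It assumes the target system can be asked whether an action already took effect, which is what lets a redelivery re-verify instead of firing again.

```python
import sqlite3


def open_ledger(path=":memory:"):
    """Open a ledger with a unique idempotency key (simplified schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        create table if not exists action_ledger (
            id              integer primary key,
            tenant_id       text not null,
            action_type     text not null,
            target          text not null,
            idempotency_key text not null unique,
            status          text not null
        )
        """
    )
    return conn


def handle_action(conn, msg, execute, verify):
    """Process one (possibly redelivered) action message.

    Returns 'executed', 'duplicate', or 'recovered'.
    """
    key = msg["idempotency_key"]
    try:
        # Claim the key before the side effect. The unique constraint
        # guarantees that only one claim per key can ever be committed.
        with conn:
            conn.execute(
                "insert into action_ledger "
                "(tenant_id, action_type, target, idempotency_key, status) "
                "values (?, ?, ?, ?, 'pending')",
                (msg["tenant_id"], msg["action_type"], msg["target"], key),
            )
    except sqlite3.IntegrityError:
        # Redelivery: the key was claimed by an earlier attempt.
        row = conn.execute(
            "select status from action_ledger where idempotency_key = ?",
            (key,),
        ).fetchone()
        if row[0] == "executed":
            return "duplicate"
        # Still 'pending': the earlier attempt died around the call.
        # Ask the target whether the action landed before firing again.
        if verify(msg):
            with conn:
                conn.execute(
                    "update action_ledger set status = 'executed' "
                    "where idempotency_key = ?",
                    (key,),
                )
            return "recovered"

    execute(msg)  # the side-effecting call
    with conn:
        conn.execute(
            "update action_ledger set status = 'executed' "
            "where idempotency_key = ?",
            (key,),
        )
    return "executed"
```

The ordering is the design choice worth noticing: the claim is committed before the call, so a crash at any point leaves a row behind for the redelivery to find.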

Everything else the ledger holds, case state, approvals, evidence pointers, is downstream of that one role. Raw payloads go to object storage with the ledger holding only a pointer; secrets go to a vault; hot ephemeral counters go to memory. The durable store is for facts that must be true and queryable in six months when an auditor asks what you did to a client's environment, and why.

Don't make SQLite highly available. Demote it.

For sovereign per-tenant appliances we wanted SQLite: one encrypted file, no daemon, trivial backup, clean offboarding. Then came the obvious question: how do you make it HA when the bus already is? The answer is that you do not. The moment you try to give SQLite active-active HA you are fighting its nature. You demote it instead.

Split the thing you were calling the ledger by how much losing it hurts.

| State | Where it lives | Why |
| --- | --- | --- |
| Idempotency truth (have we done this action?) | A replicated KV bucket on the bus | Correctness. Atomic create / compare-and-swap gives "one writer wins this key," and it is HA for free. |
| Rich history (ledger, cases, approvals) | Local SQLite, as a projection of an append-only event stream | Convenience. A read model you are allowed to lose and rebuild by replaying the stream. |

Now failover is just an ownership problem, solved with the bus and nothing else. Each tenant has one active owner holding a short-TTL lease key. If the owner dies the lease expires, a standby wins it with an atomic create (so no split-brain), restores the latest snapshot, replays the tail of the stream, resumes the durable consumer from its replicated ack position, and carries on. Reprocessing the in-flight messages is safe by construction, because any action already done is already a key in the HA KV, so the redelivery no-ops and re-verifies instead of firing twice.
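The takeover sequence above can be sketched with a toy key-value store. `ToyKV` is an in-memory stand-in for the replicated bucket, not a real client API, and `try_take_over`, `restore_snapshot` and `replay_tail` are hypothetical names. In a real deployment the atomicity of `create` comes from the replicated store itself; here it is only modelled.

```python
import time


class ToyKV:
    """In-memory stand-in for a replicated KV bucket with per-key TTL."""

    def __init__(self, clock=time.monotonic):
        self._data = {}  # key -> (value, expires_at or None)
        self._clock = clock

    def _live(self, key):
        """Return the entry for key, dropping it first if its TTL passed."""
        entry = self._data.get(key)
        if entry is None:
            return None
        _, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]
            return None
        return entry

    def create(self, key, value, ttl=None):
        """Atomic create: succeeds only if the key does not exist."""
        if self._live(key) is not None:
            return False
        expires_at = None if ttl is None else self._clock() + ttl
        self._data[key] = (value, expires_at)
        return True

    def renew(self, key, value, ttl):
        """Extend the TTL only if the caller still holds the key."""
        entry = self._live(key)
        if entry is None or entry[0] != value:
            return False
        self._data[key] = (value, self._clock() + ttl)
        return True


def try_take_over(kv, tenant, node, ttl, restore_snapshot, replay_tail):
    """Attempt to become the single active owner for a tenant.

    Only the node whose create wins the lease key rebuilds local state.
    """
    if not kv.create(f"lease.{tenant}", node, ttl=ttl):
        return False  # someone else holds a live lease
    restore_snapshot(tenant)  # load the latest local snapshot
    replay_tail(tenant)       # catch up from the event stream
    return True
```

The active owner would call `renew` on a period well inside the TTL; a standby simply retries `try_take_over` and wins only once the lease has lapsed.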

This also dissolves the SQLite single-writer worry that started the whole database argument. Correctness never wanted many writers; it wanted exactly one authoritative writer per key, which at cluster scale is a KV compare-and-swap. You load-balance across tenants, never within a tenant. The per-tenant writer stays singular on purpose. That is not a limitation you work around; it is the invariant that keeps the thing correct.

Durability is a commit boundary

Then the real fear: a major cloud fault that leaves the data unable to reach the bus. The instinct is to add a buffer. The sharper framing is that you are choosing a commit boundary, the point upstream of which an alert is not yet durably yours, and "survive a cloud fault" just means putting that boundary on the right side of the fault.

| Commit boundary | You lose data when... |
| --- | --- |
| Central bus | The path to central breaks at all |
| A cloud bucket | The cloud region dies (so it does not survive a cloud fault) |
| An edge buffer near the source | The edge dies, which usually means the source died too |
| No new commit; track a cursor into the source | The source's own retention expires |

Two consequences. First, the cheap default is the bottom row: most SIEMs retain and expose a time window, so you do not commit anything new, you just never advance your watermark past data that has not been durably accepted downstream. The bus comes back, you re-pull from the last honest watermark. The one discipline that makes this work is that the watermark advances on durable acceptance, never on a publish attempt. Advance on attempt and you get silent gaps.

Second, an in-cloud staging bucket against a cloud-wide fault is theatre, because the fault takes the bucket and the bus together. The thing that survives a cloud fault is something outside the failed cloud, which means an edge node, before the data crosses into the cloud at all. A leaf with its own local stream commits the alert the instant it is produced, accumulates while the uplink is down, and drains upward when it heals. A bucket earns its place only when the source exports files natively or you want cross-region staging, and then only in a different failure domain than the bus, committing on the write, with a real cursor.
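The watermark discipline from the first consequence can be sketched as one reconciliation step. `reconcile`, `fetch_since` and `publish_durably` are hypothetical names; the sketch assumes the source returns events oldest first with strictly increasing timestamps, and that `publish_durably` returns True only once the bus has durably accepted the event.

```python
def reconcile(fetch_since, publish_durably, watermark):
    """Pull events newer than the watermark and return the new watermark.

    The watermark moves only past events that were durably accepted,
    never on a mere publish attempt.
    """
    for event in fetch_since(watermark):
        if not publish_durably(event):
            # Stop at the first failure. This event and everything after
            # it stay behind the watermark and are re-pulled next time.
            break
        watermark = event["ts"]
    return watermark
```

During an outage the function returns the last honest watermark unchanged, so the next run after recovery picks up exactly where durable acceptance stopped, provided the source still retains that window.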

Durability and the ability to act are separate problems. Edge commit means no alert is lost. It does not mean you can act during the outage, because the workers are still on the other side of the fault. Acting through an outage is a per-contract decision, not a default.

Push is the fast path, pull is the always-on safety net

Sources are mixed: some push, some pull, some on-prem, some cloud. Push is the right primary where you want fast response. But the pull path on a push-primary source cannot be reactive, because a lost push is silent. The SIEM fired its webhook, believes it delivered, and you never knew the alert existed. There is no error to catch in the moment.

So pull is not something you turn on when something looks off. It is a standing reconciliation loop, always running on a relaxed cadence, sweeping the source on a watermark and backfilling whatever push did not deliver. Push is the fast path; pull is the slow always-on net underneath it that finds the trouble you would otherwise never see.

| Path | Guarantee | Role |
| --- | --- | --- |
| Push | At-most-once, but fast | Seconds of latency for the alerts that need it |
| Pull | At-least-once, eventually | Continuous backstop; nothing is ever permanently missed |

The reason you can run both without double-processing is the idempotency key from earlier. The same alert arriving twice, once fast via push and once later via reconciliation, collides on its key and the second no-ops. That is what makes the hybrid legal, and it means push and pull never have to coordinate: keep their cursors independent and let dedup absorb the overlap. Whether a given source can have a pull backstop at all is a capability of that source, which is exactly the sort of thing that belongs in a declarative per-tenant config rather than wired by hand.
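A small sketch of why the two paths need no coordination. `Intake` and `idempotency_key` are hypothetical names, the in-memory set stands in for the HA idempotency store, and the key derivation assumes each source exposes a stable alert id, which is not true of every source.

```python
def idempotency_key(alert):
    """Derive a key that is identical however the alert arrived."""
    return f'{alert["source"]}:{alert["alert_id"]}'


class Intake:
    """Single entry point shared by the push and pull paths."""

    def __init__(self):
        self._seen = set()   # stand-in for the replicated idempotency store
        self.processed = []  # (key, path) for every alert actually handled

    def accept(self, alert, path):
        """Handle an alert once; a second arrival of the same key no-ops."""
        key = idempotency_key(alert)
        if key in self._seen:
            return False
        self._seen.add(key)
        self.processed.append((key, path))
        return True


# Push delivers a1 and a3; the webhook for a2 is lost silently.
intake = Intake()
for alert_id in ("a1", "a3"):
    intake.accept({"source": "siem", "alert_id": alert_id}, "push")

# The standing pull sweep later sees all three and backfills only a2.
for alert_id in ("a1", "a2", "a3"):
    intake.accept({"source": "siem", "alert_id": alert_id}, "pull")
```

Neither path consults the other's cursor; the overlap is absorbed entirely by the key collision.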

The broker was a connectivity decision, not a queue decision

The most clarifying moment came late. The original pull toward a managed cloud queue was not about messaging semantics at all. It was two connectivity facts: we did not want inbound ports on our infrastructure, and we could not reach every source to pull. Both are real. Neither is a queue reason.

A managed broker is attractive because both ends dial out to a door someone else operates. But it does not solve the unreachable-source problem, because something still has to get an alert out of an unreachable network and into the broker, and that something is an agent inside the client network publishing outbound. Which is functionally a leaf collector. The two concerns collapse into one answer: edge collection plus outbound-only transport. The broker is just the destination.

Seen that way, the leaf model answers both concerns, and better. Clients expose zero inbound; the leaf dials out and the source only ever talks to something on its own LAN. Unreachable sources stop being unreachable because you are no longer reaching in. The one thing a managed broker still gives that self-hosting does not is that the exposed rendezvous endpoint is someone else's to operate. So the real fork is narrow and worth stating plainly:

Are you willing to operate one hardened, mutually-authenticated ingress door yourself, or do you specifically need that exposed surface to be someone else's responsibility? That, not push-versus-pull and not one broker versus another, is the actual decision.

If it is the former, self-host and own one well-defined door that completes no handshake for an unauthenticated scanner. If it is the latter, use a managed instance of the same ecosystem so you stay portable, or accept a managed broker at the edge as a DMZ and pull from it into your own plane the instant alerts land. Either way the processing plane stays yours.

The part you were not arguing about

One last upstream move, because it is the one that decides whether any of this scales. The substrate, the broker and the workers, was never the expensive part. The forever-cost is the per-tenant configuration: which sources a tenant has, its credential references, its enabled action classes, its concurrency limits, its approval policy. Get that one declarative, git-backed object right and the whole topology, the queues, the workers, the ingestion paths, derives from it instead of being designed per client. Get it wrong and you drown in per-tenant snowflakes at client number fifty.
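As an illustration only, here is one possible shape for that object and for deriving topology from it. Every name in it (`TenantConfig`, `Source`, `derive_topology`, the field names, the `vault:` reference format) is an assumption made for the sketch; only the stream names and the `action.<domain>.<class>` subject shape come from the layout earlier in this piece.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    mode: str            # "push", "pull", or "push+pull"
    credential_ref: str  # a pointer into a vault, never the secret itself


@dataclass(frozen=True)
class TenantConfig:
    tenant: str
    sources: tuple
    action_classes: tuple
    max_concurrency: int
    approval_required: frozenset = frozenset()


def derive_topology(cfg):
    """Return the account, streams, ingestion jobs and workers implied by cfg."""
    ingest = []
    for source in cfg.sources:
        if "push" in source.mode:
            ingest.append(("webhook", source.name))
        if "pull" in source.mode:
            ingest.append(("reconciler", source.name))

    workers = [
        {
            "subject": f"action.*.{action_class}",
            "max_in_flight": cfg.max_concurrency,
            "needs_approval": action_class in cfg.approval_required,
        }
        for action_class in cfg.action_classes
    ]

    return {
        "account": cfg.tenant,
        "streams": ["RAW", "NORMALIZED", "ACTIONS", "APPROVALS", "DLQ"],
        "ingest": ingest,
        "workers": workers,
    }
```

Onboarding a tenant then means committing one new config object and running the derivation, rather than designing a topology by hand.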

That is the pattern in one line. The loud decision, the broker, the database, the protocol, is almost always the consequence of a quieter one you have not named yet: where the actions land, which state is correctness, where the commit boundary sits, what a lost message looks like, who owns the exposed door. Name the upstream decision and the downstream one stops being an argument. This is the same instinct as the one in the Buddy System: the format was never the problem, ownership was. Find the real seam and the rest falls out.


Drawn from designing a multi-tenant security-automation plane at fleet scale. The specific calls here (a self-hosted bus, account-per-tenant, ledger-as-truth, demoted SQLite) are context-dependent trades, not universal law. The transferable part is the habit: when a tooling choice will not settle, look one step upstream for the decision it is actually downstream of.