The syslog you can't get back

The brief

A SOC pipeline went dark. A sensorbox handling UDP syslog for one client had filled its /var overnight; by morning the disk was healthy again, log rotation having caught up, and the system was processing events normally. The gap, three hours of one client's stream, was real and silent. Nothing in the pipeline could recover it from where it was. We weren't part of the original design. We were called in after the fact, with three deliverables on the table: recover what could be recovered, write up the root cause, propose what should change so this stops happening.

This is the trip report. It generalises. UDP-syslog outages are a category rather than a one-off, and investigation jobs like this resolve the same way most of the time.

The investigation

The first hour was diagnostic. Three questions: how much was lost, when exactly the loss started and ended, and what state the box had been in during the outage.

/var was at 64% by the time we logged in, comfortably clear of trouble. The journal told the story:

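# Search the daemon's journal for out-of-space and queue-drop messages
journalctl -u rsyslog --since "2 days ago" \
    | grep -iE "no space|enospc|queue|dropped"
# Current usage of the partition that filled
df -h /var
# First few entries in the log directory
ls -la /var/log/ | head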

The fillup had happened around 02:30 UTC. The daemon stayed up but couldn't persist new messages. Writes failed with ENOSPC, the action queue backed up to its memory bound, and within minutes the kernel was dropping UDP packets at the socket. Recovery began at 05:40 when logrotate's daily run trimmed enough older content to free disk; the daemon flushed its queue and resumed normal operation. No human action was involved. The system had self-recovered, which is part of why nobody noticed.

We pulled the existing monitoring config while we were there. Prometheus alerts existed for the root filesystem and for the named application volumes, with thresholds at 85% and 95%. /var was on its own partition under the operator's standard build, and no rule explicitly named it. The monitoring had been working as configured for years; it had simply never been asked about /var.

Even if it had been, the fillup was fast enough (a misbehaving application started producing several gigabytes per minute around 02:25) that a paged engineer probably could not have logged in and cleaned up before the daemon stopped persisting. Both layers of "this should have been caught" turned out to be wishful thinking under the actual incident's timing.

A pageable alert at 85% buys you forty minutes against a steady fill and zero against a runaway one. Monitoring was never going to catch this.

That gave the outage window: 02:30 to 05:40 UTC, three hours and ten minutes. The sender side was a fleet, not a single box. Dozens of application hosts emitted via UDP to our sensorbox while also writing locally to their own log files. The OS socket on each host was the only buffer between the application and the wire, and that buffer was small. Packets emitted during the outage were transmitted, dropped at our receiver, and gone: no sender kept a copy in its network stack to retransmit.

The local files on each host were a different story. Most of the fleet retained the original records in their native paths: /var/log/syslog, application-specific files, the usual layout. The forwarding had failed. The local writing had not. Recovery was going to be possible, but it was going to be per-host, slicing each machine's log file separately, rather than one cut from a central archive.

Phase 1: pulling the data back

UDP syslog is a contract that doesn't promise delivery. The sender writes the packet, hands it to the kernel, and never hears back. There is no retry, no acknowledgement, no sequence number. Packets dropped during the outage are erased from the network path. The only recovery available is to read the equivalent records from wherever else they happen to exist (in this case, the local log files on each origin host) and ship them again, in a way that lands correctly the second time.
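The "never hears back" part is easy to see from the sender's side. The sketch below is a minimal illustration, not the fleet's actual sender: the function names and the sample message are ours. It sends one datagram to a local port where nothing is listening, and the send call still reports success.

```python
import socket

def send_udp_syslog(message: str, host: str, port: int) -> int:
    """Send one syslog-style datagram; return the bytes handed to the kernel."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # sendto() returns once the kernel has accepted the datagram.
        # Nothing here waits for, or can observe, delivery at the receiver.
        return sock.sendto(message.encode("utf-8"), (host, port))

def unused_local_udp_port() -> int:
    """Find a local UDP port that (very probably) has no listener."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as probe:
        probe.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        return probe.getsockname()[1]

if __name__ == "__main__":
    port = unused_local_udp_port()
    sent = send_udp_syslog("<134>app: hello", "127.0.0.1", port)
    print(f"kernel accepted {sent} bytes; nobody was listening")
```

From the application's point of view, a datagram dropped at a full receiver and a datagram delivered look the same.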

Step one was to pin the window to the minute. We gave the client 02:28 UTC as the lower bound (a couple of minutes before the first failure in our logs) and 05:42 UTC as the upper bound (a couple of minutes after the daemon's first successful write post-recovery). Off by a minute either way is fine. Off by an hour produces duplicates or gaps the team will fight about later.

Step two was choosing a transport. With the source data spread across dozens of hosts rather than centralised in one archive, the choice was effectively between two shapes: pull or push. Pull meant SCP'ing the time-windowed log slice from every host to a single staging area on our side and feeding it through a file source. Push meant running a small one-shot forwarder on each host that read the local file, filtered by timestamp, and shipped over a reliable transport.

We went with push for three reasons. The fleet already had a configuration-management agent we could use to drop a temporary config; pulling would have needed credentialed access to every host. Push isolated each host's progress: one stuck transfer didn't block the others. And push let us use the sender's own clock as the timestamp source, which mattered for the next step.

The shape of the push:

On each origin host, a temporary rsyslog config with an imfile source pointed at the local log file, with a startmsg.regex matching syslog line headers and a readMode appropriate for the file format.
A filter rule that dropped any line whose embedded timestamp fell outside the 02:28-05:42 UTC window.
An action sending to a dedicated TCP port (RFC 6587) on the sensorbox, with the replay=true tag added to each message.
A one-shot lifecycle: the config was deployed, the daemon restarted to load it (rsyslog does not re-read its configuration on HUP), the window flushed, the config removed.
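The filter rule in that list is the part most worth checking offline before deploying. The sketch below shows the same window test in Python rather than in rsyslog's own config language. It assumes each line begins with an RFC 3339 timestamp (the RFC 5424 style). The incident date is hypothetical, since only the times of day matter here, and the function names are ours.

```python
from datetime import datetime, timezone
from typing import Iterable, Iterator, Optional

# Hypothetical incident date; only the 02:28-05:42 UTC bounds come from the report.
WINDOW_START = datetime(2026, 1, 15, 2, 28, tzinfo=timezone.utc)
WINDOW_END = datetime(2026, 1, 15, 5, 42, tzinfo=timezone.utc)

def embedded_timestamp(line: str) -> Optional[datetime]:
    """Parse a leading RFC 3339 timestamp; return None if there is none."""
    token = line.split(" ", 1)[0]
    if token.endswith("Z"):
        token = token[:-1] + "+00:00"  # older Pythons reject a bare 'Z'
    try:
        ts = datetime.fromisoformat(token)
    except ValueError:
        return None
    if ts.tzinfo is None:
        return None  # refuse to guess the zone of a naive timestamp
    return ts

def lines_in_window(lines: Iterable[str], start: datetime,
                    end: datetime) -> Iterator[str]:
    """Yield only the lines whose embedded timestamp falls inside the window."""
    for line in lines:
        ts = embedded_timestamp(line)
        # Lines without a parseable timestamp are skipped in this sketch.
        if ts is not None and start <= ts <= end:
            yield line
```

Comparing timezone-aware values means a host logging in local time with an offset is still filtered against the UTC window correctly. A host logging naive local timestamps would need its zone supplied explicitly.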

Other options we considered and discarded: telling each host to UDP-replay over the original port (re-creates the failure mode that started this), forwarding through the existing application-level syslog relay (would have re-mixed the streams and lost the per-host attribution we needed for dedup), and bulk file ingest via the SIEM's direct API (would have required onboarding every host's authentication, when the sensorbox already had the SIEM credential).

Landing the replay correctly

Step three was timestamps. Replayed messages arrive today but represent yesterday's events. By default, most syslog-to-SIEM pipelines stamp ingestion time onto every record. Without configuration changes, the replay would land as a spike today and the dashboards would still show a gap during the actual outage window. The fix is to extract the timestamp from the message body, not from the receive time. In rsyslog this is the difference between $timegenerated and $timereported; we wanted $timereported. The SIEM-side parser was already pointing at the embedded timestamp field, which made the downstream a non-issue. Other deployments are not always so lucky.
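A small sketch makes the spike-versus-gap effect concrete. The records are hypothetical, and the field names "reported" and "received" are ours, chosen to mirror the two rsyslog properties. The same three replayed events are counted per hour, once by the time in the message and once by the time of receipt.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def hourly_counts(records, field):
    """Count records per UTC hour, keyed on the chosen timestamp field."""
    counts = Counter()
    for record in records:
        hour = record[field].astimezone(timezone.utc).replace(
            minute=0, second=0, microsecond=0)
        counts[hour] += 1
    return counts

# Hypothetical replayed records: emitted during the outage, received a day later.
outage = datetime(2026, 1, 15, 3, 0, tzinfo=timezone.utc)
replayed = [
    {"reported": outage + timedelta(minutes=m),
     "received": outage + timedelta(days=1, hours=7, minutes=m)}
    for m in (5, 20, 45)
]

by_event_time = hourly_counts(replayed, "reported")   # lands in the outage hour
by_ingest_time = hourly_counts(replayed, "received")  # lands a day later
```

Keyed on the reported time, all three events fall in the 03:00 hour of the outage. Keyed on the received time, the outage hour stays empty and the events pile up in the hour the replay ran.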

Step four was tagging. Every replayed record got replay=true and replay_batch=<incident-id> at ingest. Two reasons. Analysts looking at events in the recovered window needed to know what was fresh observation versus backfill. And the morning's dashboards needed to filter the replay batch out of "events in the last hour" rollups so the metrics didn't look like a SOC traffic spike.

Step five was deduplication. We were conservative on the window, so some events would land twice. We turned on content-hash plus original-timestamp dedup at the SIEM-side ingest for the duration of the replay, then turned it off. Permanent dedup is a different and harder project.
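Steps four and five can be sketched together. This is an illustration of the idea, not the SIEM's actual ingest hook: the record fields, the function names, and the batch id are hypothetical, and including the host in the hash is our assumption about what "content" should cover.

```python
import hashlib

def dedup_key(original_timestamp: str, host: str, message: str) -> str:
    """Hash the original timestamp together with the record content."""
    # A separator that cannot appear in normal log text avoids ambiguous joins.
    payload = "\x1f".join((original_timestamp, host, message)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def replay_ingest(records, seen, batch_id):
    """Yield each replayed record once, tagged as backfill.

    `seen` is a set of keys for records already in the SIEM; it is updated
    in place so duplicates within the replay are also suppressed.
    """
    for record in records:
        key = dedup_key(record["timestamp"], record["host"], record["message"])
        if key in seen:
            continue  # already landed, live or earlier in this replay
        seen.add(key)
        yield {**record, "replay": True, "replay_batch": batch_id}
```

The trade-off is visible in the key: two genuinely distinct events with the same host, the same timestamp, and identical text collapse into one. That is acceptable for a bounded replay and is one reason permanent dedup is harder.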

The replay ran clean for most hosts. About 99% of the missing events arrived. The shortfall came from a handful of machines that had cycled their local logs faster than expected (a different rotation policy than the rest of the fleet) and from a small number of containerised services that never wrote to disk at all; those streams had only ever existed on the wire, and the wire had dropped them. We documented the affected sources, recommended a fleet-wide audit of local-retention settings, and moved on.

Root cause: it's not the disk

The /var fillup is what triggered the loss. It is not the root cause. The root cause is that the architecture had no path to surface its own degradation. Five separate things had to be wrong simultaneously for three hours of data to vanish silently overnight, and each was a known anti-pattern with a known fix.

UDP between the sender and the sensorbox. The whole class of incident is structurally impossible with a reliable transport. The deploy was UDP because the original setup, years ago, was UDP, and nobody had revisited.
Monitoring missed the partition that filled, and would not have saved it anyway. Disk alerts existed on the sensorbox, but the rules were per-mountpoint and /var was on its own partition that no rule explicitly named. Even with correct coverage, the fillup rate (several gigabytes per minute from a misbehaving application) was faster than any plausible human response. Lead-time monitoring is necessary but insufficient when the surface that fills can fill faster than a paged engineer can act.
The action queue had a memory bound but no spillover. When the disk filled, the queue backed up to its memory limit and then started dropping silently. It could have spilled to a different filesystem, but it wasn't configured to.
Daemon diagnostic logs lived on /var. When /var filled, the daemon couldn't write the log line that said "my disk filled up." Forensic reconstruction was harder than it should have been.
No downstream alert on absent data. The SIEM had three hours of zero events from one client and didn't notice. There was no "expected minimum events per hour per source" rule. A silently absent stream looks identical to a quiet one.

Each failure alone is survivable. Together they produce silent permanent data loss with zero detection until a human noticed something off the next morning.

Phase 2: short-term fixes to the current architecture

Five fixes, each small, each individually justified, each closing one of the five contributing failures above. None require redesigning anything. They can be applied to the existing system this week.

Disk monitoring that doesn't depend on knowing the layout. Per-mountpoint rules are how this incident slipped through. Replace them with a rule that covers every real filesystem on the host, regardless of where it's mounted:

- alert: HostFilesystemFilling
  expr: |
    node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs|devtmpfs"}
      / node_filesystem_size_bytes{fstype!~"tmpfs|overlay|squashfs|devtmpfs"} < 0.15
  for: 10m
  labels: { severity: warning }
  annotations:
    summary: "{{ $labels.instance }} {{ $labels.mountpoint }} below 15% free"

A second copy at 0.05 / severity: critical. Page on the critical, ticket the warning. The change here is the absence of a hardcoded mountpoint: any non-pseudo filesystem on any host that drops below threshold fires the alert, no matter where the operator's build script put it.

Monitoring is necessary and won't be sufficient. An alert at 85% used gives an engineer maybe forty minutes if the fillup is steady, and effectively zero minutes if the fillup is a runaway application writing gigabytes per minute. The role of monitoring is to catch the slow fills; the fast ones have to be handled by something that doesn't require a human in the loop. That something is the next fix.
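The lead-time arithmetic is worth doing once with numbers. The figures below are hypothetical, picked only to reproduce the shape of the argument: a 100 GiB partition, a steady fill of 0.375 GiB per minute, a runaway fill of 5 GiB per minute, and a 15-minute gap between page and first useful action. None of them are measurements from the incident.

```python
def free_gib(size_gib: float, used_percent: int) -> float:
    """Space left when the alert fires at the given usage percentage."""
    return size_gib * (100 - used_percent) / 100

def lead_time_minutes(size_gib: float, used_percent: int,
                      fill_gib_per_min: float) -> float:
    """Minutes between the alert firing and the disk being full."""
    return free_gib(size_gib, used_percent) / fill_gib_per_min

def margin_minutes(lead_minutes: float, response_minutes: float) -> float:
    """Time left to act after the human has actually arrived; negative = too late."""
    return lead_minutes - response_minutes

# All inputs are assumptions for illustration.
SIZE_GIB = 100
RESPONSE_MINUTES = 15

steady = lead_time_minutes(SIZE_GIB, 85, 0.375)   # slow, constant growth
runaway = lead_time_minutes(SIZE_GIB, 85, 5.0)    # misbehaving application

steady_margin = margin_minutes(steady, RESPONSE_MINUTES)
runaway_margin = margin_minutes(runaway, RESPONSE_MINUTES)
```

Under these assumptions the steady fill leaves forty minutes of lead time and a comfortable margin. The runaway fill leaves three minutes, which is less than the response time, so the margin is negative and no threshold short of "always alerting" rescues it.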

Separate the syslog spool from the daemon's own logs. Mount a dedicated filesystem at /var/spool/syslog (or wherever the daemon's queue lives), keep the daemon's diagnostic logs on the root volume. When the spool fills, the daemon can still complain in writing about it.

Bounded queues with explicit overflow behaviour. In rsyslog: queue.type="LinkedList", queue.filename="...", queue.maxdiskspace="5G", action.resumeRetryCount="-1". Setting queue.filename is what makes the in-memory queue disk-assisted, so it spills to disk when memory fills. When both fill, the daemon holds writes rather than dropping silently; how long it holds before discarding is governed by queue.timeoutEnqueue, so set that deliberately rather than trusting the default. Senders see backpressure (if they're on TCP); UDP senders still lose data, but the operator can at least see the queue saturation in monitoring.
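The intended behaviour is easier to reason about as a small model. The sketch below is not rsyslog's implementation: the class and its names are ours, the "spill" tier is an in-memory stand-in for the on-disk spool, and raising an exception stands in for blocking the producer. It assumes a memory bound of at least one item.

```python
from collections import deque

class QueueFull(Exception):
    """Raised to the producer so that overflow is visible, never silent."""

class SpillingQueue:
    """FIFO with a bounded memory tier and a byte-bounded spill tier."""

    def __init__(self, max_memory_items: int, max_spill_bytes: int):
        self.max_memory_items = max_memory_items
        self.max_spill_bytes = max_spill_bytes
        self.memory = deque()
        self.spill = deque()  # stands in for the on-disk spool in this model
        self.spill_bytes = 0

    def put(self, message: bytes) -> str:
        """Enqueue a message and report which tier accepted it."""
        # Use memory only while nothing is spilled, so FIFO order is preserved.
        if not self.spill and len(self.memory) < self.max_memory_items:
            self.memory.append(message)
            return "memory"
        if self.spill_bytes + len(message) <= self.max_spill_bytes:
            self.spill.append(message)
            self.spill_bytes += len(message)
            return "spill"
        raise QueueFull("memory and spill tiers are both at their bounds")

    def get(self) -> bytes:
        """Dequeue the oldest message; raises IndexError when empty."""
        message = self.memory.popleft()
        if self.spill:
            # Promote the oldest spilled message into the freed memory slot.
            promoted = self.spill.popleft()
            self.spill_bytes -= len(promoted)
            self.memory.append(promoted)
        return message
```

The point of the model is the last branch of put(): when both tiers are full the producer is told, loudly, instead of the message disappearing. A TCP sender experiences that as backpressure. A UDP sender cannot, which is why this fix and the transport migration belong together.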

Migrate senders to TCP or RELP. RELP (Reliable Event Logging Protocol) is the right answer for any sender that supports it: messages stay on the sender until acknowledged, retries are automatic, the sender survives the receiver restarting. Plain TCP is the right answer for everything else. Audit every syslog source, flip them in stages.

Absent-data alerting downstream. For each client + source pair, compute a rolling minimum events-per-hour baseline. Alert when traffic falls more than 50% below baseline for more than 15 minutes. This is the alert that would have caught the outage at 02:45, not at 09:00 the next morning.
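A minimal version of that rule, assuming per-minute event counts for one client and source pair and a baseline already expressed per minute. The function name and parameters are ours; a real deployment would express this in the SIEM's or the monitoring system's own rule language.

```python
from typing import Optional, Sequence

def minutes_until_alert(per_minute_counts: Sequence[int],
                        baseline_per_minute: float,
                        drop_fraction: float = 0.5,
                        sustain_minutes: int = 15) -> Optional[int]:
    """Return minutes from the start of the series until the alert fires.

    The alert fires once traffic has stayed more than `drop_fraction`
    below baseline for `sustain_minutes` consecutive minutes.
    Returns None if it never fires.
    """
    threshold = baseline_per_minute * (1.0 - drop_fraction)
    consecutive_low = 0
    for minute, count in enumerate(per_minute_counts):
        consecutive_low = consecutive_low + 1 if count < threshold else 0
        if consecutive_low >= sustain_minutes:
            return minute + 1  # alert at the end of this minute
    return None
```

For a series that starts at 02:00 and goes silent at 02:30, the function returns 45, which is the 02:45 alert. A source that is merely quiet, or that dips for ten minutes and recovers, never fires.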

We left the team with a runbook, the Prometheus rules, and an audit table showing every sender, its current transport, and whether an upgrade path existed.

Phase 3: the re-architecture, if anyone wants it

The five fixes above harden the existing design. They don't change its shape. The shape itself is worth saying out loud: a single sensorbox terminating UDP syslog and forwarding to a SIEM is a 1990s pattern that survived into 2026 because nobody had reason to revisit it. The cost is the incident we just investigated. The benefit is mostly inertia.

If we were rebuilding from scratch, the data plane would look different. The change is less about transport than about who holds the data.

Aspect                  Sensorbox today               Client-owned bucket
System of record        the SOC                       the client
Where data lives        SOC sensorbox, finite disk    client's immutable object store
Who pays storage        the SOC                       the client's account
Compliance / liability  SOC holds customer data       client owns retention and access
Replay after an outage  per-host log slicing          rewind the bucket cursor
Failure mode            silent UDP drop on full disk  behind the client's contract

The wrong instinct is to keep the SOC as the system of record and just move the transport to a cloud queue (Azure Service Bus, AWS SQS, GCP Pub/Sub). That improves transport reliability but preserves the original design's custody problem: the SOC ends up holding the client's data, owning its retention, paying for its compliance footprint, and being the entity an auditor or a data-subject-access request comes to. Queues can be made to work at impressive volumes, but they don't change the politics, and as volume grows their per-message economics get worse than the alternative.

The right instinct is to invert the custody relationship.

The client writes to an immutable bucket they control. Object storage with retention enforcement: S3 Object Lock, GCS Bucket Lock, Azure Blob immutability policies. The bucket lives in the client's cloud account, billed to the client, governed by the client's retention contract. Throughput is no longer a concern about message-queue capacity; object storage handles arbitrary log volume natively. Compliance stops being an awkward conversation about where the SOC stores customer data, because the SOC doesn't.

The bucket is part of the client's contract. This is the structural change, and it sounds bureaucratic while being entirely technical in its consequences: it changes who is liable for what, who has to comply with which regulations, and who a data-access request goes to.

The SOC reads through an anonymisation interface. Between the client's bucket and the SOC's SIEM is a defined boundary that strips, hashes, or generalises whatever fields are out of scope for SOC analysis. The interface is contractual: the client says "you may see these fields raw, these fields hashed, these fields not at all," and the interface enforces it. The SIEM only ever sees the result. Auditing the interface is auditing the data-sharing contract. This is the same pattern that confidential computing, customer-managed encryption keys, and similar zero-trust designs converge on: the boundary is the artefact, not the trust.
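The contract-as-policy idea fits in a few lines. In the sketch below the field names, the policy vocabulary, and the key are all hypothetical. It covers "raw", "hashed", and "not at all", and leaves out generalisation (for example, truncating an IP address to its network), which would be a further rule of the same kind.

```python
import hashlib
import hmac

RAW = "raw"
HASHED = "hashed"

def anonymise(record: dict, policy: dict, key: bytes) -> dict:
    """Apply the client's field policy; fields not named in it are dropped."""
    out = {}
    for field, value in record.items():
        rule = policy.get(field)
        if rule == RAW:
            out[field] = value
        elif rule == HASHED:
            # Keyed hash: stable for correlation, not reversible by dictionary
            # lookup without the key.
            out[field] = hmac.new(
                key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
        # Any other rule, or no rule at all, means the field never leaves.
    return out

# Hypothetical contract for one client.
policy = {"src_ip": RAW, "user": HASHED}
```

Defaulting to "drop" for unlisted fields is the design choice that matters: a new field added by the client's application stays invisible to the SOC until the contract says otherwise. Using a keyed hash keeps the same user correlatable across events without exposing the name.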

The bucket architecture makes the client the system of record. The SOC becomes a downstream reader, not the place the data lives.

What the new shape buys

Replay is a re-read of the bucket. A processing outage on the SOC side recovers by rewinding the bucket cursor and re-fetching. No coordination with the client, no asking for archives, no per-host log slicing. The incident we just investigated would have been a one-line bucket-scan, against immutable data the client retains for the contracted period whether the SOC pulled it or not.
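What "rewinding the cursor" means depends on the key layout, so the sketch below fixes one as an assumption: one object per batch, named so that lexicographic key order is time order. The key names and dates are hypothetical, and the list stands in for a bucket listing.

```python
from bisect import bisect_right
from typing import List

def keys_after(sorted_keys: List[str], cursor: str) -> List[str]:
    """Return the keys that sort strictly after the cursor."""
    return sorted_keys[bisect_right(sorted_keys, cursor):]

# Hypothetical layout: time-sortable object names under one prefix.
bucket_keys = sorted([
    "logs/2026-01-15T02-00.ndjson",
    "logs/2026-01-15T02-30.ndjson",
    "logs/2026-01-15T03-00.ndjson",
    "logs/2026-01-15T06-00.ndjson",
])

# Normal operation: the cursor is the last key the reader processed.
live_cursor = bucket_keys[-1]
nothing_new = keys_after(bucket_keys, live_cursor)

# Recovery: set the cursor just before the start of the lost window.
rewound_cursor = "logs/2026-01-15T02-28"
to_refetch = keys_after(bucket_keys, rewound_cursor)
```

The rewound cursor does not need to be a real key, only a string that sorts at the right point. Everything after it is re-read from immutable objects, and the dedup and tagging concerns from the replay still apply on the SIEM side.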

The sensorbox shrinks or disappears. When the client's existing systems already write to cloud object storage (most modern applications can), nothing is needed on the SOC side beyond the bucket reader and the anonymisation interface. When the client's systems emit only syslog (legacy applications), a small shim on the client side translates syslog to bucket writes. The failure mode that started this incident now lives behind the contractual boundary, in infrastructure the client owns and operates.

Per-customer separation is built in. Each client has their own bucket, their own anonymisation interface, their own retention. Multi-tenancy is enforced at the boundary, not at the application layer. The priority-inversion patterns that show up in shared SOAR pipelines do not show up here because there is no shared layer to invert.

Observability is the bucket's exhaust. Object-storage providers emit write rates, object counts, age of latest object, bytes written per minute. Dashboards build themselves. The absent-data alert from Phase 2 becomes "no objects written to client X's bucket for fifteen minutes," which the provider's own monitoring tells you about without any custom rules.

At SOC-realistic volumes (single-digit gigabytes per day per client at the low end, low-terabytes per day per client at the high end), object-storage costs run double-digit to low-three-digit dollars per month per customer, depending on retention period and storage class, and are paid out of the client's account because the bucket is theirs. The SOC's own cost becomes the anonymisation interface and the SIEM-side consumer, both well-bounded by design and neither scaling with archive size. The sensorbox host that the existing architecture relies on probably costs more than the entire SOC-side bill of the new design.

The handover

We left the team with three artefacts. A recovery runbook for the next time this happens, in case the re-architecture takes a while. The five short-term fixes, with config snippets, Prometheus rules, and an audit table. An architecture proposal for the rebuild, with cost estimates and a migration sequence that lets the existing system keep running until the new one parity-validates.

None of the long-term changes were urgent. All of them were obvious once the diagnosis was on paper. The team agreed to move on the short-term fixes inside a sprint and put the re-architecture on the next quarter's planning. Without it, the next /var-full incident is a matter of when, not if.

What this kind of work looks like in general

UDP-syslog outages are a category. The pattern is identical across deployments: the sender is UDP, the receiver has a single point of failure with finite disk, the architecture has no path to surface its own degradation, the loss is silent until a human notices something downstream. Most operators inherit this design rather than choose it, which means most of them are one disk-fillup away from the same incident.

Investigation jobs like this resolve the same way most of the time. Recover what can be recovered. Write up the root cause. Propose short-term fixes that close the contributing failures. Offer a re-architecture for the team that wants to stop having this incident on quarterly rotation. The interesting work is in the third and fourth parts. The first two are mechanics. The fourth is where the actual long-term cost of a 1990s design gets accounted for, and where the new shape stops being theoretical.

A SOC pipeline lost three hours of one client's syslog overnight and nobody noticed until morning. The disk fillup that triggered it was not the root cause: the architecture had no way to surface its own degradation.

What happened

A misbehaving application filled /var on the sensorbox at several gigabytes a minute. The daemon stayed up but could not persist: its queue backed up and the kernel began dropping UDP packets. Logrotate freed disk hours later and the box self-recovered, which is why nobody noticed. We were brought in to recover the gap, find the cause, and propose changes.

Getting the data back

UDP promises nothing: no retry, no ack, no sequence number, so dropped packets are gone. But most hosts also wrote the same records to local files, so recovery meant re-reading and re-shipping them per host. We pinned the window (02:28 to 05:42 UTC) and pushed a one-shot rsyslog config to each host to filter its local file to the window and forward over TCP. Three details made it land cleanly:

stamp events from the message body, not receive time, or the replay lands as a spike today while the gap stays empty;
tag every replayed record (replay=true, batch id) so analysts and rollups can tell backfill from live;
turn on content-hash dedup, since a generous window means some events land twice.

About 99% came back; the misses were hosts that rotated logs faster and containers that never touched disk.

The root cause is not the disk

Five known anti-patterns lined up. UDP transport made the loss unrecoverable. Monitoring never named the partition that filled, and could not have acted fast enough regardless. The queue had a memory bound but no disk spillover. Diagnostic logs lived on the same /var, so the daemon could not record its own failure. And nothing alerted on absent data: zero events from one client looked identical to a quiet one.

Each failure alone is survivable. Together they produce silent, permanent data loss with no detection until a human notices.

The fixes

Five short-term fixes, each closing one failure, none a redesign: a disk alert covering every real filesystem regardless of mountpoint; a separate spool so the daemon can log when its queue fills; bounded queues that spill to disk then block rather than drop; senders moved to TCP or RELP; and a per-source absent-data alert.

The longer answer inverts the shape: rather than a SOC sensorbox holding the client's data on finite disk, the client writes to an immutable bucket they own and the SOC reads through an anonymisation boundary. The client owns retention, billing, and liability, and replay is a cursor rewind, not per-host log slicing.

Sources

RFC 5424, The Syslog Protocol (2009): the message format and the embedded timestamp field the replay keys off.
RFC 6587, Transmission of Syslog Messages over TCP (2012): the framed TCP transport used for the reliable replay.
RFC 3195, Reliable Delivery for syslog (2001), and RELP: reliable, acknowledged log delivery, the right transport for any sender that supports it.
rsyslog documentation on disk-assisted queues (queue.type, queue.maxdiskspace) and imfile: the spillover and file-replay mechanics this recovery relied on.
S3 Object Lock, GCS Bucket Lock, and Azure Blob immutability policies: the retention-enforced object storage behind the client-owned-bucket design.