The brief
A SOC pipeline went dark. A sensorbox handling UDP syslog for one client had filled its /var overnight; by morning the disk was healthy again, log rotation having caught up, and the system was processing events normally. The gap, a little over three hours of one client's stream, was real and silent. Nothing in the pipeline could recover it from where it was. We weren't part of the original design. We were called in after the fact, with three deliverables on the table: recover what could be recovered, write up the root cause, and propose what should change so this stops happening.
This is the trip report. It generalizes. UDP-syslog outages are a category rather than a one-off, and investigation jobs like this resolve the same way most of the time.
The investigation
The first hour was diagnostic. Three questions: how much was lost, when exactly the loss started and ended, and what state the box had been in during the outage.
/var was at 64% by the time we logged in, comfortably clear of trouble. The journal told the story:
journalctl -u rsyslog --since "2 days ago" \
| grep -iE "no space|enospc|queue|dropped"
df -h /var
ls -la /var/log/ | head
The fillup had happened around 02:30 UTC. The daemon stayed up but couldn't persist new messages. Writes failed with ENOSPC, the action queue backed up to its memory bound, and within minutes the kernel was dropping UDP packets at the socket. Recovery began at 05:40 when logrotate's daily run trimmed enough older content to free disk; the daemon flushed its queue and resumed normal operation. No human action was involved. The system had self-recovered, which is part of why nobody noticed.
We pulled the existing monitoring config while we were there. Prometheus alerts existed for the root filesystem and for the named application volumes, with thresholds at 85% and 95%. /var was on its own partition under the operator's standard build, and no rule explicitly named it. The monitoring had been working as configured for years; it had simply never been asked about /var. Even if it had been, the fillup was fast enough (a misbehaving application started producing several gigabytes per minute around 02:25) that a paged engineer probably could not have logged in and cleaned up before the daemon stopped persisting. Both layers of "this should have been caught" turned out to be wishful thinking under the actual incident's timing.
That gave the outage window: 02:30 to 05:40 UTC, three hours and ten minutes. The sender side was a fleet, not a single box. Dozens of application hosts emitted via UDP to our sensorbox while also writing locally to their own log files. The OS socket on each host was the only buffer between the application and the wire, and that buffer was small. Packets emitted during the outage were transmitted, dropped at our receiver, and never seen on any sender's network stack again.
The local files on each host were a different story. Most of the fleet retained the original records in their native paths: /var/log/syslog, application-specific files, the usual layout. The forwarding had failed. The local writing had not. Recovery was going to be possible, but it was going to be per-host, slicing each machine's log file separately, rather than one cut from a central archive.
Phase 1: pulling the data back
UDP syslog is a contract that doesn't promise delivery. The sender writes the packet, hands it to the kernel, and never hears back. There is no retry, no acknowledgement, no sequence number. Packets dropped during the outage are erased from the network path. The only recovery available is to read the equivalent records from wherever else they happen to exist (in this case, the local log files on each origin host) and ship them again, in a way that lands correctly the second time.
Step one was to pin the window to the second. We gave the client 02:28 UTC as the lower bound (a couple of minutes before the first failure in our logs) and 05:42 UTC as the upper bound (a couple of minutes after the daemon's first successful write post-recovery). Off by a minute either way is fine. Off by an hour produces duplicates or gaps the team will fight about later.
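The window arithmetic is simple enough to write down. A minimal sketch, assuming a placeholder calendar date (only the times of day come from the incident) and the two-minute margin described above:

```python
from datetime import datetime, timedelta, timezone

# Observed boundaries from the receiver's journal. The calendar date is a
# placeholder; only the times of day are taken from the incident.
first_failure = datetime(2026, 1, 15, 2, 30, tzinfo=timezone.utc)
first_good_write = datetime(2026, 1, 15, 5, 40, tzinfo=timezone.utc)

PAD = timedelta(minutes=2)  # safety margin on each side of the outage

window_start = first_failure - PAD
window_end = first_good_write + PAD

outage = first_good_write - first_failure
replay_span = window_end - window_start

print(window_start.isoformat(), window_end.isoformat())
print("outage:", outage, "replay window:", replay_span)
```

The padded window is four minutes longer than the outage itself, which is where the duplicate events handled in the deduplication step come from.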
Step two was choosing a transport. With the source data spread across dozens of hosts rather than centralised in one archive, the choice was effectively between two shapes: pull or push. Pull meant SCP'ing the time-windowed log slice from every host to a single staging area on our side and feeding it through a file source. Push meant running a small one-shot forwarder on each host that read the local file, filtered by timestamp, and shipped over a reliable transport. We went with push for three reasons. The fleet already had a configuration-management agent we could use to drop a temporary config; pulling would have needed credentialed access to every host. Push isolated each host's progress: one stuck transfer didn't block the others. And push let us use the sender's own clock as the timestamp source, which mattered for the next step.
The shape of the push:
- On each origin host, a temporary rsyslog config with an imfile source pointed at the local log file, with a startmsg.regex matching syslog line headers and a readMode appropriate for the file format.
- A filter rule that dropped any line whose embedded timestamp fell outside the 02:28-05:42 UTC window.
- An action sending to a dedicated TCP port (RFC 6587) on the sensorbox, with the replay=true tag added to each message.
- A one-shot lifecycle: the config was deployed, the daemon restarted to load it (rsyslog does not re-read its configuration on SIGHUP), the window flushed, the config removed.
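The real forwarder was an rsyslog config, not a script, but the logic of the filter and the framing can be sketched in a few lines. This is an illustration under stated assumptions: each line is assumed to start with an RFC 3339 timestamp carrying an explicit offset (traditional syslog files often have neither year nor zone), the calendar date is a placeholder, the function names are ours, and appending the tag as trailing text is a simplification of however the tag is attached in a given deployment.

```python
from datetime import datetime, timezone

# Placeholder date; the times of day are the replay window from the text.
WINDOW_START = datetime(2026, 1, 15, 2, 28, tzinfo=timezone.utc)
WINDOW_END = datetime(2026, 1, 15, 5, 42, tzinfo=timezone.utc)

def in_window(line: str) -> bool:
    """True if the line's leading RFC 3339 timestamp falls inside the window."""
    stamp = line.split(" ", 1)[0]
    try:
        ts = datetime.fromisoformat(stamp)
    except ValueError:
        return False  # unparseable header: skip rather than guess
    if ts.tzinfo is None:
        return False  # no zone information: cannot compare safely
    return WINDOW_START <= ts <= WINDOW_END

def frame(message: str) -> bytes:
    """RFC 6587 octet-counting: '<length> <message>', length counted in bytes."""
    payload = message.encode("utf-8")
    return str(len(payload)).encode("ascii") + b" " + payload

def replay_frames(lines, batch_id):
    """Yield framed, tagged messages for every line inside the window."""
    for line in lines:
        line = line.rstrip("\n")
        if in_window(line):
            yield frame(f"{line} replay=true replay_batch={batch_id}")

sample = [
    "2026-01-15T02:27:59+00:00 host1 app: before the window",
    "2026-01-15T02:31:07+00:00 host1 app: lost during the outage",
    "2026-01-15T05:43:00+00:00 host1 app: after the window",
]
for framed in replay_frames(sample, "INC-1"):
    print(framed)
```

Octet counting matters on TCP because a stream has no packet boundaries: the length prefix tells the receiver where one message ends and the next begins, even if a message contains a newline.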
Other options we considered and discarded: telling each host to UDP-replay over the original port (re-creates the failure mode that started this), forwarding through the existing application-level syslog relay (would have re-mixed the streams and lost the per-host attribution we needed for dedup), and bulk file ingest via the SIEM's direct API (would have required onboarding every host's authentication, when the sensorbox already had the SIEM credential).
Step three was timestamps. Replayed messages arrive today but represent yesterday's events. By default, most syslog-to-SIEM pipelines stamp ingestion time onto every record. Without configuration changes, the replay would land as a spike today and the dashboards would still show a gap during the actual outage window. The fix is to extract the timestamp from the message body, not from the receive time. In rsyslog this is the difference between $timegenerated and $timereported; we wanted $timereported. The SIEM-side parser was already pointing at the embedded timestamp field, which made the downstream a non-issue. Other deployments are not always so lucky.
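The difference is easy to see with two records. A small sketch, with hypothetical field names ("reported" for the timestamp embedded in the message, "received" for ingestion time) and placeholder dates:

```python
from collections import Counter
from datetime import datetime, timezone

def hourly_counts(records, field):
    """Count records per UTC hour using the chosen timestamp field."""
    counts = Counter()
    for rec in records:
        counts[rec[field].strftime("%Y-%m-%d %H:00")] += 1
    return counts

utc = timezone.utc
# Two events from the outage window, both received later during the replay.
records = [
    {"reported": datetime(2026, 1, 15, 3, 5, tzinfo=utc),
     "received": datetime(2026, 1, 16, 10, 1, tzinfo=utc)},
    {"reported": datetime(2026, 1, 15, 3, 40, tzinfo=utc),
     "received": datetime(2026, 1, 16, 10, 1, tzinfo=utc)},
]

print(hourly_counts(records, "received"))  # one spike in the replay hour
print(hourly_counts(records, "reported"))  # lands in the original gap
```

Bucketing on the received time reproduces the failure described above: the dashboard shows a burst at replay time and still shows nothing for the outage hours.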
Step four was tagging. Every replayed record got replay=true and replay_batch=<incident-id> at ingest. Two reasons. Analysts looking at events in the recovered window needed to know what was fresh observation versus backfill. And the morning's dashboards needed to filter the replay batch out of "events in the last hour" rollups so the metrics didn't look like a SOC traffic spike.
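In code terms the tagging amounts to two small operations, one at ingest and one at query time. A sketch with hypothetical function names and dictionary-shaped events; the field names are the ones given above:

```python
def tag_replay(event: dict, batch_id: str) -> dict:
    """Return a copy of the event marked as backfill for a given incident."""
    return {**event, "replay": "true", "replay_batch": batch_id}

def live_events(events):
    """Drop backfilled records so 'recent activity' rollups stay honest."""
    return [e for e in events if e.get("replay") != "true"]

fresh = {"msg": "login ok"}
backfill = tag_replay({"msg": "login failed"}, "INC-1")

print(len(live_events([fresh, backfill])))
```

Keeping the batch identifier alongside the boolean means a later incident's replay can be told apart from this one.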
Step five was deduplication. We were conservative on the window, so some events would land twice. We turned on content-hash plus original-timestamp dedup at the SIEM-side ingest for the duration of the replay, then turned it off. Permanent dedup is a different and harder project.
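The dedup key is the part worth spelling out. A minimal sketch, assuming events are available as (original timestamp, message body) pairs; the SIEM's own dedup feature will look different, and the function names here are ours:

```python
import hashlib

def dedup_key(original_ts: str, body: str) -> str:
    """Key on the event's own timestamp plus a hash of its content."""
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return f"{original_ts}|{digest}"

def dedup(events):
    """Yield each (timestamp, body) pair once, keeping first-seen order."""
    seen = set()
    for ts, body in events:
        key = dedup_key(ts, body)
        if key not in seen:
            seen.add(key)
            yield ts, body

events = [
    ("2026-01-15T02:29:10Z", "sshd: accepted publickey"),
    ("2026-01-15T02:29:10Z", "sshd: accepted publickey"),  # live copy + replayed copy
    ("2026-01-15T02:29:11Z", "sshd: accepted publickey"),  # same text, different second
]
print(list(dedup(events)))
```

Two genuinely distinct events with identical text in the same second collapse into one under this key, which is one reason to run it only for the duration of a replay.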
The replay ran clean for most hosts. About 99% of the missing events arrived. The shortfall came from a handful of machines that had cycled their local logs faster than expected (a different rotation policy than the rest of the fleet) and from a small number of containerised services that never wrote to disk at all; those streams had only ever existed on the wire, and the wire had dropped them. We documented the affected sources, recommended a fleet-wide audit of local-retention settings, and moved on.
Root cause: it's not the disk
The /var fillup is what triggered the loss. It is not the root cause. The root cause is that the architecture had no path to surface its own degradation. Five separate things had to be wrong simultaneously for three hours of data to vanish silently overnight, and each was a known anti-pattern with a known fix.
- UDP between the sender and the sensorbox. The whole class of incident is structurally impossible with a reliable transport. The deploy was UDP because the original setup, years ago, was UDP, and nobody had revisited.
- Monitoring missed the partition that filled, and would not have saved it anyway. Disk alerts existed on the sensorbox, but the rules were per-mountpoint and /var sat on its own partition, which no rule explicitly named. Even with correct coverage, the fillup rate (several gigabytes per minute from a misbehaving application) was faster than any plausible human response. Lead-time monitoring is necessary but insufficient when the surface that fills can fill faster than a paged engineer can act.
- The action queue had a memory bound but no spillover. When the disk filled, the queue backed up to its memory limit and then started dropping silently. It could have spilled to a different filesystem, but it wasn't configured to.
- Daemon diagnostic logs lived on /var. When /var filled, the daemon couldn't write the log line that said "my disk filled up." Forensic reconstruction was harder than it should have been.
- No downstream alert on absent data. The SIEM had three hours of zero events from one client and didn't notice. There was no "expected minimum events per hour per source" rule. A silently absent stream looks identical to a quiet one.
Each failure alone is survivable. Together they produce silent, permanent data loss that goes undetected until a human notices something off the next morning.
Phase 2: short-term fixes to the current architecture
Five fixes, each small, each individually justified, each closing one of the five contributing failures above. None require redesigning anything. They can be applied to the existing system this week.
Disk monitoring that doesn't depend on knowing the layout. Per-mountpoint rules are how this incident slipped through. Replace them with a rule that covers every real filesystem on the host, regardless of where it's mounted:
- alert: HostFilesystemFilling
  expr: |
    node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs|devtmpfs"}
      / node_filesystem_size_bytes{fstype!~"tmpfs|overlay|squashfs|devtmpfs"} < 0.15
  for: 10m
  labels: { severity: warning }
  annotations:
    summary: "{{ $labels.instance }} {{ $labels.mountpoint }} below 15% free"
A second copy at 0.05 / severity: critical. Page on the critical, ticket the warning. The change here is the absence of a hardcoded mountpoint: any non-pseudo filesystem on any host that drops below threshold fires the alert, no matter where the operator's build script put it.
Monitoring is necessary and won't be sufficient. A pageable alert at 85% gives an engineer maybe forty minutes if the fillup is steady, and zero minutes if the fillup is a runaway application writing gigabytes per minute. The role of monitoring is to catch the slow fills; the fast ones have to be handled by something that doesn't require a human in the loop. That something is the next fix.
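The lead-time arithmetic behind that claim, as a sketch. The partition size and both fill rates are hypothetical numbers chosen for illustration; only the 85% threshold and the "gigabytes per minute" order of magnitude come from the incident.

```python
def minutes_until_full(free_bytes: float, fill_bytes_per_min: float) -> float:
    """Time left before the filesystem is full at a constant fill rate."""
    if fill_bytes_per_min <= 0:
        return float("inf")
    return free_bytes / fill_bytes_per_min

GIB = 1024 ** 3
partition = 200 * GIB              # hypothetical size of /var
free_at_alert = 0.15 * partition   # the alert threshold: 85% used, 15% free

steady = minutes_until_full(free_at_alert, 0.75 * GIB)  # hypothetical slow fill
runaway = minutes_until_full(free_at_alert, 5 * GIB)    # gigabytes per minute

print(f"steady fill: {steady:.0f} min of lead time")
print(f"runaway fill: {runaway:.0f} min of lead time")
```

With these numbers the runaway case fills the remaining space in less time than the rule's own `for: 10m` hold, so the alert would not even fire before the disk was full.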
Separate the syslog spool from the daemon's own logs. Mount a dedicated filesystem at /var/spool/syslog (or wherever the daemon's queue lives), keep the daemon's diagnostic logs on the root volume. When the spool fills, the daemon can still complain in writing about it.
Bounded queues with explicit overflow behaviour. In rsyslog: queue.type="LinkedList", queue.filename="...", queue.maxdiskspace="5G", action.resumeRetryCount="-1". The queue spills to disk when memory fills. When both fill, the daemon holds new messages back for as long as queue.timeoutEnqueue allows rather than dropping them immediately and silently. Senders see backpressure (if they're on TCP); UDP senders still lose data, but the operator can at least see the queue saturation in monitoring.
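The behaviour we are asking for, stripped of rsyslog specifics, looks like the sketch below. It is a toy model, not rsyslog's implementation: the class name, the in-memory stand-in for the disk spool, and the return values are all ours.

```python
from collections import deque

class SpillQueue:
    """In-memory queue that overflows to a bounded disk spool (simulated here)."""

    def __init__(self, max_mem_items: int, max_disk_bytes: int):
        self.mem = deque()
        self.disk = deque()  # stands in for on-disk queue files
        self.max_mem_items = max_mem_items
        self.max_disk_bytes = max_disk_bytes
        self.disk_bytes = 0

    def offer(self, msg: bytes) -> str:
        """Return where the message went: 'memory', 'disk' or 'refused'."""
        # Only use memory while nothing is spooled, so FIFO order is preserved.
        if not self.disk and len(self.mem) < self.max_mem_items:
            self.mem.append(msg)
            return "memory"
        if self.disk_bytes + len(msg) <= self.max_disk_bytes:
            self.disk.append(msg)
            self.disk_bytes += len(msg)
            return "disk"
        return "refused"  # explicit signal: the caller must wait and retry

    def take(self):
        """Pop the oldest message, or None if the queue is empty."""
        if self.mem:
            return self.mem.popleft()
        if self.disk:
            msg = self.disk.popleft()
            self.disk_bytes -= len(msg)
            return msg
        return None

q = SpillQueue(max_mem_items=2, max_disk_bytes=10)
print([q.offer(m) for m in (b"a", b"b", b"cccc", b"dddddd", b"e")])
```

The point of the third return value is that saturation becomes an observable state. A counter on "refused" is something a monitoring rule can watch, which a silent drop never is.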
Migrate senders to TCP or RELP. RELP (Reliable Event Logging Protocol) is the right answer for any sender that supports it: messages stay on the sender until acknowledged, retries are automatic, the sender survives the receiver restarting. Plain TCP is the right answer for everything else. Audit every syslog source, flip them in stages.
Absent-data alerting downstream. For each client + source pair, compute a rolling minimum events-per-hour baseline. Alert when traffic falls more than 50% below baseline for more than 15 minutes. This is the alert that would have caught the outage at 02:45, not at 09:00 the next morning.
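A sketch of the rule's logic, assuming per-minute event counts for one client and source pair and a baseline already expressed in events per minute. How the baseline is computed is left out, and the function name and parameters are ours.

```python
def absent_data_alert(per_minute_counts, baseline_per_minute,
                      drop_fraction=0.5, sustain_minutes=15):
    """Return the index of the minute at which the alert fires, or None.

    Fires once traffic has stayed below (1 - drop_fraction) * baseline
    for more than `sustain_minutes` consecutive minutes.
    """
    threshold = baseline_per_minute * (1 - drop_fraction)
    low_streak = 0
    for minute, count in enumerate(per_minute_counts):
        low_streak = low_streak + 1 if count < threshold else 0
        if low_streak > sustain_minutes:
            return minute
    return None

# Minute 0 is 02:00 UTC: thirty normal minutes, then the stream goes silent.
counts = [100] * 30 + [0] * 60
print(absent_data_alert(counts, baseline_per_minute=100))
```

Replaying the incident's shape through it, with silence starting at minute 30 (02:30), the alert fires at minute 45, which is the 02:45 quoted above. A ten-minute dip that recovers never trips it.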
We left the team with a runbook, the Prometheus rules, and an audit table showing every sender, its current transport, and whether an upgrade path existed.
Phase 3: the re-architecture, if anyone wants it
The five fixes above harden the existing design. They don't change its shape. The shape itself is worth saying out loud: a single sensorbox terminating UDP syslog and forwarding to a SIEM is a 1990s pattern that survived into 2026 because nobody had reason to revisit it. The cost is the incident we just investigated. The benefit is mostly inertia.
If we were rebuilding from scratch, the data plane would look different.
The wrong instinct is to keep the SOC as the system of record and just move the transport to a cloud queue (Azure Service Bus, AWS SQS, GCP Pub/Sub). That improves transport reliability but preserves the problem the original design had with custody: the SOC ends up holding the client's data, owning its retention, paying for its compliance footprint, being the entity an auditor or a data-subject-access request comes to. Queues at impressive volumes can be made to work, but they don't change the politics, and at higher volumes their per-message economics get worse than the alternative.
The right instinct is to invert the custody relationship.
The client writes to an immutable bucket they control. Object storage with retention enforcement: S3 Object Lock, GCS Bucket Lock, Azure Blob immutability policies. The bucket lives in the client's cloud account, billed to the client, governed by the client's retention contract. Throughput is no longer a concern about message-queue capacity; object storage handles arbitrary log volume natively. Compliance stops being an awkward conversation about where the SOC stores customer data, because the SOC doesn't.
The bucket is part of the client's contract. This is the structural change. The original architecture made the SOC the system of record. The client's logs flowed in and were held by the SOC for whatever retention the SOC's SIEM happened to enforce. The bucket architecture makes the client the system of record. The SOC is a downstream consumer with read-only access to the analytical surface it needs. The shift sounds bureaucratic and is actually structural: it changes who is liable for what, who has to comply with which regulations, and who a data-access request goes to.
The SOC reads through an anonymization interface. Between the client's bucket and the SOC's SIEM is a defined boundary that strips, hashes, or generalises whatever fields are out of scope for SOC analysis. The interface is contractual: the client says "you may see these fields raw, these fields hashed, these fields not at all," and the interface enforces it. The SIEM only ever sees the result. Auditing the interface is auditing the data-sharing contract. This is the same pattern that confidential computing, customer-managed encryption keys, and similar zero-trust designs converge on: the boundary is the artefact, not the trust.
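A minimal sketch of such an interface, assuming dictionary-shaped records and a per-client policy that maps each field to "raw" or "hash"; anything the policy does not name is dropped. The policy contents, field names, and key handling are hypothetical, and a keyed hash (HMAC) is our choice for the hashed fields, not something the contract language above prescribes.

```python
import hashlib
import hmac

# Hypothetical per-client policy: field name -> "raw" or "hash".
# Fields not listed here never leave the client's side of the boundary.
POLICY = {
    "timestamp": "raw",
    "event_type": "raw",
    "src_ip": "hash",
    "username": "hash",
}

def anonymize(record: dict, policy: dict, key: bytes) -> dict:
    """Apply the data-sharing policy to one record."""
    out = {}
    for field, value in record.items():
        rule = policy.get(field)  # unnamed fields fall through and are dropped
        if rule == "raw":
            out[field] = value
        elif rule == "hash":
            out[field] = hmac.new(key, str(value).encode("utf-8"),
                                  hashlib.sha256).hexdigest()
    return out

record = {
    "timestamp": "2026-01-15T03:05:00Z",
    "event_type": "login_failed",
    "src_ip": "203.0.113.7",
    "username": "alice",
    "email": "alice@example.com",
}
print(anonymize(record, POLICY, key=b"client-held-secret"))
```

Because the hash is keyed and deterministic, the SOC can still correlate events from the same address or user without seeing the raw value, and the key stays with the client. Defaulting to drop means a new field added upstream is invisible until the contract says otherwise.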
Replay is a re-read of the bucket. A processing outage on the SOC side recovers by rewinding the bucket cursor and re-fetching. No coordination with the client, no asking for archives, no per-host log slicing. The incident we just investigated would have been a one-line bucket-scan, against immutable data the client retains for the contracted period whether the SOC pulled it or not.
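What that scan looks like depends on the key layout. The sketch below assumes a hypothetical layout in which each object key is a client prefix followed by a sortable UTC minute stamp, and uses a plain list of keys in place of a real bucket listing call; the function name is ours.

```python
def replay_keys(object_keys, prefix, start_stamp, end_stamp):
    """Return keys under `prefix` whose minute stamp lies in [start, end].

    Assumes keys look like '<prefix><YYYY-MM-DDTHH-MM><anything>', so the
    stamps sort lexicographically in time order.
    """
    width = len(start_stamp)
    selected = []
    for key in sorted(object_keys):
        if not key.startswith(prefix):
            continue
        stamp = key[len(prefix):len(prefix) + width]
        if start_stamp <= stamp <= end_stamp:
            selected.append(key)
    return selected

bucket = [
    "client-x/2026-01-15T02-27.jsonl",
    "client-x/2026-01-15T02-28.jsonl",
    "client-x/2026-01-15T04-00.jsonl",
    "client-x/2026-01-15T05-43.jsonl",
    "client-y/2026-01-15T03-00.jsonl",
]
print(replay_keys(bucket, "client-x/", "2026-01-15T02-28", "2026-01-15T05-42"))
```

The same window that took a per-host forwarder deployment to replay becomes a key-range selection, and re-running it is harmless because the objects are immutable.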
The sensorbox shrinks or disappears. When the client's existing systems already write to cloud object storage (most modern applications can), nothing is needed on the SOC side beyond the bucket reader and the anonymization interface. When the client's systems emit only syslog (legacy applications), a small shim on the client side translates syslog to bucket writes. The failure mode that started this incident now lives behind the contractual boundary, in infrastructure the client owns and operates.
Per-customer separation is built in. Each client has their own bucket, their own anonymization interface, their own retention. Multi-tenancy is enforced at the boundary, not at the application layer. The priority-inversion patterns that show up in shared SOAR pipelines do not show up here because there is no shared layer to invert.
Observability is the bucket's exhaust. Object-storage providers emit write rates, object counts, age of latest object, bytes written per minute. Dashboards build themselves. The absent-data alert from Phase 2 becomes "no objects written to client X's bucket for fifteen minutes," which the provider's own monitoring tells you about without any custom rules.
At SOC-realistic volumes (single-digit gigabytes per day per client at the low end, low-terabytes per day per client at the high end), object-storage costs run double-digit to low-three-digit dollars per month per customer, paid out of the client's account because the bucket is theirs. The SOC's own cost becomes the anonymization interface and the SIEM-side consumer, both well-bounded by design and neither scaling with archive size. The sensorbox host that the existing architecture relies on probably costs more than the entire SOC-side bill of the new design.
The handover
We left the team with three artifacts. A recovery runbook for the next time this happens, in case the re-architecture takes a while. The five short-term fixes, with config snippets, Prometheus rules, and an audit table. An architecture proposal for the rebuild, with cost estimates and a migration sequence that lets the existing system keep running until the new one parity-validates.
None of the long-term changes were urgent. All of them were obvious once the diagnosis was on paper. The team agreed to move on the short-term fixes inside a sprint and put the re-architecture on the next quarter's planning. Without it, the next /var-full incident is a matter of when, not if.
What this kind of work looks like in general
UDP-syslog outages are a category. The pattern is identical across deployments: the sender is UDP, the receiver has a single point of failure with finite disk, the architecture has no path to surface its own degradation, the loss is silent until a human notices something downstream. Most operators inherit this design rather than choose it, which means most of them are one disk-fillup away from the same incident.
Investigation jobs like this resolve the same way most of the time. Recover what can be recovered. Write up the root cause. Propose short-term fixes that close the contributing failures. Offer a re-architecture for the team that wants to stop having this incident on quarterly rotation. The interesting work is in the third and fourth parts. The first two are mechanics. The fourth is where the actual long-term cost of a 1990s design gets accounted for, and where the new shape stops being theoretical.