Alert volume outpaces analyst review every quarter. The queue grows, the tail rots, and on-call burns out triaging duplicates. This post covers an LLM-as-a-judge pipeline we put in front of the alert queue that gives us a triage signal fast enough to keep up.
Why not just human review
Senior analyst review is still the source of truth, but it's slow and expensive at L1 volumes. To shorten the queue we need a triage signal in seconds, not minutes. LLM judges give us that throughput; analyst review stays in the loop as calibration and audit.
Designing the judge
- Rubrics are explicit: severity grounding, contextual fit (does the alert match host posture / known business hours / change windows), freshness, dedup against open incidents.
- The judge returns per-rubric scores plus a short rationale. The rationale is what we actually inspect on disagreements.
- Judges are pairwise by default. Absolute severity ratings are too noisy at the volume we have.
A pairwise judge call looks roughly like this:
prompt = f"""
You are comparing two security alerts.
Rubric: severity_grounding, contextual_fit, freshness, dedup.
Context: {host_context}
Open incidents: {open_incidents}
A: {alert_a}
B: {alert_b}
Return JSON: {{winner, per_rubric, rationale}}.
"""
verdict = judge_model.generate(prompt, schema=VerdictSchema)
Calibration
We periodically sample judge decisions and compare them to senior analyst decisions on the same alerts. The delta is the judge's trust signal; when it drifts we retrain or swap prompts. The disagreement set itself becomes a high-value review queue: it's where the analysts learn the most about both the judge and the underlying detection.
What it changed
Triage capacity is up materially without an increase in false negatives on the ground-truth set. The on-call rotation got quieter, and the alerts that reach an analyst are visibly the right ones.