Alert volume outpaces analyst review every quarter. The queue grows, the tail rots, and on-call burns out triaging duplicates. This post covers an LLM-as-a-judge pipeline we put in front of the alert queue that gives us a triage signal fast enough to keep up.

Why not just human review?

Senior analyst review is still the source of truth, but it's slow and expensive at L1 volumes. To shorten the queue we need a triage signal in seconds, not minutes. LLM judges give us that throughput; analyst review stays in the loop as calibration and audit.

Designing the judge

A pairwise judge call looks roughly like this:

prompt = f"""
You are comparing two security alerts.
Rubric: severity_grounding, contextual_fit, freshness, dedup.

Context: {host_context}
Open incidents: {open_incidents}
A: {alert_a}
B: {alert_b}

Return JSON: {{winner, per_rubric, rationale}}.
"""
verdict = judge_model.generate(prompt, schema=VerdictSchema)

[Figure: Alert A and Alert B go into the pairwise judge, which returns a verdict plus rationale.]

Pairwise comparisons reduce the noise of absolute single-sample severity scoring.
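
Before a verdict touches the queue it is worth checking that the judge actually returned what the prompt asked for. The sketch below is one way to do that, not our production schema: the `Verdict` dataclass, the `parse_verdict` helper, and the "A" / "B" / "tie" encoding for per-rubric results are all assumptions made for illustration. Only the three field names and the four rubric items come from the prompt above.

```python
import json
from dataclasses import dataclass

# Rubric items taken from the prompt; everything else here is illustrative.
RUBRIC = ("severity_grounding", "contextual_fit", "freshness", "dedup")
ALLOWED_RUBRIC_VALUES = ("A", "B", "tie")  # assumed encoding, not from the post


@dataclass(frozen=True)
class Verdict:
    winner: str                 # "A" or "B"
    per_rubric: dict[str, str]  # rubric item -> "A", "B", or "tie"
    rationale: str


def parse_verdict(raw: str) -> Verdict:
    """Parse the judge's JSON reply, raising ValueError if it is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")

    winner = data.get("winner")
    if winner not in ("A", "B"):
        raise ValueError(f"unexpected winner: {winner!r}")

    per_rubric = data.get("per_rubric")
    if not isinstance(per_rubric, dict) or set(per_rubric) != set(RUBRIC):
        raise ValueError("per_rubric must contain exactly the rubric items")
    for item, value in per_rubric.items():
        if value not in ALLOWED_RUBRIC_VALUES:
            raise ValueError(f"unexpected value for {item}: {value!r}")

    rationale = data.get("rationale")
    if not isinstance(rationale, str) or not rationale.strip():
        raise ValueError("rationale must be a non-empty string")

    return Verdict(winner=winner, per_rubric=dict(per_rubric), rationale=rationale)
```

Rejecting a malformed reply outright, rather than guessing at a winner, keeps bad judge output from silently reordering the queue; a rejected pair can be retried or routed to an analyst.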

Calibration

We periodically sample judge decisions and compare them to senior analyst decisions on the same alerts. The gap between the two is the judge's trust signal; when it drifts, we retrain or swap prompts. The disagreement set itself becomes a high-value review queue: it's where the analysts learn the most about both the judge and the underlying detection.
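
The calibration loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline itself: decisions are assumed to be keyed by a pair ID and stored as "A" or "B", and the names `calibrate`, `has_drifted`, `CalibrationReport`, and the 0.05 tolerance are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalibrationReport:
    sampled: int               # pairs reviewed by both judge and analyst
    agreement_rate: float      # fraction of sampled pairs where they agree
    disagreements: list[str]   # pair IDs to send to the review queue


def calibrate(judge: dict[str, str], analyst: dict[str, str]) -> CalibrationReport:
    """Compare judge and analyst winners on the pairs both have decided."""
    shared = sorted(judge.keys() & analyst.keys())
    if not shared:
        raise ValueError("no overlapping decisions to compare")
    disagreements = [pair_id for pair_id in shared if judge[pair_id] != analyst[pair_id]]
    agreement_rate = 1 - len(disagreements) / len(shared)
    return CalibrationReport(len(shared), agreement_rate, disagreements)


def has_drifted(report: CalibrationReport, baseline_rate: float,
                tolerance: float = 0.05) -> bool:
    """Flag drift when agreement falls more than `tolerance` below the baseline."""
    return report.agreement_rate < baseline_rate - tolerance


# Example with made-up decisions: the judge and analyst differ on one of four pairs.
judge_decisions = {"p1": "A", "p2": "B", "p3": "A", "p4": "B"}
analyst_decisions = {"p1": "A", "p2": "A", "p3": "A", "p4": "B"}
report = calibrate(judge_decisions, analyst_decisions)
print(report.agreement_rate, report.disagreements)
```

Raw agreement is the simplest possible trust signal. If one outcome dominates the sample, a chance-corrected measure may be a better fit, and the tolerance should be set from how much the rate moves between samples when nothing has changed.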

What it changed

Triage capacity is up materially without an increase in false negatives on the ground-truth set. The on-call rotation got quieter, and the alerts that reach an analyst are visibly the right ones.