LLM-as-a-Judge for Alert Triage

Alert volume outpaces analyst review every quarter. The queue grows, the tail rots, and on-call burns out triaging duplicates. We put an LLM-as-a-judge pipeline in front of the queue: a triage signal fast enough to keep up, with senior review kept where it still pays. Here is the design and what it actually changed.

Why not just human review

Senior analyst review is still the source of truth, but it is slow and expensive at L1 volumes. To shorten the queue we need a triage signal in seconds, not minutes. The judge gives us that throughput. Analyst review does not disappear: it moves to where it earns its cost, calibrating the judge and auditing its disagreements.

The judge does not replace the analyst. It changes what the analyst spends attention on: not the queue, but the judge that drains it.

Designing the judge

Three choices carry the design, and each is a defence against a known failure of LLM scoring:

Rubrics are explicit: severity grounding, contextual fit (does the alert match host posture, known business hours, change windows), freshness, dedup against open incidents.
The judge returns per-rubric scores plus a short rationale. The rationale is what we actually inspect on disagreements, so the score is auditable, not a bare number.
Judges are pairwise by default. Absolute severity ratings drift and cluster at the volume we have; "is A worse than B" is a steadier question than "rate A from one to ten".

A pairwise judge call looks roughly like this:

prompt = f"""
You are comparing two security alerts.
Rubric: severity_grounding, contextual_fit, freshness, dedup.

Context: {host_context}
Open incidents: {open_incidents}
A: {alert_a}
B: {alert_b}

Return JSON: {{winner, per_rubric, rationale}}.
"""
verdict = judge_model.generate(prompt, schema=VerdictSchema)

Pairwise comparisons reduce the noise of absolute single-sample severity scoring.

Calibration

We periodically sample judge decisions and compare them to senior analyst decisions on the same alerts. The delta is the judge's trust signal: when it drifts, we retrain or swap prompts. The disagreement set itself becomes a high-value review queue. It is where the analysts learn the most about both the judge and the underlying detection.

Decision	Old path	New path
Bulk triage	analyst reads every alert	judge ranks; analyst reads the top
Severity	absolute rating, noisy	pairwise verdict plus rationale
Trust check	spot-check, ad hoc	sampled judge-versus-analyst delta
Analyst time	spent on the queue	spent on disagreements

What it changed

Triage capacity is up materially without an increase in false negatives on the ground-truth set. The on-call rotation got quieter, and the alerts that reach an analyst are visibly the right ones. The judge bought throughput; calibration is what keeps that throughput trustworthy.

Alert volume grows faster than analysts can review it, so the queue rots and on-call burns out on duplicates. We put an LLM judge in front of the queue to produce a triage signal in seconds, and kept senior review for calibration and audit.

The judge is pairwise, not absolute. Asking "is A worse than B" is steadier than rating one alert from one to ten, which drifts and clusters at volume. The judge scores each alert against explicit rubrics and returns a short rationale, so a decision is auditable rather than a bare number.

The rubrics are:

severity grounding;
contextual fit (host posture, business hours, change windows);
freshness;
dedup against open incidents.

The judge drains the queue. The analyst now audits the judge, not the alerts.

Calibration keeps it honest

We sample judge decisions and compare them to senior analyst decisions on the same alerts. That delta is the trust signal: when it drifts, we retrain or swap prompts. The disagreement set becomes its own review queue, the highest-value alerts to look at by hand.

Result: triage capacity up, no rise in false negatives on the ground-truth set, a quieter rotation. The judge bought throughput; calibration is what keeps it trustworthy.

Sources

Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (2023): pairwise comparison is more reliable than absolute scoring, and judges carry position and verbosity biases worth controlling for.
The companion piece, human in the loop is theater: why a watcher over a green stream decays into rubber-stamping, and why calibration on disagreements beats passive oversight.

LLM-as-a-judge for security alert triage