Cross-source incident search

An incident always crosses sources. The first alert lands in the SIEM, the next clue is a span ID in the traces, the smoking gun is a process tree in the EDR, and the attribution lives in a threat-intel platform. The analyst is the integration layer, reconciling four tools by hand, and that hand-off is where most incidents lose their first minutes.

The pattern we wanted to break

Open four tabs. Search the same hostname four different ways, each with its own query language. Reconcile by eyeball. Lose context every time you switch. The work is not the analysis; it is the carrying of a fact from one console to the next, and it is paid in the minutes that matter most.

The analyst is the integration layer, and an integration layer made of tab-switching is the slowest one you can build.

What we built

One query box over all four telemetry stores, with three moving parts:

A federated retrieval layer that fans a single text query out to all four telemetry indices in parallel.
A cross-encoder re-ranker that merges the results with knowledge of provenance and recency.
A per-source fallback path, so a slow index degrades to recency-only instead of stalling the whole response.

Query understanding

Queries get rewritten into structured sub-intents: host, actor, window, indicator. Each sub-intent biases retrieval toward the index most likely to answer it, so a hostname leans on the logs and an actor name leans on threat-intel, without the analyst choosing a tool first.
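To make the rewriting step concrete, here is a minimal sketch of how a free-text query might be split into the four sub-intents and turned into per-index weights. Everything in it is an assumption for illustration: the regexes, the `KNOWN_ACTORS` list, the index names, and the weight values are hypothetical, and the mapping from indicator to endpoint and intel is a guess; only the host-to-logs and actor-to-intel leanings come from the text above.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class SubIntents:
    host: Optional[str] = None
    actor: Optional[str] = None
    window: Optional[str] = None
    indicator: Optional[str] = None


# Hypothetical patterns and lookup list, for illustration only.
KNOWN_ACTORS = {"apt29", "fin7"}
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
WINDOW_RE = re.compile(r"\blast\s+(\d+)\s*(m|h|d)\b")
HOST_RE = re.compile(r"\b[a-z]+(?:-[a-z0-9]+)+\b")

# Assumed bias table: which index each sub-intent leans on, and how hard.
INDEX_BIAS = {
    "host": {"logs": 2.0},
    "actor": {"intel": 2.0},
    "indicator": {"endpoint": 1.5, "intel": 1.5},
}


def parse(text: str) -> SubIntents:
    """Pull whichever sub-intents the query mentions; the rest stay None."""
    t = text.lower()
    intents = SubIntents()
    if (m := IP_RE.search(t)):
        intents.indicator = m.group(0)
    if (m := WINDOW_RE.search(t)):
        intents.window = m.group(1) + m.group(2)
    if (m := HOST_RE.search(t)):
        intents.host = m.group(0)
    for token in t.split():
        if token in KNOWN_ACTORS:
            intents.actor = token
    return intents


def index_weights(intents: SubIntents) -> dict:
    """Start every index at 1.0 and boost the ones a present sub-intent favours."""
    weights = {"logs": 1.0, "traces": 1.0, "endpoint": 1.0, "intel": 1.0}
    for name, boosts in INDEX_BIAS.items():
        if getattr(intents, name) is not None:
            for index_name, factor in boosts.items():
                weights[index_name] *= factor
    return weights
```

A query such as "web-042 last 2h" would parse to a host and a window, and only the logs weight would rise. The weights bias retrieval rather than gate it, so every index is still searched.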

Four indices fanned out in parallel, merged by a cross-encoder that knows about source and recency.

The retrieval loop is short:

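def search(text, k=200, top_n=20):
    # Encode the query once; the same vector goes to every index.
    q = query_encoder(text)
    # Pull k candidates per index and merge them into one pool.
    cands = union(
        log_index.knn(q, k=k),
        trace_index.knn(q, k=k),
        endpoint_index.knn(q, k=k),
        intel_index.knn(q, k=k),
    )
    # Re-rank the pool against the raw query text and keep the top results.
    return cross_encoder.rank(text, cands)[:top_n]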
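The loop above reads sequentially, but the fan-out is meant to run in parallel. A minimal sketch of that step with a thread pool follows. `StubIndex` and `fan_out` are hypothetical names, each index is assumed to expose a blocking `knn(q, k)` that returns `(doc_id, payload)` pairs, and tagging each hit with its source is an assumption about how provenance might be carried to the re-ranker.

```python
from concurrent.futures import ThreadPoolExecutor


class StubIndex:
    """Hypothetical stand-in for a telemetry index exposing knn(q, k)."""

    def __init__(self, hits):
        self.hits = hits  # list of (doc_id, payload) pairs

    def knn(self, q, k):
        return self.hits[:k]


def fan_out(q, indices, k=200):
    """Query every index concurrently and merge the hits, tagged by source."""
    with ThreadPoolExecutor(max_workers=len(indices)) as pool:
        # Submit all searches before waiting on any of them.
        futures = {name: pool.submit(idx.knn, q, k) for name, idx in indices.items()}
        merged = {}
        for name, future in futures.items():
            for doc_id, payload in future.result():
                # Key on (source, doc_id) so ids from different stores never collide.
                merged.setdefault(
                    (name, doc_id),
                    {"source": name, "id": doc_id, "payload": payload},
                )
    return list(merged.values())
```

Keying on the source as well as the id keeps a log event and an intel record with the same id as two separate candidates, while a repeat from the same store collapses to one.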

Latency budget

Analysts will tolerate roughly 800ms for first results, so we spent the budget deliberately:

Stage	Budget	If it overruns
Fan-out	200ms	drop the slowest index from this query
Per-index retrieval	400ms	fall back to recency-only for that index
Re-rank	200ms	return the union unranked rather than stall

Each index can bail out cheaply if it cannot meet its budget, so one slow store never holds the whole response hostage.
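One way to enforce the per-index row of the table is a timeout around each retrieval call, with a cheap recency query as the fallback. The sketch below assumes an async `knn` and a synchronous, hypothetical `most_recent(k)` method on each index; neither name comes from the system described here, and `StubIndex` exists only to make the example runnable.

```python
import asyncio

RETRIEVAL_BUDGET_S = 0.400  # per-index retrieval budget from the table above


class StubIndex:
    """Hypothetical index whose knn call takes delay_s seconds to answer."""

    def __init__(self, delay_s, hits, recent):
        self.delay_s = delay_s
        self.hits = hits
        self.recent = recent

    async def knn(self, q, k):
        await asyncio.sleep(self.delay_s)
        return self.hits[:k]

    def most_recent(self, k):
        # Assumed to be cheap: newest rows, no vector search.
        return self.recent[:k]


async def retrieve_with_fallback(index, q, k=200, budget_s=RETRIEVAL_BUDGET_S):
    """Return (hits, mode); mode records whether the index met its budget."""
    try:
        hits = await asyncio.wait_for(index.knn(q, k), timeout=budget_s)
        return hits, "ranked"
    except asyncio.TimeoutError:
        # The slow search is cancelled; serve recency-only for this index.
        return index.most_recent(k), "recency-only"


async def search_all(indices, q, budget_s=RETRIEVAL_BUDGET_S):
    """Run every index under its own timeout so none can block the others."""
    results = await asyncio.gather(
        *(retrieve_with_fallback(idx, q, budget_s=budget_s) for idx in indices.values())
    )
    return dict(zip(indices, results))
```

Returning the mode alongside the hits lets the caller tell the analyst which sources were degraded for this query.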

Results with provenance

Every result row renders a small badge for its source, the index recency, and the rank score. Without that, analysts read "high-rank" as "trusted" and skip evidence from the lower-recency sources, which is often exactly where the quiet part of an incident hides.
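As a rough illustration of what such a badge could carry, the sketch below models a result row and formats its provenance. The `ResultRow` fields, the `badge` function, and the output format are all hypothetical, and "index recency" is read here as the time since the row was indexed, which is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ResultRow:
    title: str
    source: str           # e.g. "SIEM", "traces", "EDR", "intel"
    indexed_at: datetime  # assumed meaning of "index recency"
    rank_score: float     # score from the re-ranker


def badge(row: ResultRow, now: datetime) -> str:
    """Format source, recency, and score so none is shown without the others."""
    minutes = int((now - row.indexed_at).total_seconds() // 60)
    recency = f"{minutes}m ago" if minutes < 60 else f"{minutes // 60}h ago"
    return f"[{row.source} | indexed {recency} | score {row.rank_score:.2f}]"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
row = ResultRow("powershell.exe spawned by winword.exe", "EDR",
                now - timedelta(minutes=90), 0.8731)
print(badge(row, now))  # [EDR | indexed 1h ago | score 0.87]
```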

A rank score without provenance teaches analysts to trust the loudest source, not the right one.

An incident crosses four tools: the SIEM alerts, the traces hold a span ID, the EDR has the process tree, threat-intel has the attribution. Today the analyst reconciles them by hand, four tabs and four query languages, and that hand-off burns the first minutes of every incident.

What it does. One text query box over all four telemetry stores. The query is rewritten into sub-intents (host, actor, window, indicator), each biasing retrieval toward the index that answers it best. Three parts carry it:

federated retrieval: fan one query out to all four indices in parallel;
a cross-encoder re-ranker that merges results knowing source and recency;
a per-source fallback, so a slow index degrades to recency-only instead of stalling.

One slow store never holds the whole response hostage; it degrades instead.

Latency and provenance

The budget is 800ms, split into 200ms for fan-out, 400ms for per-index retrieval, and 200ms for re-rank. Any stage that overruns degrades rather than blocks: fan-out drops the slowest index, retrieval falls back to recency-only for that index, and re-rank returns the union unranked.

Every row shows its source. A badge carries source, index recency, and rank score. Skip that and analysts read high-rank as trusted, then ignore the lower-recency evidence where the quiet part of an incident often lives.

Sources

Cross-encoder re-ranking: a transformer scores the query and each candidate jointly, which is more accurate but slower than the bi-encoder used for first-stage retrieval. It is the standard second stage in retrieve-then-rank pipelines.
Approximate nearest-neighbour search (HNSW and friends): the index structure behind each store's knn call, trading a little recall to stay within the latency budget above.