Optimising recommendations with the JDK Vector API

Recommendation ranking spends most of its compute on dot products: billions of them per request, against models with millions of items. The JDK's Vector API (still an incubator module, jdk.incubator.vector) exposes SIMD lanes from the JVM, letting us write hot paths that compile down to the CPU's widest vector instructions without dropping into native code.

What the Vector API buys us

A single portable loop replaces platform-specific intrinsics.
The runtime picks the widest available lane count for the host CPU (AVX-512, AVX2, or a narrower fallback).
We keep memory layout and allocation under our control, with no JNI hops.

The inner loop of a float dot product ends up looking like this:

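// Compile and run with: --add-modules jdk.incubator.vector
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Widest float species the current CPU supports.
static final VectorSpecies<Float> S = FloatVector.SPECIES_PREFERRED;

float dot(float[] a, float[] b) {
    if (a.length != b.length) throw new IllegalArgumentException("length mismatch");
    FloatVector acc = FloatVector.zero(S);
    int i = 0;
    int upper = S.loopBound(a.length); // largest multiple of S.length() <= a.length
    for (; i < upper; i += S.length()) {
        var va = FloatVector.fromArray(S, a, i);
        var vb = FloatVector.fromArray(S, b, i);
        acc = va.fma(vb, acc); // acc = va * vb + acc, lane by lane
    }
    float sum = acc.reduceLanes(VectorOperators.ADD); // horizontal add across lanes
    for (; i < a.length; i++) sum += a[i] * b[i];     // scalar tail
    return sum;
}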

On AVX-512 hardware, one SIMD iteration covers what would have been 16 scalar iterations; on AVX2 it covers 8.
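
The number of scalar iterations folded into one SIMD iteration is the register width divided by the 32 bits of a float, and whatever does not fill a whole vector falls through to the scalar tail. The sketch below works that arithmetic out in plain Java, without touching the Vector API itself. The class name LaneMath, its helper methods, and the array length of 1000 are assumptions made for illustration; the loopBound helper only mirrors the rounding-down that S.loopBound performs in the loop above.

```java
final class LaneMath {
    /** Float lanes in a vector register of the given width in bits. */
    static int lanes(int vectorBits) {
        return vectorBits / Float.SIZE; // Float.SIZE is 32
    }

    /** Largest multiple of lanes that is <= length (same rounding as loopBound). */
    static int loopBound(int length, int lanes) {
        return length - (length % lanes);
    }

    /** Number of full-width SIMD iterations the main loop executes. */
    static int vectorIterations(int length, int lanes) {
        return loopBound(length, lanes) / lanes;
    }

    /** Elements left over for the scalar tail loop. */
    static int tailElements(int length, int lanes) {
        return length - loopBound(length, lanes);
    }

    public static void main(String[] args) {
        int length = 1000; // hypothetical array length, chosen for illustration
        for (int bits : new int[] {512, 256, 128}) {
            int n = lanes(bits);
            System.out.printf("%d-bit: %d lanes, %d vector iterations, %d tail elements%n",
                    bits, n, vectorIterations(length, n), tailElements(length, n));
        }
    }
}
```

For a 1000-element array this gives 62 vector iterations plus 8 tail elements at 512 bits, and 125 vector iterations with no tail at 256 bits, so the tail loop only runs when the length is not a multiple of the lane count.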

Measured impact

On our ranking tier we measured a 2.5 to 3x speedup on the dot-product hot path, with p99 latency dropping by roughly 40%. That translated into fewer ranking replicas per region at the same SLO.

Caveats

Vector API code is easy to get wrong: alignment and tail handling are the usual traps. We built a small microbenchmark suite into the CI gate so that any future change to the loop has to prove it doesn't regress.
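
A performance gate is most useful next to a correctness check, since a loop that drops its tail can still benchmark well. What follows is a minimal sketch of such a check, not the suite described above: it compares any dot-product implementation against a scalar reference for every array length from 0 up to a bound, so lengths on either side of each lane boundary are exercised. The names DotProductCheck, Dot, scalarDot and firstMismatch, along with the tolerance and seed values, are hypothetical.

```java
import java.util.Random;

final class DotProductCheck {
    /** Any dot-product implementation under test. */
    interface Dot {
        float apply(float[] a, float[] b);
    }

    /** Scalar reference, accumulated in double to limit rounding error. */
    static float scalarDot(float[] a, float[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += (double) a[i] * b[i];
        return (float) sum;
    }

    /**
     * Returns the first length in [0, maxLength] at which impl differs from
     * the reference by more than the relative tolerance, or -1 if none does.
     */
    static int firstMismatch(Dot impl, int maxLength, float tolerance, long seed) {
        Random rnd = new Random(seed);
        for (int n = 0; n <= maxLength; n++) {
            float[] a = new float[n];
            float[] b = new float[n];
            for (int i = 0; i < n; i++) {
                a[i] = rnd.nextFloat() - 0.5f;
                b[i] = rnd.nextFloat() - 0.5f;
            }
            float expected = scalarDot(a, b);
            float actual = impl.apply(a, b);
            float allowed = tolerance * Math.max(1.0f, Math.abs(expected));
            if (Math.abs(expected - actual) > allowed) return n;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Stand-in candidate; in practice pass a reference to the vectorised method.
        Dot candidate = DotProductCheck::scalarDot;
        int bad = firstMismatch(candidate, 67, 1e-4f, 42L);
        System.out.println(bad < 0 ? "all lengths match" : "mismatch at length " + bad);
    }
}
```

Sweeping every length up to a few multiples of the widest lane count (67 covers four 16-lane vectors plus a tail) is a cheap way to catch off-by-one errors in the loop bound. A relative tolerance is used because FMA and lane-wise accumulation round differently from a sequential scalar sum.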
