Recommendation ranking spends most of its compute on dot products: billions of them per request, against models with millions of items. The JDK's Vector API (still an incubator module, jdk.incubator.vector, enabled with --add-modules jdk.incubator.vector) exposes SIMD lanes from the JVM, letting us write hot paths that the JIT can compile down to the CPU's widest vector instructions without dropping into native code.

What the Vector API buys us

The inner loop of a float dot product ends up looking like this:

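import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Widest species the current CPU supports (16 float lanes with 512-bit vectors).
static final VectorSpecies<Float> S = FloatVector.SPECIES_PREFERRED;

float dot(float[] a, float[] b) {
    if (a.length != b.length) {
        throw new IllegalArgumentException("length mismatch");
    }
    FloatVector acc = FloatVector.zero(S);
    int i = 0;
    // Largest multiple of S.length() that is <= a.length.
    int upper = S.loopBound(a.length);
    for (; i < upper; i += S.length()) {
        var va = FloatVector.fromArray(S, a, i);
        var vb = FloatVector.fromArray(S, b, i);
        acc = va.fma(vb, acc);  // acc = va * vb + acc, lane by lane
    }
    // Collapse the per-lane partial sums into one scalar.
    float sum = acc.reduceLanes(VectorOperators.ADD);
    // Scalar tail for the elements left over after the last full vector.
    for (; i < a.length; i++) sum += a[i] * b[i];
    return sum;
}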
Figure: a scalar loop (1 float per cycle) compared with the SIMD loop using the Vector API (16 floats per cycle), followed by the scalar tail. One SIMD iteration covers what would have been 16 scalar iterations.
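
To make the split between the vector body and the scalar tail concrete, here is a small sketch of the arithmetic that S.loopBound performs. It assumes 16 float lanes (a 512-bit species), which may not match your hardware. LoopSplit and its methods are hypothetical names used only for illustration; they are not part of the Vector API, and the array length of 1000 is an arbitrary example rather than a figure from our models.

```java
// Hypothetical helper, for illustration only: mirrors the arithmetic that
// VectorSpecies.loopBound applies, without needing the incubator module.
final class LoopSplit {
    // Largest multiple of `lanes` that is <= length.
    static int loopBound(int length, int lanes) {
        return length - (length % lanes);
    }

    // Number of full-vector iterations of the main loop.
    static int vectorIterations(int length, int lanes) {
        return length / lanes;
    }

    // Elements left over for the scalar tail loop.
    static int tailElements(int length, int lanes) {
        return length % lanes;
    }

    public static void main(String[] args) {
        int length = 1000;  // assumed example length
        int lanes = 16;     // assumed: 512-bit vectors of 32-bit floats
        System.out.printf("bound=%d vectorIters=%d tail=%d%n",
                loopBound(length, lanes),
                vectorIterations(length, lanes),
                tailElements(length, lanes));
    }
}
```

With 16 lanes, a 1000-element array runs the main loop 62 times and leaves 8 elements for the tail. With 8 lanes the same array takes 125 iterations and has no tail at all, which is why tail bugs can hide on one machine and surface on another.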

Measured impact

On our ranking tier we measured a 2.5–3x speedup on the dot-product hot path, with p99 latency dropping by roughly 40%. That translated into fewer ranking replicas per region at the same SLO.

Caveats

Vector API code is easy to get wrong: alignment and tail handling are the usual traps. We added a small microbenchmark suite to the CI gate so that any future change to the loop has to prove it doesn't regress.
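
Alongside timing, a gate like this could also check correctness by comparing the vectorized kernel against a plain scalar loop at lengths that straddle the lane count, where tail bugs tend to hide. The sketch below is an assumption about what such a check might look like, not the suite we actually run: DotKernel, DotCheck, the lane count passed in, and the tolerance are all hypothetical. It uses no Vector API types, so the kernel under test is passed in as a lambda or method reference.

```java
import java.util.Random;

// Hypothetical interface: any dot-product implementation, vectorized or not.
interface DotKernel {
    float dot(float[] a, float[] b);
}

// Hypothetical correctness check for a CI gate; a sketch, not our real suite.
final class DotCheck {
    // Scalar reference, accumulated in double to limit rounding error.
    static double reference(float[] a, float[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += (double) a[i] * b[i];
        return sum;
    }

    // Returns the first length at which the kernel disagrees with the
    // reference, or -1 if every tested length agrees within tolerance.
    static int firstMismatch(DotKernel kernel, int lanes, double relTol) {
        Random rnd = new Random(42);  // fixed seed keeps the check reproducible
        // Lengths 0 .. 4*lanes+1 cover empty input, short inputs, exact
        // multiples of the lane count, and every possible tail size.
        for (int len = 0; len <= 4 * lanes + 1; len++) {
            float[] a = new float[len];
            float[] b = new float[len];
            for (int i = 0; i < len; i++) {
                a[i] = rnd.nextFloat();
                b[i] = rnd.nextFloat();
            }
            double expected = reference(a, b);
            double actual = kernel.dot(a, b);
            double tol = relTol * Math.max(1.0, Math.abs(expected));
            if (Math.abs(actual - expected) > tol) return len;
        }
        return -1;
    }
}
```

A call such as DotCheck.firstMismatch(kernels::dot, 16, 1e-4), where kernels::dot stands in for the vectorized method, returns -1 when every length from 0 to 65 agrees. The tolerance is relative rather than exact because FMA and lane-wise accumulation round differently from a sequential scalar sum.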