Most of the interesting work on a production LLM happens after pre-training: supervised fine-tuning, preference alignment, domain adaptation, and the tooling that holds it all together. This post describes how we scaled our post-training stack from a single-node prototype to a cluster that runs multiple alignment experiments a day.

What changed

A recipe is just YAML; the scheduler and trainer are the same for every job:

name: sft-catalog-assistant-v4
base_model: base-8b
data:
  train: s3://datasets/catalog-assistant/train
  eval:  s3://datasets/catalog-assistant/eval
objective: sft
schedule:
  lr: 2.0e-5
  warmup_steps: 200
  max_steps: 4000
  precision: bf16
eval:
  benchmarks: [catalog_qa, synopsis_faithfulness]
  cadence: every_500_steps

Figure: Post-training pipeline (Recipe, Data prep, Training, Eval, Ship). Each stage is independent and cacheable, and every stage is parallelized against cached artifacts.
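
One way to picture "independent and cacheable" is a cache key derived from only the slice of the recipe a stage actually reads, plus the keys of its upstream stages. The sketch below is illustrative rather than the production code: `stage_cache_key`, the stage names, and the choice of which recipe fields feed each stage are all assumptions made for the example.

```python
import hashlib
import json


def stage_cache_key(stage: str, inputs: dict, upstream_keys: tuple = ()) -> str:
    """Derive a deterministic key from a stage name, its inputs, and upstream keys."""
    payload = json.dumps(
        {"stage": stage, "inputs": inputs, "upstream": list(upstream_keys)},
        sort_keys=True,  # stable ordering so equal configs hash identically
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# The recipe from above, written as a plain dict to keep the sketch stdlib-only.
recipe = {
    "name": "sft-catalog-assistant-v4",
    "base_model": "base-8b",
    "data": {
        "train": "s3://datasets/catalog-assistant/train",
        "eval": "s3://datasets/catalog-assistant/eval",
    },
    "objective": "sft",
    "schedule": {"lr": 2.0e-5, "warmup_steps": 200, "max_steps": 4000, "precision": "bf16"},
    "eval": {"benchmarks": ["catalog_qa", "synopsis_faithfulness"], "cadence": "every_500_steps"},
}


def keys_for(r: dict) -> tuple:
    """Return (data_prep_key, training_key) for a recipe (hypothetical field split)."""
    data_key = stage_cache_key("data_prep", r["data"])
    train_inputs = {
        "base_model": r["base_model"],
        "objective": r["objective"],
        "schedule": r["schedule"],
    }
    train_key = stage_cache_key("training", train_inputs, upstream_keys=(data_key,))
    return data_key, train_key


data_key, train_key = keys_for(recipe)

# Changing only the learning rate leaves the data-prep key untouched,
# so a cached data-prep artifact could be reused; the training key changes.
tweaked = {**recipe, "schedule": {**recipe["schedule"], "lr": 1.0e-5}}
tweaked_data_key, tweaked_train_key = keys_for(tweaked)
```

Because the training key includes the data-prep key, a change to the dataset paths would invalidate both stages, while a schedule-only change invalidates training alone.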

Where the wins came from

Most of the speedup came not from GPU kernels but from removing serial bottlenecks in the surrounding pipeline. Data prep, eval, and checkpoint syncing were each blocking the critical path; parallelizing them cut wall-clock time by more than half.
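
As a rough illustration of taking eval and checkpoint syncing off the critical path, the sketch below hands both to a background thread pool so the training loop can move on to the next chunk of steps. It is a minimal sketch, not our trainer: `train_steps`, `run_eval`, and `sync_checkpoint` are hypothetical stand-ins, and a real system would evaluate from a saved snapshot rather than a bare step number.

```python
from concurrent.futures import ThreadPoolExecutor


def train_steps(start: int, end: int) -> int:
    """Hypothetical stand-in for running optimizer steps start..end."""
    return end


def run_eval(step: int) -> dict:
    """Hypothetical stand-in for running benchmarks on the checkpoint at `step`."""
    return {"step": step}


def sync_checkpoint(step: int) -> str:
    """Hypothetical stand-in for uploading a checkpoint to remote storage."""
    return f"checkpoint-{step}"


def run_training(max_steps: int, eval_every: int) -> list:
    pending = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        step = 0
        while step < max_steps:
            step = train_steps(step, min(step + eval_every, max_steps))
            # Submit side work and return to training immediately
            # instead of blocking on it.
            pending.append(pool.submit(sync_checkpoint, step))
            pending.append(pool.submit(run_eval, step))
    # Leaving the with-block waits for any background work still running.
    return [future.result() for future in pending]


results = run_training(max_steps=4000, eval_every=500)
```

With the recipe's `max_steps: 4000` and an eval every 500 steps, this queues eight syncs and eight evals, none of which stall the loop.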

What we're still figuring out

Reward model drift across experiments is our biggest open problem. Small shifts in reference data can cause optimization to chase different objectives, and detecting that early is an active area of work.
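
To make the problem concrete, one possible early check (a sketch only, not something we have settled on) is to score a fixed probe set with two reward models and compare how they rank the same responses. The example computes a Spearman rank correlation with the standard library; `rank_agreement`, the probe scores, and `DRIFT_THRESHOLD` are all made up for illustration.

```python
from statistics import mean


def ranks(values: list) -> list:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: list, y: list) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def rank_agreement(scores_a: list, scores_b: list) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(scores_a), ranks(scores_b))


DRIFT_THRESHOLD = 0.9  # hypothetical cutoff, would need tuning per project

# Made-up scores from two reward models on the same five probe responses.
baseline = [0.1, 0.4, 0.35, 0.8, 0.6]
candidate = [0.2, 0.5, 0.45, 0.9, 0.7]

agreement = rank_agreement(baseline, candidate)
drifted = agreement < DRIFT_THRESHOLD
```

A rank-based comparison ignores shifts in score scale and only flags cases where the two models disagree about which responses are better, which is the kind of change that would send optimization after a different objective.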