Most of the interesting work on a production LLM happens after pre-training: supervised fine-tuning, preference alignment, domain adaptation, and the tooling that holds it all together. This post describes how we scaled our post-training stack from a single-node prototype to a cluster that runs multiple alignment experiments a day.
What changed
- We moved from per-experiment bespoke scripts to a declarative recipe format that captures data, objective, schedule, and evaluation in one file.
- We introduced dataset de-duplication and caching layers so the same training shards don't get re-tokenized every run.
- We added elastic scheduling so smaller jobs can preempt idle nodes between big runs.
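To make the de-duplication and caching bullet concrete, here is a minimal sketch of a content-addressed tokenization cache. It is an illustration under stated assumptions, not the production implementation: `toy_tokenize`, the JSON-on-disk cache layout, and the `tokenizer_id` string are all hypothetical stand-ins, and the de-duplication shown is exact-match only.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def shard_cache_key(shard_bytes: bytes, tokenizer_id: str) -> str:
    """Key on shard content plus tokenizer, so a change to either invalidates the entry."""
    h = hashlib.sha256()
    h.update(tokenizer_id.encode("utf-8"))
    h.update(b"\0")  # separator so the two fields cannot run together
    h.update(shard_bytes)
    return h.hexdigest()


def toy_tokenize(text: str) -> list:
    # Hypothetical placeholder tokenizer: each whitespace-separated word maps to its length.
    return [len(word) for word in text.split()]


def tokenize_cached(shard_bytes: bytes, tokenizer_id: str, cache_dir: Path, stats: dict) -> list:
    """Return cached token ids if this (shard, tokenizer) pair was seen before."""
    path = cache_dir / f"{shard_cache_key(shard_bytes, tokenizer_id)}.json"
    if path.exists():
        stats["hits"] += 1
        return json.loads(path.read_text())
    stats["misses"] += 1
    tokens = toy_tokenize(shard_bytes.decode("utf-8"))
    path.write_text(json.dumps(tokens))
    return tokens


def dedup_lines(lines):
    """Exact-match de-duplication that keeps the first occurrence of each line."""
    seen = set()
    unique = []
    for line in lines:
        digest = hashlib.sha256(line.strip().encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(line)
    return unique
```

Keying on the content hash rather than the shard path means a renamed or copied shard still hits the cache, while a tokenizer change forces a fresh tokenization pass.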
A recipe is just YAML; the scheduler and trainer are the same for every job:
name: sft-catalog-assistant-v4
base_model: base-8b
data:
  train: s3://datasets/catalog-assistant/train
  eval: s3://datasets/catalog-assistant/eval
objective: sft
schedule:
  lr: 2.0e-5
  warmup_steps: 200
  max_steps: 4000
  precision: bf16
eval:
  benchmarks: [catalog_qa, synopsis_faithfulness]
  cadence: every_500_steps
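One way a trainer could consume such a recipe is to load it into typed objects and fail fast on malformed values. The sketch below is hypothetical: the `Recipe` and `Schedule` classes, `load_recipe`, and the specific sanity checks are illustrative and not part of the stack described here. It also assumes a nesting the post does not spell out (`train` and `eval` under `data`; `lr`, `warmup_steps`, `max_steps`, and `precision` under `schedule`; `benchmarks` and `cadence` under `eval`) and takes an already-parsed dictionary, such as a YAML parser would produce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schedule:
    lr: float
    warmup_steps: int
    max_steps: int
    precision: str


@dataclass(frozen=True)
class Recipe:
    name: str
    base_model: str
    train_uri: str
    eval_uri: str
    objective: str
    schedule: Schedule
    benchmarks: tuple
    eval_every: int  # evaluation interval in training steps


def parse_cadence(cadence: str) -> int:
    """Turn a string like 'every_500_steps' into the integer 500."""
    prefix, suffix = "every_", "_steps"
    if not (cadence.startswith(prefix) and cadence.endswith(suffix)):
        raise ValueError(f"unrecognized cadence: {cadence!r}")
    return int(cadence[len(prefix):-len(suffix)])


def load_recipe(raw: dict) -> Recipe:
    """Build a Recipe from a parsed recipe dictionary, with basic sanity checks."""
    sched = raw["schedule"]
    schedule = Schedule(
        lr=float(sched["lr"]),
        warmup_steps=int(sched["warmup_steps"]),
        max_steps=int(sched["max_steps"]),
        precision=str(sched["precision"]),
    )
    # Illustrative checks only; a real validator would likely enforce more.
    if schedule.lr <= 0:
        raise ValueError("lr must be positive")
    if schedule.warmup_steps >= schedule.max_steps:
        raise ValueError("warmup_steps must be smaller than max_steps")
    return Recipe(
        name=raw["name"],
        base_model=raw["base_model"],
        train_uri=raw["data"]["train"],
        eval_uri=raw["data"]["eval"],
        objective=raw["objective"],
        schedule=schedule,
        benchmarks=tuple(raw["eval"]["benchmarks"]),
        eval_every=parse_cadence(raw["eval"]["cadence"]),
    )


# The recipe from the post, written as the dictionary a YAML parser would return
# (nesting is assumed, as noted above).
EXAMPLE = {
    "name": "sft-catalog-assistant-v4",
    "base_model": "base-8b",
    "data": {
        "train": "s3://datasets/catalog-assistant/train",
        "eval": "s3://datasets/catalog-assistant/eval",
    },
    "objective": "sft",
    "schedule": {"lr": 2.0e-5, "warmup_steps": 200, "max_steps": 4000, "precision": "bf16"},
    "eval": {"benchmarks": ["catalog_qa", "synopsis_faithfulness"], "cadence": "every_500_steps"},
}

recipe = load_recipe(EXAMPLE)
```

Validating at load time means a typo in a recipe surfaces before any nodes are allocated rather than partway through a run.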
Where the wins came from
Most of the speedup didn't come from GPU kernels; it came from removing serial bottlenecks in the surrounding pipeline. Data prep, eval, and checkpoint syncing were each blocking the critical path, and parallelizing them cut wall-clock time by more than half.
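The general pattern of taking side work off the critical path can be sketched with a thread pool. This is a toy model under loud assumptions: `train_step`, `run_eval`, and `sync_checkpoint` are hypothetical placeholders that only sleep, the durations are made up, and a real trainer would need to snapshot model state before handing it to a background task.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def train_step(step):
    time.sleep(0.001)  # stand-in for one optimizer step


def run_eval(step):
    time.sleep(0.02)  # stand-in for a benchmark pass
    return ("eval", step)


def sync_checkpoint(step):
    time.sleep(0.02)  # stand-in for uploading a checkpoint
    return ("sync", step)


def train(max_steps, every, pool=None):
    """Run the loop. With a pool, eval and checkpoint sync are handed off
    to worker threads instead of blocking the next training step."""
    pending, results = [], []
    for step in range(1, max_steps + 1):
        train_step(step)
        if step % every == 0:
            for side_task in (run_eval, sync_checkpoint):
                if pool is None:
                    results.append(side_task(step))  # serial: loop waits here
                else:
                    pending.append(pool.submit(side_task, step))  # returns immediately
    # Collect background results in submission order once training is done.
    results.extend(future.result() for future in pending)
    return results
```

Calling `train(6, 3)` runs everything serially, while `train(6, 3, pool)` inside a `with ThreadPoolExecutor(max_workers=2) as pool:` block produces the same results with the side tasks overlapping the training steps.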
What we're still figuring out
Reward model drift across experiments is our biggest open problem. Small shifts in reference data can cause optimization to chase different objectives, and detecting that early is an active area of work.
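One possible early-warning check, offered only as a sketch and not as the approach used here, is to score a frozen probe set with both the previous and the new reward model and compare the two score lists. The function names, the choice of metrics (mean shift in baseline standard deviations, and Spearman rank correlation), and any alert thresholds are assumptions for illustration.

```python
import statistics


def ranks(values):
    """Average ranks (1-based), with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def drift_report(baseline_scores, candidate_scores):
    """Compare two reward models' scores on the same probe examples."""
    if len(baseline_scores) != len(candidate_scores):
        raise ValueError("both models must score the same probe set")
    shift = statistics.fmean(candidate_scores) - statistics.fmean(baseline_scores)
    return {
        # How far the average reward moved, in baseline standard deviations.
        "mean_shift_sd": shift / statistics.stdev(baseline_scores),
        # Spearman correlation: do the models still order the examples alike?
        "rank_corr": pearson(ranks(baseline_scores), ranks(candidate_scores)),
    }
```

A uniform offset moves `mean_shift_sd` but leaves `rank_corr` at 1, whereas a falling `rank_corr` suggests the two models prefer different outputs, which is closer to the "chasing different objectives" failure described above.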