Scaling LLM Post-Training

Most of the interesting work on a production LLM happens after pre-training: supervised fine-tuning, preference alignment, domain adaptation, and the tooling that holds it all together. This post describes how we scaled our post-training stack from a single-node prototype to a cluster that runs several alignment experiments a day.

What changed

We moved from per-experiment bespoke scripts to a declarative recipe format that captures data, objective, schedule, and evaluation in one file.
We introduced dataset de-duplication and caching layers so the same training shards don't get re-tokenized every run.
We added elastic scheduling so smaller jobs can preempt idle nodes between big runs.
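The de-duplication and caching change can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the platform's code: `TokenCache`, `shard_key`, `dedupe`, and `toy_tokenize` are hypothetical names, the cache is a local directory rather than shared storage, and de-duplication is exact-match only.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def dedupe(records):
    """Drop exact duplicate records, keeping first-seen order."""
    seen = set()
    unique = []
    for text in records:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def shard_key(payload: bytes, tokenizer_id: str) -> str:
    """Cache key covers both the data and the tokenizer that processed it."""
    h = hashlib.sha256()
    h.update(tokenizer_id.encode("utf-8"))
    h.update(b"\0")
    h.update(payload)
    return h.hexdigest()


def toy_tokenize(text):
    """Stand-in tokenizer (assumption): one integer per whitespace-separated word."""
    return [len(word) for word in text.split()]


class TokenCache:
    """Hypothetical on-disk cache of tokenized shards, keyed by content hash."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)
        self.misses = 0  # counts how often tokenization actually ran

    def get_or_tokenize(self, records, tokenizer_id, tokenize):
        unique = dedupe(records)
        payload = json.dumps(unique).encode("utf-8")
        path = self.root / f"{shard_key(payload, tokenizer_id)}.json"
        if path.exists():
            return json.loads(path.read_text())
        self.misses += 1
        tokens = [tokenize(text) for text in unique]
        path.write_text(json.dumps(tokens))
        return tokens


cache = TokenCache(Path(tempfile.mkdtemp()))
shard = ["red shoes size 9", "blue hat", "red shoes size 9"]
first = cache.get_or_tokenize(shard, "toy-v1", toy_tokenize)
second = cache.get_or_tokenize(shard, "toy-v1", toy_tokenize)  # served from cache
```

Including the tokenizer identifier in the key means a tokenizer change invalidates old entries instead of silently reusing them.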

The thing that scaled was not the model. It was everything around the model that used to be a person running a script.

A recipe is just YAML; the scheduler and trainer are the same for every job:

name: sft-catalog-assistant-v4
base_model: base-8b
data:
  train: s3://datasets/catalog-assistant/train
  eval:  s3://datasets/catalog-assistant/eval
objective: sft
schedule:
  lr: 2.0e-5
  warmup_steps: 200
  max_steps: 4000
  precision: bf16
eval:
  benchmarks: [catalog_qa, synopsis_faithfulness]
  cadence: every_500_steps
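Because every job shares this shape, a recipe can be checked before it reaches the scheduler. The sketch below is one way that could look, not the platform's loader: `Recipe`, `Schedule`, `load_recipe`, and `parse_cadence` are hypothetical names, and the recipe is assumed to have already been parsed from YAML into a plain mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schedule:
    lr: float
    warmup_steps: int
    max_steps: int
    precision: str


@dataclass(frozen=True)
class Recipe:
    name: str
    base_model: str
    train_uri: str
    eval_uri: str
    objective: str
    schedule: Schedule
    benchmarks: tuple
    eval_every: int


def parse_cadence(cadence: str) -> int:
    """Turn 'every_500_steps' into 500; reject any other shape."""
    parts = cadence.split("_")
    ok = (
        len(parts) == 3
        and parts[0] == "every"
        and parts[2] == "steps"
        and parts[1].isdigit()
    )
    if not ok:
        raise ValueError(f"unrecognized cadence: {cadence!r}")
    return int(parts[1])


def load_recipe(raw: dict) -> Recipe:
    """Validate an already-parsed recipe mapping and return a typed Recipe."""
    schedule = Schedule(**raw["schedule"])
    if schedule.warmup_steps > schedule.max_steps:
        raise ValueError("warmup_steps exceeds max_steps")
    return Recipe(
        name=raw["name"],
        base_model=raw["base_model"],
        train_uri=raw["data"]["train"],
        eval_uri=raw["data"]["eval"],
        objective=raw["objective"],
        schedule=schedule,
        benchmarks=tuple(raw["eval"]["benchmarks"]),
        eval_every=parse_cadence(raw["eval"]["cadence"]),
    )


# The recipe above, as the mapping a YAML parser would produce.
RAW = {
    "name": "sft-catalog-assistant-v4",
    "base_model": "base-8b",
    "data": {
        "train": "s3://datasets/catalog-assistant/train",
        "eval": "s3://datasets/catalog-assistant/eval",
    },
    "objective": "sft",
    "schedule": {"lr": 2.0e-5, "warmup_steps": 200, "max_steps": 4000, "precision": "bf16"},
    "eval": {"benchmarks": ["catalog_qa", "synopsis_faithfulness"], "cadence": "every_500_steps"},
}

recipe = load_recipe(RAW)
```

A missing key raises `KeyError` and an unknown schedule field raises `TypeError`, so a malformed recipe fails at load time rather than partway through a run.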

Figure: Post-training pipeline. Each stage is independent and cacheable.

Where the wins came from

Most of the speedup came not from GPU kernels but from removing serial bottlenecks in the surrounding pipeline. Data prep, eval, and checkpoint syncing were each blocking the critical path; parallelizing them dropped wall-clock time by more than half. The trainer never got faster. The waiting got shorter.
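The idea of taking eval and checkpoint sync off the critical path can be illustrated with a small sketch. Everything here is an assumption for illustration: `train_steps`, `evaluate`, and `sync` are stand-in functions, and a thread pool stands in for whatever actually runs those stages in the cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def train_steps(start, stop):
    """Stand-in for a block of training steps; returns a checkpoint id."""
    return f"ckpt-{stop}"


def evaluate(ckpt):
    """Stand-in for running benchmarks against a checkpoint."""
    return (ckpt, "eval-done")


def sync(ckpt):
    """Stand-in for copying a checkpoint to remote storage."""
    return (ckpt, "synced")


def run(max_steps, every):
    """Train in blocks; eval and sync run in the background after each block."""
    pending = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        for start in range(0, max_steps, every):
            ckpt = train_steps(start, start + every)
            # Submit and move on: training does not wait for eval or sync.
            pending.append(pool.submit(evaluate, ckpt))
            pending.append(pool.submit(sync, ckpt))
        # Collect results only after the last training block has been issued.
        return [future.result() for future in pending]


results = run(max_steps=1000, every=500)
```

The training loop itself is unchanged; the only difference from a serial version is that `evaluate` and `sync` are submitted rather than called inline.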

Stage	Single-node prototype	Cluster stack
Config	bespoke script per experiment	one declarative recipe (data, objective, schedule, eval)
Data prep	re-tokenized every run	de-duplicated and cached across runs
Scheduling	one job owns the node	elastic preemption fills idle nodes
Critical path	prep, train, eval, sync run serially	each stage parallel against cached artifacts

What we're still figuring out

Reward model drift across experiments is our biggest open problem. Small shifts in reference data can cause optimization to chase different objectives, and detecting that early is an active area of work.
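One simple early-warning check is to score a fixed probe set with the reward model from each experiment and compare the score distributions. The sketch below is an illustration under stated assumptions, not a description of how the platform detects drift: `drift_score`, `has_drifted`, the fixed probe set, and the 0.5 threshold are all hypothetical.

```python
from statistics import mean, pstdev


def drift_score(baseline, current):
    """Mean shift of current scores from baseline, in baseline standard deviations."""
    if len(baseline) != len(current):
        raise ValueError("both runs must score the same probe set")
    spread = pstdev(baseline)
    if spread == 0:
        raise ValueError("baseline scores have no spread")
    return (mean(current) - mean(baseline)) / spread


def has_drifted(baseline, current, threshold=0.5):
    """Flag a run whose probe-set scores moved more than `threshold` deviations."""
    return abs(drift_score(baseline, current)) > threshold


# Reward scores on the same five probe prompts from two experiments (made-up numbers).
baseline_scores = [0.0, 1.0, 2.0, 3.0, 4.0]
shifted_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
stable_scores = [0.1, 1.1, 2.1, 3.1, 4.1]
```

A mean shift only catches uniform movement; a reward model that reorders the probe set while keeping the same mean would pass this check, which is part of why the problem is still open.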

Throughput was the easy half. Knowing the run aligned to the objective we meant is the half we are still building.

Most of the work on a production LLM happens after pre-training: fine-tuning, preference alignment, domain adaptation. We took that stack from a single-node prototype to a cluster running several experiments a day. Three changes did the work.

One recipe format. Every experiment used to be a bespoke script. Now a single YAML file captures the data, the objective, the schedule, and the eval, and the same scheduler and trainer run every job.

Cached data prep. We de-duplicate datasets and cache the tokenized shards, so the same training data isn't re-tokenized on every run.

Elastic scheduling. Smaller jobs preempt idle nodes between big runs, so the cluster doesn't sit empty waiting on one experiment.

Where the wins came from

Not from faster GPU kernels. From removing serial bottlenecks around the trainer. Data prep, eval, and checkpoint sync each blocked the critical path; running them in parallel cut wall-clock time by more than half.

declarative recipe instead of per-experiment scripts;
de-duplicated, cached data prep instead of re-tokenizing;
elastic preemption instead of one job per node.

The trainer never got faster. The waiting got shorter.

Still open

Reward model drift is the hard part. Small shifts in reference data push optimization toward different objectives, and catching that early is work we haven't finished.

Sources

Our internal post-training platform notes (2026): the recipe format, caching layers, and elastic scheduler described here.
Goodhart's Law, in Marilyn Strathern's formulation (1997): "When a measure becomes a target, it ceases to be a good measure," the shape of the reward-drift problem.
