Model serialization and deserialization — status
This page gives a complete picture of all the model save / load paths we
currently support (or claim to support) across DDP / single-device and FSDP,
for both state_dict_type: full (single .ckpt file) and
state_dict_type: sharded (directory with *.distcp + meta.pt).
We use this doc as the source of truth for what is expected to work
out of the box versus what is broken / missing today. The
existing per-feature docs (extract-checkpoint.md, push-to-hub.md,
model-loading-for-inference.md, llms.md) are partially outdated; the
cross-cutting view here is what they should converge to.
Concepts and artifacts
There are three on-disk artifacts in this system:
| Name | What it is | Created by | Layout |
|---|---|---|---|
| Full checkpoint, single-file | All Lightning state (model + optimizer + EMA + callbacks + trainer/global_step + hyperparameters) in one pickle. | DDP / single-device training. FSDP with state_dict_type: full (rank 0 gathers). | last.ckpt is a file. |
| Full checkpoint, sharded | One *.distcp shard per rank + a single meta.pt written by rank 0 (trainer state, callbacks, EMA, …). | FSDP with state_dict_type: sharded (the xlm-core default in trainer_strategy/fsdp.yaml). | last.ckpt/ is a directory. |
| Model-only checkpoint | Just model.state_dict(). Either a single .safetensors / .pt file, or the HF sharded layout model.safetensors.index.json + model-*-of-*.safetensors. | extract_checkpoint, consolidate_model_checkpoint, Harness.save_model_weights, Harness.push_to_hub. | File, or directory containing the HF index file. |
The two full layouts are how training itself writes state; both contain enough
information to resume training. The model-only layout is what we publish to
the HF Hub and what we point model_only_checkpoint_path (or hub.repo_id) at
for inference.
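
As a rough illustration of how the three layouts differ on disk, the sketch below guesses the layout from a path. The function name classify_checkpoint and its return labels are hypothetical and do not exist in the codebase; the only facts taken from this page are the file names and extensions listed in the table above.

```python
from pathlib import Path


def classify_checkpoint(path: str) -> str:
    """Guess which on-disk layout a path holds (heuristic, hypothetical helper)."""
    p = Path(path)
    if p.is_dir():
        # Sharded full checkpoint: per-rank *.distcp shards plus meta.pt.
        if any(p.glob("*.distcp")) and (p / "meta.pt").is_file():
            return "full-sharded"
        # HF sharded model-only layout: a directory holding the index file.
        if (p / "model.safetensors.index.json").is_file():
            return "model-only"
        return "unknown"
    if p.is_file():
        if p.suffix == ".ckpt":
            return "full-single-file"
        if p.suffix in {".safetensors", ".pt"} or p.name == "model.safetensors.index.json":
            return "model-only"
    return "unknown"
```

Note that the directory check runs first, so a sharded last.ckpt/ directory is not mistaken for a single-file checkpoint just because its name ends in .ckpt.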
Source-of-truth code:
- Training entry: xlm/commands/lightning_train.py
- Eval / generate: xlm/commands/lightning_eval.py, xlm/commands/lightning_generate.py
- Inference loader (used by eval / generate / push_to_hub): xlm/utils/model_loading.py (load_model_for_inference)
- Extract full → model-only (single-file source): xlm/commands/extract_checkpoint.py
- Consolidate FSDP sharded → model-only safetensors: xlm/utils/consolidate_model_checkpoint.py
- Resume helpers (file or distcp directory): xlm/utils/checkpoint_paths.py
- Hub download + safetensors I/O: xlm/utils/hf_hub.py
- Hub push: xlm/commands/push_to_hub.py → Model.push_to_hub (PyTorchModelHubMixin)
- EMA-aware Harness checkpoint helpers: Harness.from_checkpoint / Harness.save_model_weights / Harness.push_to_hub in xlm/harness.py
Status matrix
Legend: ✅ works, ⚠️ works but with caveats (see notes), ❌ broken / not supported in current code, — not applicable.
Within each workflow, "DDP / single-device" means a single .ckpt file
(this is also the FSDP state_dict_type: full layout). "FSDP sharded"
means a directory with *.distcp shards.
1. Save full checkpoint during training (Lightning writes it)
| Mode | Layout produced | Status | Notes |
|---|---|---|---|
| DDP / single-device | single .ckpt file | ✅ | Lightning's default ModelCheckpoint. |
| FSDP state_dict_type: full | single .ckpt file | ⚠️ | Works; rank 0 gathers full unsharded params + optimizer state. Prohibitive at 7B+. |
| FSDP state_dict_type: sharded (xlm-core default) | directory last.ckpt/ with *.distcp + meta.pt | ✅ | This is what trainer_strategy/fsdp.yaml produces. OnExceptionCheckpoint is risky under FSDP; drop it. See llms.md §5. |
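
To make the "prohibitive at 7B+" note for state_dict_type: full concrete, here is a back-of-the-envelope estimate of what rank 0 holds after the gather. It assumes fp32 storage and an Adam-style optimizer with two extra tensors per parameter; both are assumptions for illustration, not measurements from this repo, and the helper name is hypothetical.

```python
def rank0_gather_bytes(
    n_params: int, bytes_per_param: int = 4, optimizer_copies: int = 2
) -> int:
    """Bytes rank 0 holds for full params plus optimizer state (rough estimate).

    Assumes fp32 parameters and an Adam-style optimizer that keeps
    `optimizer_copies` extra tensors per parameter.
    """
    return n_params * bytes_per_param * (1 + optimizer_copies)


total = rank0_gather_bytes(7_000_000_000)
print(f"{total / 1e9:.0f} GB")  # 84 GB under the assumptions above
```

Under these assumptions a 7B model needs roughly 84 GB of host memory on rank 0 alone, before any pickling overhead.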
2. Resume training from a full checkpoint (lightning_train.py)
| Source | Status | Notes |
|---|---|---|
| Explicit resume_checkpoint_path = single .ckpt file (DDP / FSDP-full) | ✅ | Validated by resolve_explicit_resume_checkpoint_path. |
| Explicit resume_checkpoint_path = FSDP sharded directory (*.distcp + meta.pt) | ✅ | Same helper accepts a dir with *.distcp. |
| Auto-pickup on_exception.ckpt / last.ckpt (file) | ✅ | find_auto_resume_checkpoint accepts a file. |
| Auto-pickup on_exception.ckpt / last.ckpt (sharded directory) | ✅ | Same helper accepts a directory with *.distcp. |
| Cross-world-size resume (sharded, W → W′) | ⚠️ | Supported by torch.distributed.checkpoint for sharded ckpts; model architecture and FSDP wrap policy must match. Not exercised by tests in this repo. |
Seeding a new training run from a pre-trained model-only checkpoint (no optimizer state, no EMA, no global_step) is a distinct path; see workflow #3, Seeding from existing model weights, below.
3. Seeding from existing model weights (lightning_train.py)
When starting a new training run with pre-trained weights,
lightning_train.py supports loading only the model parameters into
lightning_module.model before trainer.fit. This is not a resume:
no optimizer state, no LR-scheduler state, no global_step, no EMA, no
trainer/callback state.
Precedence: if an explicit resume_checkpoint_path is set, or an
auto-resume hit (on_exception.ckpt / last.ckpt) is found under
checkpointing_dir, that wins and the seeding sources below are
ignored (a rank-zero ERROR log is emitted explaining the override).
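
The precedence rule can be sketched as a small pure function. This is an illustration only: choose_weight_source is a hypothetical name, and the ordering within each group (explicit resume before auto-resume, local model-only path before Hub) is an assumption that this page does not state.

```python
from typing import Optional, Tuple


def choose_weight_source(
    resume_checkpoint_path: Optional[str],
    auto_resume_hit: Optional[str],
    model_only_checkpoint_path: Optional[str],
    hub_repo_id: Optional[str],
) -> Tuple[str, Optional[str]]:
    """Return (kind, location); kind is 'resume', 'seed', or 'fresh'."""
    # A full-checkpoint resume always wins over any seeding source.
    # Checking the explicit path before the auto-pickup is an assumption.
    for candidate in (resume_checkpoint_path, auto_resume_hit):
        if candidate:
            return ("resume", candidate)
    # Seeding sources are consulted only when nothing can be resumed.
    # The order between these two is also an assumption.
    for candidate in (model_only_checkpoint_path, hub_repo_id):
        if candidate:
            return ("seed", candidate)
    return ("fresh", None)
```

For example, choose_weight_source(None, "ckpts/last.ckpt", "weights.safetensors", None) returns ("resume", "ckpts/last.ckpt"): the auto-resume hit overrides the configured seed.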
| Source | DDP / single-device | FSDP (sharded train output) | Notes |
|---|---|---|---|
| model_only_checkpoint_path = single .safetensors / .pt / .bin file | ✅ | ✅ | Loaded via load_model_weights_into_model(map_location="cpu") into the bare nn.Module on every rank; FSDP wrap during trainer.fit setup then shards the populated module. |
| model_only_checkpoint_path = HF sharded index (model.safetensors.index.json) | ✅ | ✅ | _load_sharded_safetensors_into_model loads one shard at a time; peak host RAM ≈ model + one shard. This is the only memory-friendly option for FSDP seeding at multi-B scale. |
| hub.repo_id (single-file safetensors) | ✅ | ✅ | Every rank calls download_model_weights; HF cache file locks deduplicate the actual download. Loading is then identical to the single-file local case. |
| hub.repo_id (sharded safetensors) | ✅ | ✅ | _download_sharded_safetensors pulls the index + all shards into the HF cache; they are then loaded one shard at a time. |
| hub.repo_id (legacy pytorch_model.bin) | ✅ | ✅ | torch.load + load_state_dict. Single-file path → high peak CPU RAM under FSDP. |
| strict_model_only_load: false | ✅ | ✅ | Pass-through to load_state_dict(strict=False). Useful when seeding a head from a base model. Missing / unexpected keys are warned, not raised. |
| skip_init_weights: true combined with any of the above | ✅ | ✅ | Module is constructed under transformers.modeling_utils.no_init_weights(), then weights are loaded on top. Strongly recommended for FSDP seeding at multi-B scale (saves a full random init on every rank that is about to be overwritten). |
FSDP-specific notes on seeding
- Each rank holds the full state dict briefly on CPU. Seeding happens before FSDP wrap, so every rank loads the full weights into its bare module. Peak CPU RAM per rank is roughly model size (single-file source) or model + one shard (HF sharded index). At 7B+, prefer the HF-sharded source layout: extract_checkpoint with post_training.max_shard_size, or a direct call to consolidate_model_checkpoint with max_shard_size, produces it, and standard Hub-published models already use it.
- Seeding loads before the wrap; full-checkpoint resume loads after. This is intentional and is the opposite of workflow #2: FSDP sets restore_checkpoint_after_setup = True, so a resume materializes weights into already-sharded FlatParameters, but a seed fills the bare module first and then the strategy shards it. Both arrive at the same post-wrap shape; the memory profile and the failure modes differ. Practical consequence: an auto_wrap_policy mismatch breaks resume with a load-time error, but breaks seeding silently at wrap time (the module already has weights, FSDP just wraps the whole thing as one unit).
- No EMA on the seeding path. load_model_weights_into_model only sets parameters from state_dict. If you need EMA-averaged weights as the seed, the source file must already contain them (extract_checkpoint apply_ema=True for single-file ckpts; the FSDP-sharded equivalent does not exist today, see workflow #4).
- map_location is hardcoded to "cpu". Don't point at a CUDA device; FSDP wrap is what moves and shards onto GPUs.
- Conflict with full-checkpoint sources. If resume_from_checkpoint is true and either an explicit resume_checkpoint_path or an auto-resume on_exception.ckpt / last.ckpt is present, the seeding sources are dropped, even if both are explicitly set in the config. Rank-zero logs "Resume checkpoint is set; model-only / Hub weight sources are ignored."
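
The "model + one shard" memory bound comes from reading the HF index first and then touching one shard file at a time. The sketch below shows only the planning half, using the standard weight_map field of model.safetensors.index.json; shard_load_plan is a hypothetical helper, not the code behind _load_sharded_safetensors_into_model.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def shard_load_plan(index_path: str) -> Dict[str, List[str]]:
    """Map each shard file to the parameter names it stores."""
    index = json.loads(Path(index_path).read_text())
    plan: Dict[str, List[str]] = defaultdict(list)
    # weight_map maps parameter name -> shard file name.
    for param_name, shard_file in index["weight_map"].items():
        plan[shard_file].append(param_name)
    return dict(plan)
```

A loader built on this plan would open one shard, copy its tensors into the module, and release the shard before opening the next, so at most one shard is resident next to the model.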
4. Convert a full checkpoint into a model-only checkpoint
| Source layout | Tool | Output | Status | Notes |
|---|---|---|---|---|
| Single-file .ckpt (DDP or FSDP-full) | xlm job_type=extract_checkpoint (extract_checkpoint.py) | .pth (torch.save) and/or hub push | ✅ | Uses Harness.from_checkpoint(..., apply_ema=True) → Harness.save_model_weights / Harness.push_to_hub. Supports EMA application. |
| FSDP sharded directory | xlm job_type=extract_checkpoint | .safetensors (and/or Hub as single model.safetensors) | ✅ | Dispatches to consolidate_model_checkpoint; apply_ema must be false (raises if true). Hub path uses load_model_for_inference → Harness.push_to_hub. Optional post_training.max_shard_size for HF-sharded local output. |
| FSDP sharded directory | xlm job_type=extract_checkpoint (consolidate_model_checkpoint via extract_checkpoint) | single .safetensors file, or HF sharded layout when max_shard_size is set (e.g. "5GB") | ✅ | No EMA on this path (apply_ema=false). Optional post_training.max_shard_size for HF-sharded local output. Requires enough CPU RAM for the full model. |
| FSDP sharded directory with EMA application | (none) | — | ❌ | By policy: no EMA on FSDP/sharded extract. Use a single-file full .ckpt and extract_checkpoint with apply_ema=true, or export non-EMA weights from sharded checkpoints. |
The recommended FSDP path at 7B is: train sharded → consolidate to safetensors (single or HF-sharded) on a host with enough CPU RAM → publish / load for inference.
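
To show what a max_shard_size limit means for the output layout, here is a minimal greedy packing sketch. It is an assumption-laden illustration: plan_shards is a hypothetical function, and this page does not say how consolidate_model_checkpoint actually splits tensors.

```python
from typing import Dict, List


def plan_shards(tensor_bytes: Dict[str, int], max_shard_bytes: int) -> List[List[str]]:
    """Greedily pack tensors, in order, into shards of at most max_shard_bytes."""
    shards: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in tensor_bytes.items():
        # Start a new shard when this tensor would overflow the current one.
        if current and current_size + size > max_shard_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        shards.append(current)
    return shards
```

A tensor larger than the limit still gets a shard of its own here, so the limit is a target rather than a hard cap in this sketch.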
5. Load from local disk for eval / generate (lightning_eval.py, lightning_generate.py)
Both commands go through load_model_for_inference and look at the
{prefix}.ckpt_path / {prefix}.checkpoint_path keys (full checkpoint) and
the {prefix}.model_only_checkpoint_path key (model-only).
| Source | Eval (DDP) | Eval (FSDP, sharded training output) | Generate (single device today) | Notes |
|---|---|---|---|---|
| Single-file .ckpt (full) | ✅ | ⚠️ | ✅ | DDP/single-device: Harness.load_from_checkpoint. FSDP eval theoretically wraps after weights are loaded into the bare module, but this path is not exercised in tests and the file-only os.path.isfile check on the config value silently rejects sharded directories (see below). |
| FSDP sharded directory as eval.ckpt_path / generation.ckpt_path | ❌ | ❌ | ❌ | _get_full_checkpoint_path in model_loading.py does if not os.path.isfile(ckpt_path): raise ValueError(...). Sharded directories are rejected before any FSDP-aware loader runs. This is the symmetric gap to workflow #2: train resume accepts dirs, eval/generate does not. |
| Auto-fallback to best.ckpt / last.ckpt (eval only) | ✅ for file | ❌ for sharded dir | — | _get_full_checkpoint_path also uses os.path.isfile for the fallback search; sharded last.ckpt/ directories are never picked up. |
| Local model_only_checkpoint_path = single .safetensors / .pt file | ✅ | ✅ | ✅ | load_model_weights_into_model handles .safetensors and pickle. For FSDP eval, the bare module is built, weights loaded on CPU, then the trainer wraps and shards on validate. |
| Local model_only_checkpoint_path = HF sharded layout (point at the model.safetensors.index.json file) | ✅ | ✅ | ✅ | _load_sharded_safetensors_into_model loads one shard at a time. Peak host RAM ≈ model + one shard. |
| eval.model_only_checkpoint_path together with EMA | ⚠️ | ⚠️ | ⚠️ | EMA is not re-applied at load time. The file must already contain EMA-averaged weights (extract_checkpoint with apply_ema=true on single-file ckpts only; not on the FSDP-sharded extract_checkpoint path, see workflow #4). |
In the inference path, the os.path.isfile checks in _get_full_checkpoint_path (both
the explicit and the fallback branches) are the single most impactful bug: they
make FSDP-sharded eval / generate / push-to-hub all silently unsupported, even when
the rest of the pipeline could handle them.
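
One possible shape for a directory-aware replacement of the file-only check is sketched below. Both function names are hypothetical, the best.ckpt-before-last.ckpt fallback order is an assumption, and the real fix would presumably reuse the helpers in xlm/utils/checkpoint_paths.py rather than re-implement them.

```python
from pathlib import Path
from typing import Optional


def is_usable_checkpoint(path: Path) -> bool:
    """True for a single-file checkpoint or a directory holding *.distcp shards."""
    if path.is_file():
        return True
    return path.is_dir() and any(path.glob("*.distcp"))


def resolve_full_checkpoint(
    explicit: Optional[str], checkpointing_dir: str
) -> Optional[Path]:
    """Explicit path first, then best.ckpt / last.ckpt under checkpointing_dir."""
    if explicit is not None:
        candidate = Path(explicit)
        if not is_usable_checkpoint(candidate):
            raise ValueError(f"Checkpoint path does not exist: {candidate}")
        return candidate
    # Fallback search; files and sharded directories are treated alike.
    for name in ("best.ckpt", "last.ckpt"):
        candidate = Path(checkpointing_dir) / name
        if is_usable_checkpoint(candidate):
            return candidate
    return None
```

The point of the sketch is that the explicit and fallback branches share one predicate, so a sharded last.ckpt/ directory is accepted in both places instead of being rejected by os.path.isfile.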
6. Load model-only weights from the HF Hub for eval / generate
| Hub layout | Eval | Generate | Notes |
|---|---|---|---|
| Single-file model.safetensors | ✅ | ✅ | hf_hub.download_model_weights tries this first. |
| Sharded safetensors (model.safetensors.index.json + model-*-of-*.safetensors) | ✅ | ✅ | Falls through to _download_sharded_safetensors, then _load_sharded_safetensors_into_model (per-shard loading). |
| Legacy pytorch_model.bin | ✅ | ✅ | Tried last; loaded via torch.load + load_state_dict. |
| hub.revision=<branch/tag/commit> | ✅ | ✅ | Passed through to hf_hub_download. |
| init_dtype: bfloat16 / float16 + skip_init_weights: true (large-model friendly) | ✅ | ✅ | Honored by load_model_for_inference: both apply when weights come from a model-only source (Hub or local). They are not consulted on the full-checkpoint branch of load_from_checkpoint (init_dtype is still applied as default dtype, but skip_init_weights is ignored). |
FSDP eval reaching for Hub weights is the same code path as local
model-only loading — weights are loaded into the bare module on CPU, then
trainer.validate wraps and shards. Generate today does not wrap (manual
predict loop), so it is effectively single-device.
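
The probe order in the table above (single-file safetensors first, then the sharded index, then the legacy pickle) can be written as a small lookup. This sketch works on a plain list of repo file names so it needs no network access; pick_hub_layout and CANDIDATES are hypothetical names, not part of hf_hub.py.

```python
from typing import Iterable, Optional, Tuple

# Probe order as described in the table above.
CANDIDATES: Tuple[Tuple[str, str], ...] = (
    ("model.safetensors", "single-file safetensors"),
    ("model.safetensors.index.json", "sharded safetensors"),
    ("pytorch_model.bin", "legacy pickle"),
)


def pick_hub_layout(repo_files: Iterable[str]) -> Optional[Tuple[str, str]]:
    """Return (entry file, layout label) for the first layout found, else None."""
    present = set(repo_files)
    for filename, layout in CANDIDATES:
        if filename in present:
            return filename, layout
    return None
```

With a repo that holds both model.safetensors.index.json and pytorch_model.bin, the function picks the sharded safetensors layout, matching the "tried last" note for the legacy file.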
7. Push to the HF Hub (push_to_hub.py)
push_to_hub ultimately calls Model.push_to_hub(...) (i.e.
PyTorchModelHubMixin.push_to_hub), which in turn calls
PyTorchModelHubMixin._save_pretrained — that saves
model.safetensors (a single file) plus generated config.json / README,
then uploads the folder.
| Weight source | Status | Notes |
|---|---|---|
| Local single-file .ckpt via hub_checkpoint_path | ✅ | load_model_for_inference(config_prefix="") with manual_ema_restore=True → EMA is applied via Harness.from_checkpoint. |
| Local FSDP sharded directory via hub_checkpoint_path | ❌ | Same _get_full_checkpoint_path / os.path.isfile issue as workflow #5. Consolidate to model-only safetensors first and pass the result via model_only_checkpoint_path. |
| Local model-only single file via model_only_checkpoint_path (.safetensors / .pt) | ✅ | Module is instantiated and weights loaded; the upload re-serializes a single model.safetensors. EMA is not re-applied; make sure the file already has EMA weights. |
| Local model-only HF-sharded layout via model_only_checkpoint_path=…/model.safetensors.index.json | ⚠️ | Loading works (per-shard). On upload, PyTorchModelHubMixin._save_pretrained only writes a single model.safetensors, so the multi-shard layout is flattened. For ≤ ~50 GB (Hub single-file limit) and enough host RAM that is fine; for larger models the push is unsupported. |
| Push a pre-built multi-shard safetensors folder verbatim | ❌ | No command does this. push_to_hub always re-serializes through _save_pretrained. You'd have to use HfApi.upload_folder yourself. |
| hub.branch=<name> | ✅ | Branch is created via HfApi.create_branch if missing. |
| hub.commit_message defaulting | ✅ | Falls back to a generated message mentioning the source paths. |
| Optional hub_checkpoint_path + model_only_checkpoint_path both set | ⚠️ | Full ckpt wins; conflict is logged as an error but not fatal. See model-loading-for-inference.md. |
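
For the unsupported "push a pre-built multi-shard folder verbatim" case, a manual workaround could look like the sketch below: check that every shard named in the index is present, then hand the folder to HfApi.upload_folder. The helper names are hypothetical, nothing here exists in push_to_hub.py, and the upload call is shown untested against a real repo.

```python
import json
from pathlib import Path
from typing import List


def missing_shards(folder: str) -> List[str]:
    """List shard files named in the index that are absent from the folder."""
    root = Path(folder)
    index = json.loads((root / "model.safetensors.index.json").read_text())
    needed = sorted(set(index["weight_map"].values()))
    return [name for name in needed if not (root / name).is_file()]


def upload_sharded_folder(folder: str, repo_id: str) -> None:
    """Upload the folder as-is, refusing if any indexed shard is missing."""
    missing = missing_shards(folder)
    if missing:
        raise FileNotFoundError(f"Missing shards: {missing}")
    # Imported lazily so the check above works without huggingface_hub installed.
    from huggingface_hub import HfApi

    HfApi().upload_folder(folder_path=folder, repo_id=repo_id, repo_type="model")
```

Validating before uploading avoids publishing an index that points at shards which never reached the Hub.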
Cross-cutting issues and gaps
Pulling the broken cells out of the matrix gives this concrete TODO list:
- _get_full_checkpoint_path uses os.path.isfile (both the explicit branch and the best.ckpt / last.ckpt fallback). This is the symmetric sibling of the lightning_train.py fix and the single biggest blocker:
  - eval.checkpoint_path / eval.ckpt_path = FSDP sharded dir → rejected.
  - generation.ckpt_path = FSDP sharded dir → rejected.
  - hub_checkpoint_path = FSDP sharded dir → rejected.
  - Eval auto-fallback never picks up last.ckpt/ if it is a sharded dir. The training-side path already has is_usable_lightning_train_checkpoint_path in xlm/utils/checkpoint_paths.py; the inference-side check should be refactored on top of the same helpers.
- extract_checkpoint FSDP UX. The extract_checkpoint command now dispatches on sharded dirs; extract-checkpoint.md documents apply_ema=false and optional max_shard_size. Call consolidate_model_checkpoint directly when you need a Harness-free export.
- No EMA on FSDP / sharded extract (by design). consolidate_model_checkpoint only exports state_dict weights. For EMA-averaged publication, use a single-file full checkpoint and extract_checkpoint with apply_ema=true, or save EMA weights during training to a separate artifact.
- Hub upload paths differ. push_to_hub / Harness.push_to_hub still serialize a single model.safetensors via PyTorchModelHubMixin. For multi-shard Hub uploads, consolidate locally with max_shard_size and upload the output folder via HfApi.upload_folder. Models above the single-file Hub limit need this path, not job_type=push_to_hub alone.
- Documentation drift. Right now the relevant material is spread across four files and they don't agree:
  - docs/guide/extract-checkpoint.md: covers both single-file and FSDP-sharded extract_checkpoint; see also consolidate_model_checkpoint for Harness-free export.
  - docs/guide/push-to-hub.md: single-file only; model_only_checkpoint_path is described but the sharded safetensors index variant is not.
  - wiki/model-loading-for-inference.md: accurate for inference loading but does not enumerate which sources are accepted as full ckpts (and so does not flag the os.path.isfile gap).
  - docs/guide/llms.md §5: FSDP sharded checkpoints, consolidation, and extract_checkpoint behavior.

  This page (serialization.md) is intended to subsume the cross-cutting parts; the others can shrink to feature-specific usage and link here.
What "works as expected today" — short version
If you are on this page to figure out what is safe to rely on right now:
- DDP / single-device, single-file .ckpt: every workflow (train, resume, extract, eval, generate, push to hub) is supported.
- FSDP sharded training and resume: train, save, and resume are supported, including auto-pickup of last.ckpt/ and on_exception.ckpt/ directories.
- FSDP → model-only / Hub: run xlm job_type=extract_checkpoint (with apply_ema=false) on a sharded directory, or call consolidate_model_checkpoint directly. Then use model_only_checkpoint_path pointing at .safetensors or model.safetensors.index.json for eval / generate / push_to_hub as needed.
- HF Hub for inference (eval / generate): works for single-file, sharded safetensors, and legacy pytorch_model.bin, on both branches and revisions.
What is not safe to assume today, even though the configs and docs imply it:
- Pointing eval / generate / push_to_hub at an FSDP sharded directory (*.ckpt/ with *.distcp) fails with "Checkpoint path does not exist".
- Asking extract_checkpoint to use apply_ema=true on an FSDP sharded directory fails by design; use a single-file checkpoint for EMA export.
- Publishing an EMA-averaged checkpoint that was saved only as FSDP sharded: use state_dict_type: full for a single-file export, or a separate EMA artifact; we do not merge EMA from sharded dirs in extract_checkpoint.
- Publishing a model larger than the Hub single-file limit via job_type=push_to_hub alone: _save_pretrained writes one file; consolidate with max_shard_size and upload the folder via HfApi.upload_folder.