Adding a maintained model

This guide covers adding a first-party model family under xlm-models/ in this repository. For a model in a separate repo, see External models.

Reference implementations: arlm, ilm, mlm, mdlm, flexmdm, dream, llada. Conceptual comparison: Models overview.

Quick start

xlm-scaffold my_family

This scaffolds Python modules and Hydra configs, and registers the family in xlm_models.json. See External models for scaffold details (the same tool applies to in-repo families).

Before running the scaffold, set up your environment: see Install from source in the Contributing hub.

Four components

Every working model implements four pieces that plug into the harness:

Component	Role	Typical module
Model	Neural network and forward pass	`model_<family>.py`
Loss	Training objective	`loss_<family>.py`
Predictor	Inference / generation	`predictor_<family>.py`
Collator	Batch construction	`datamodule_<family>.py`

You also typically add types_<family>.py (batch TypedDicts) and metrics_<family>.py (metric update functions).
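As a rough illustration, the two optional modules might look like the sketch below. The names `MyFamilyBatch`, its fields, `RunningMean`, and `mean_loss_update` are hypothetical, and plain Python lists stand in for tensors; check an existing family such as arlm for the real fields and signatures.

```python
from typing import List, TypedDict


class MyFamilyBatch(TypedDict):
    """Hypothetical batch layout; real families define their own fields."""

    input_ids: List[List[int]]       # token ids, one row per example
    attention_mask: List[List[int]]  # 1 for real tokens, 0 for padding
    target_ids: List[List[int]]      # training targets aligned with input_ids


class RunningMean:
    """Minimal stand-in for a metric object that accumulates a mean."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def update(self, value: float) -> None:
        self.total += value
        self.count += 1

    def compute(self) -> float:
        return self.total / self.count if self.count else 0.0


def mean_loss_update(metric: RunningMean, loss_value: float) -> None:
    """Hypothetical metric update function: feed one step's loss into the metric."""
    metric.update(loss_value)


batch: MyFamilyBatch = {
    "input_ids": [[5, 6, 7, 0]],
    "attention_mask": [[1, 1, 1, 0]],
    "target_ids": [[6, 7, 8, 0]],
}
metric = RunningMean()
for step_loss in (2.0, 1.0):
    mean_loss_update(metric, step_loss)
```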

The components of a family are designed to work with each other, not with components from other families. See harness.py for the LossFunction and Predictor protocols, and datamodule.py for Collator.
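To give a feel for protocol-based interfaces, here is a minimal sketch using `typing.Protocol`. The method names and signatures below are assumptions made for illustration and are not copied from harness.py or datamodule.py; the `*Sketch` and `Toy*` classes are hypothetical.

```python
from typing import Any, Dict, List, Protocol, runtime_checkable

Batch = Dict[str, Any]


@runtime_checkable
class LossFunctionSketch(Protocol):
    """Assumed shape: a callable that maps a batch to a dict containing the loss."""

    def __call__(self, batch: Batch) -> Dict[str, Any]: ...


@runtime_checkable
class PredictorSketch(Protocol):
    """Assumed shape: an object with a predict method over a batch."""

    def predict(self, batch: Batch) -> Dict[str, Any]: ...


@runtime_checkable
class CollatorSketch(Protocol):
    """Assumed shape: a callable that turns a list of examples into one batch."""

    def __call__(self, examples: List[Dict[str, Any]]) -> Batch: ...


class ToyLoss:
    def __call__(self, batch: Batch) -> Dict[str, Any]:
        # Placeholder "loss": the number of rows in the batch.
        return {"loss": float(len(batch["input_ids"]))}


class ToyPredictor:
    def predict(self, batch: Batch) -> Dict[str, Any]:
        # Placeholder "generation": echo the inputs back.
        return {"ids": batch["input_ids"]}


toy_batch: Batch = {"input_ids": [[1, 2], [3, 4]]}
loss_out = ToyLoss()(toy_batch)
pred_out = ToyPredictor().predict(toy_batch)
```

Structural typing means a class satisfies a protocol by having the right methods, without inheriting from it. Note that `isinstance` against a runtime-checkable protocol only checks that the methods exist, not their signatures.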

Directory layout

xlm-models/<family>/
├── __init__.py
├── types_<family>.py
├── model_<family>.py
├── loss_<family>.py
├── predictor_<family>.py
├── datamodule_<family>.py
├── metrics_<family>.py
└── configs/
    ├── model/
    ├── model_type/
    ├── collator/
    ├── datamodule/
    └── experiment/
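A small helper can check a new family against the layout above. This is a sketch, not a tool shipped with the repository; `expected_paths` and `missing_paths` are hypothetical names.

```python
from pathlib import Path
from typing import List


def expected_paths(family: str) -> List[str]:
    """Relative paths from the layout above, for one family."""
    stems = ("types", "model", "loss", "predictor", "datamodule", "metrics")
    files = ["__init__.py"] + [f"{stem}_{family}.py" for stem in stems]
    config_groups = ("model", "model_type", "collator", "datamodule", "experiment")
    dirs = [f"configs/{group}" for group in config_groups]
    return files + dirs


def missing_paths(root: Path, family: str) -> List[str]:
    """Return the expected files and directories that do not exist yet."""
    family_dir = root / family
    return [rel for rel in expected_paths(family) if not (family_dir / rel).exists()]
```

For example, `missing_paths(Path("xlm-models"), "my_family")` returns an empty list once every module and config directory is in place.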

Checklist

Implement the family under xlm-models/<family>/ (mirror existing families).
Add Hydra configs under xlm-models/<family>/configs/.
Register in xlm_models.json (xlm-scaffold does this).
Add tests under tests/models/<family>/ using mixins in _base.py — see Unit tests.
Optionally add docs/models/<family>.md, a mkdocs.yml nav entry, and an api-autonav module path.
Optionally add a CLI smoke entry in test_smoke.py once a minimal experiment config exists.

Hydra configuration

Configs for a model family live under xlm-models/<family>/configs/. Hydra discovers them via the external-models search path (see external_models.py).

Logical composition:

experiment
├── datamodule (+ collators per dataloader)
└── model components
    ├── model (architecture)
    ├── model_type (loss, predictor, metrics)
    └── noise_schedule, trainer, …

Collator configs

A family often defines several collators:

Base training — unconditional LM batches (masking, padding, targets).
Seq2seq training — prefix + suffix in one batch.
Seq2seq prediction — prompt-only or prompt + separate target for metrics.

Example (ARLM default collator):

# xlm-models/arlm/configs/collator/default_arlm.yaml
_target_: arlm.datamodule_arlm.DefaultARLMCollator
block_size: ${block_size}
tokenizer: ${global_components:tokenizer}
noise_schedule: ${global_components:noise_schedule}
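The keys in a collator config become constructor arguments of the `_target_` class. The sketch below shows a toy collator with the same three arguments; it is not the real `DefaultARLMCollator`. `ToyTokenizer` and `ToyCollator` are hypothetical, plain lists stand in for tensors, and the padding and truncation behavior is an assumption about what a base collator typically does.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ToyTokenizer:
    """Hypothetical tokenizer stub exposing only a padding id."""

    pad_token_id: int = 0


class ToyCollator:
    """Pads or truncates each example to block_size and builds an attention mask."""

    def __init__(self, block_size: int, tokenizer: ToyTokenizer, noise_schedule: Any = None) -> None:
        self.block_size = block_size
        self.tokenizer = tokenizer
        self.noise_schedule = noise_schedule  # accepted for config parity; unused in this sketch

    def __call__(self, examples: List[Dict[str, List[int]]]) -> Dict[str, List[List[int]]]:
        pad = self.tokenizer.pad_token_id
        input_ids: List[List[int]] = []
        attention_mask: List[List[int]] = []
        for example in examples:
            ids = example["input_ids"][: self.block_size]  # truncate long examples
            n_pad = self.block_size - len(ids)             # pad short ones
            input_ids.append(ids + [pad] * n_pad)
            attention_mask.append([1] * len(ids) + [0] * n_pad)
        return {"input_ids": input_ids, "attention_mask": attention_mask}


collator = ToyCollator(block_size=4, tokenizer=ToyTokenizer())
batch = collator([{"input_ids": [5, 6]}, {"input_ids": [1, 2, 3, 4, 5]}])
```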

Datamodule config

Wire collators per split/dataloader name, e.g. xlm-models/arlm/configs/datamodule/star_easy_arlm.yaml:

# @package _global_
defaults:
  - default
  - /collator@datamodule.dataset_managers.train.lm.collator: default_arlm
  - /collator@datamodule.dataset_managers.val.lm.collator: default_arlm

datamodule:
  print_batch_fn: arlm.datamodule_arlm.print_batch_arlm

tags:
  dataset: star_easy_arlm
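In a Hydra defaults list, an entry of the form `/group@package: option` places the selected config at the given package path. The sketch below mimics that effect with plain dictionaries so the resulting nesting is easy to see. It is not Hydra's implementation, `place_at` is a hypothetical helper, and the collator config is reduced to its `_target_` key.

```python
from typing import Any, Dict


def place_at(root: Dict[str, Any], package: str, value: Any) -> Dict[str, Any]:
    """Store value under a dotted package path, creating parent dicts as needed."""
    *parents, leaf = package.split(".")
    node = root
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return root


collator_cfg = {"_target_": "arlm.datamodule_arlm.DefaultARLMCollator"}
cfg: Dict[str, Any] = {}
for split in ("train", "val"):
    # Mirrors: /collator@datamodule.dataset_managers.<split>.lm.collator: default_arlm
    place_at(cfg, f"datamodule.dataset_managers.{split}.lm.collator", dict(collator_cfg))
```

After the loop, `cfg["datamodule"]["dataset_managers"]["train"]["lm"]["collator"]` holds the collator config, and the same holds for `val`.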

Model config

Architecture hyperparameters only, e.g. xlm-models/<family>/configs/model/<family>.yaml:

# @package _global_
model:
  _target_: my_family.model_my_family.MyFamilyModel
  num_embeddings: ${tokenizer:full_vocab_size}
  d_model: 768
  # ...
tags:
  model: my_family_small

Model type config

Loss, predictor, and default metrics, e.g. xlm-models/arlm/configs/model_type/arlm.yaml:

# @package _global_
defaults:
  - /metrics@reported_metrics.train.lm.accumulated_loss: accumulated_loss
  - /metrics@reported_metrics.val.lm.accumulated_loss: accumulated_loss
  - /metrics@reported_metrics.test.lm.accumulated_loss: accumulated_loss

lightning_module:
  _target_: xlm.harness.Harness

loss:
  _target_: arlm.loss_arlm.ARLMLoss

predictor:
  _target_: arlm.predictor_arlm.ARLMPredictor
  tokenizer: ${lightning_module:tokenizer}
  noise_schedule: ${lightning_module:noise_schedule}
  max_steps: ${block_size}
  max_length: ${eval:${block_size}+${oc.select:input_block_size,0}}

reported_metrics:
  train:
    lm:
      accumulated_loss:
        prefix: train/lm
        update_fn: arlm.metrics_arlm.mean_metric_update_fn
  # val / test analogous
tags:
  model_type: arlm
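The `max_length` interpolation adds `block_size` to `input_block_size`, falling back to 0 when `input_block_size` is not set (`oc.select` is OmegaConf's built-in resolver for reading a key with a default). The sketch below reproduces that arithmetic in plain Python, assuming `eval` is a project-registered resolver that evaluates the expression; `predictor_max_length` is a hypothetical helper.

```python
from typing import Any, Mapping


def predictor_max_length(cfg: Mapping[str, Any]) -> int:
    """Plain-Python equivalent of ${eval:${block_size}+${oc.select:input_block_size,0}}."""
    # oc.select:input_block_size,0 -> use 0 when the key is absent
    input_block_size = cfg.get("input_block_size", 0)
    return cfg["block_size"] + input_block_size


unconditional = predictor_max_length({"block_size": 128})
seq2seq = predictor_max_length({"block_size": 128, "input_block_size": 64})
```

With `block_size: 128` and no `input_block_size`, `max_length` equals 128; setting `input_block_size: 64` raises it to 192.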

Experiment config

Compose overrides, e.g. xlm-models/arlm/configs/experiment/star_easy_arlm.yaml:

# @package _global_
defaults:
  - override /datamodule: star_easy_arlm
  - override /noise_schedule: dummy
  - override /model_type: arlm
  - override /model: rotary_transformer_small_arlm

per_device_batch_size: 64
block_size: 128

Run a smoke train, substituting your family's experiment name:

xlm job_type=train job_name=my_family_debug experiment=star_easy_my_family debug=overfit

Harness integration

Harness wires your components from config:

Instantiates the model from model/ config.
Configures loss with model and tokenizer.
Configures predictor with model, tokenizer, and noise schedule.
Uses your collator via the datamodule.
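The wiring steps above can be pictured with the toy classes below. Everything here is hypothetical: `ToyHarness`, `ToyLoss`, `ToyPredictor`, and the `configure` methods are illustrative stand-ins, not the actual `xlm.harness.Harness` API, so consult harness.py for the real hooks.

```python
from typing import Any, Optional


class ToyLoss:
    def __init__(self) -> None:
        self.model: Optional[Any] = None
        self.tokenizer: Optional[Any] = None

    def configure(self, model: Any, tokenizer: Any) -> None:
        # Hypothetical hook: the loss receives the model and tokenizer.
        self.model, self.tokenizer = model, tokenizer


class ToyPredictor:
    def __init__(self) -> None:
        self.model: Optional[Any] = None
        self.tokenizer: Optional[Any] = None
        self.noise_schedule: Optional[Any] = None

    def configure(self, model: Any, tokenizer: Any, noise_schedule: Any) -> None:
        # Hypothetical hook: the predictor also receives the noise schedule.
        self.model, self.tokenizer, self.noise_schedule = model, tokenizer, noise_schedule


class ToyHarness:
    """Shares one model, tokenizer, and noise schedule across the components."""

    def __init__(self, model: Any, loss: ToyLoss, predictor: ToyPredictor,
                 tokenizer: Any, noise_schedule: Any) -> None:
        self.model = model
        self.loss = loss
        self.predictor = predictor
        loss.configure(model=model, tokenizer=tokenizer)
        predictor.configure(model=model, tokenizer=tokenizer, noise_schedule=noise_schedule)


model, tokenizer, schedule = object(), object(), object()
harness = ToyHarness(model, ToyLoss(), ToyPredictor(), tokenizer, schedule)
```

The point of the sketch is that the loss and predictor share the same model and tokenizer objects, which is one reason components are not interchangeable across families.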

Testing

Unit tests (required)

Follow the mixin pattern in Unit tests:

tests/models/<family>/test_model_<family>.py — inherit BaseModelTests
tests/models/<family>/test_loss_<family>.py — inherit BaseLossTests
tests/models/<family>/test_collator_<family>.py — inherit BaseCollatorTests
Add predictor tests as needed

pytest tests/models/<family>/ -v
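The mixin pattern puts shared test methods in a base class and leaves object construction to each family's subclass. The sketch below shows the idea only: `BaseModelTestsSketch`, `make_model`, and `ToyModel` are hypothetical, and the real mixins in _base.py may expect different fixtures or hooks.

```python
from typing import Callable, List


class BaseModelTestsSketch:
    """Hypothetical mixin: shared tests that rely on a subclass-provided factory."""

    def make_model(self) -> Callable[[List[List[int]]], List[List[float]]]:
        raise NotImplementedError("subclasses must build their family's model")

    def test_forward_preserves_batch_size(self) -> None:
        model = self.make_model()
        batch = [[1, 2, 3], [4, 5, 6]]
        assert len(model(batch)) == len(batch)


class ToyModel:
    """Stand-in model: maps each token id to a float."""

    def __call__(self, batch: List[List[int]]) -> List[List[float]]:
        return [[float(token) for token in row] for row in batch]


class TestModelMyFamily(BaseModelTestsSketch):
    """Family-specific test class: supplies the model, inherits the tests."""

    def make_model(self) -> ToyModel:
        return ToyModel()
```

pytest collects `TestModelMyFamily` because of its `Test` prefix, while the base class is skipped under the default naming rules.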

Debug / smoke runs

Quick integration check without full training:

xlm job_type=train job_name=my_family_debug experiment=star_easy_my_family debug=overfit

For CI smoke coverage, append (experiment, job_type) to SMOKE_RUNS in test_smoke.py — see Running tests.
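As a sketch of what such an entry could drive, the snippet below appends a tuple and builds the matching command line from the smoke-run command shown above. The existing contents of `SMOKE_RUNS`, the `smoke_command` helper, and the `job_name` value are assumptions; test_smoke.py may construct its commands differently.

```python
from typing import List, Tuple

# Assumed shape: a list of (experiment, job_type) tuples.
SMOKE_RUNS: List[Tuple[str, str]] = [("star_easy_arlm", "train")]
SMOKE_RUNS.append(("star_easy_my_family", "train"))


def smoke_command(experiment: str, job_type: str) -> List[str]:
    """Build argv for one smoke run, mirroring the debug command in this guide."""
    return [
        "xlm",
        f"job_type={job_type}",
        f"job_name={experiment}_smoke",  # hypothetical naming scheme
        f"experiment={experiment}",
        "debug=overfit",
    ]


commands = [smoke_command(experiment, job_type) for experiment, job_type in SMOKE_RUNS]
```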

Example: ARLM

Piece	Module	Config
Model	model_arlm.py	`configs/model/rotary_transformer_small_arlm.yaml`
Loss	loss_arlm.py	`configs/model_type/arlm.yaml`
Predictor	predictor_arlm.py	`configs/model_type/arlm.yaml`
Collators	datamodule_arlm.py	`configs/collator/default_arlm.yaml`, `seq2seq_arlm.yaml`, `seq2seq_pred_arlm.yaml`
Experiment	—	`configs/experiment/star_easy_arlm.yaml`

Narrative doc: ARLM.

Troubleshooting

Problem	What to check
`Unable to find or instantiate …`	Import the class manually: `python -c "from my_family.model_my_family import MyFamilyModel"`
Config not found	`configs/model/<family>.yaml` and `configs/model_type/<family>.yaml` exist; YAML `_target_` paths are correct
Model not discovered	The family has an entry in `xlm_models.json`; the package is installed (`pip install -e ./xlm-models`)
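When several `_target_` paths need checking, the manual import can be wrapped in a small helper. This is a sketch built on the standard library's `importlib`; `resolve_target` is a hypothetical name and it does not reproduce Hydra's full resolution logic.

```python
import importlib
from typing import Any


def resolve_target(dotted_path: str) -> Any:
    """Import the module part of a dotted path and return the named attribute."""
    module_path, _, attr = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"expected 'package.module.Name', got {dotted_path!r}")
    module = importlib.import_module(module_path)  # ModuleNotFoundError if not importable
    return getattr(module, attr)                   # AttributeError if the name is wrong
```

For example, `resolve_target("my_family.model_my_family.MyFamilyModel")` raises `ModuleNotFoundError` when the package is not installed or importable, and `AttributeError` when the module imports but the class name is misspelled.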

Train an existing model on a new dataset

Follow Adding a task or dataset to add preprocessing and dataset YAMLs under src/xlm/configs/lightning_train/datasets/, then add a datamodule config under xlm-models/<family>/configs/datamodule/ that wires the new dataset to the family's collators.