Models
Each xlm-models/<family>/ package implements one language-modeling paradigm against a small set of shared xlm.* abstractions. The four families documented here — ARLM, ILM, MDLM, MLM — share the same component layout (model, loss, predictor, collator, metrics, types) but differ in what they predict and how they decode.
Shared abstractions
Every family wires the same five interfaces from xlm-core:
| Abstraction | Module | Key methods | Family-specific subclass |
|---|---|---|---|
Model |
src/xlm/model.py | get_named_params_for_weight_decay, get_named_params_for_no_weight_decay |
RotaryTransformer<X>Model, MDLMModel |
LossFunction[T_in, T_out] |
src/xlm/harness.py | configure, loss_fn, __call__ |
<X>Loss |
Predictor[T_in, T_out_pred] |
src/xlm/harness.py | predict, to_dict, generate |
<X>Predictor |
Collator |
src/xlm/datamodule.py | __call__(examples) -> Batch |
Default<X>Collator, seq2seq variants |
MetricWrapper updates |
src/xlm/metrics.py | *_update_fn(batch, loss_dict, tokenizer=None) |
<X>/metrics_<X>.py |
What differs across families is the forward signature and the batch contract:
flowchart LR
MLM["MLM<br/>forward(x_t, attn_mask, positions, block_mask?)"] --> Logits["logits (B, L, V)"]
MDLM["MDLM<br/>forward(x_t, total_noise, attn_mask, positions)"] --> Logits
ARLM["ARLM<br/>forward(x_t, causal_attn_mask_3D, positions, token_type_ids?)"] --> Logits
ILM["ILM<br/>forward(x_t, attn_mask, positions, token_type_ids?, cls_position?)"] --> ILMOut["(vocab_logits, length_logits)"]
Family comparison
| ARLM | ILM | MDLM | MLM | |
|---|---|---|---|---|
| Paradigm | Left-to-right causal LM | Insertion LM | Masked discrete diffusion (continuous-time absorbing) | Masked LM (BERT-style) |
| Backbone | RotaryTransformerLayer (RoPE) |
RotaryTransformerLayer or GPT-2 backbone |
DDiTLayer (AdaLN + RoPE), time conditioning via TimestepEmbedder |
RotaryTransformerLayer (RoPE) |
| Conditioning signal | Causal 3-D mask | Optional token_type_ids + cls_position |
Continuous-time t -> total_noise (passed as AdaLN cond) |
None beyond attention_mask (+ optional block_mask) |
| Forward output | logits (B, L, V) |
(vocab_logits, length_logits) — length_logits is None for the base model |
logits (B, L, V) |
logits (B, L, V) |
| Loss type | Cross-entropy with ignore_index=-100 |
Masked CE over dropped positions, optional length CE / binary CE head | Weighted CE (noise_rate / expm1(total_noise)) on [MASK] positions |
Cross-entropy on [MASK] positions only (default) |
| Default collator | DefaultARLMCollator |
DefaultILMCollator (token-drop noising) |
DefaultMDLMCollator (needs real noise schedule) |
DefaultMLMCollator |
| Seq2seq collators | ARLMSeq2SeqCollator, ARLMSeq2SeqPredCollator |
ILMSeq2SeqCollator, ILMSeq2SeqPredCollator |
MDLMSeq2SeqTrainCollator, MDLMSeq2SeqPredCollator |
MLMSeq2SeqCollator, MLMSeq2SeqTrainCollator, MLMSeq2SeqPredCollator, _MLMSeq2SeqPredCollator, MLMInfillWithExactTargetPredCollator, DefaultInfillMLMCollator, PackedMLMCollator |
| Decoding loop | Greedy / top-k / top-p sampled tokens, one per step, up to max_length |
Insertion at a chosen position per step; optional length-head stopping | max_steps unmasking steps with diffusion sampling and dt time decrement |
max_steps unmasking steps; uniform or confidence-based position selection (prob_diff / top_prob) |
| Source package | xlm-models/arlm/ | xlm-models/ilm/ | xlm-models/mdlm/ | xlm-models/mlm/ |
| Per-family doc | arlm.md | ilm.md | mdlm.md | mlm.md |
Page layout
Every per-family page follows the same template:
- Overview — paradigm, paper citation
- Files at a glance — public classes/functions per module
- Architecture — forward signature and inputs/outputs
- Batch contract — required vs optional fields with shapes
- Loss — math and masking rules
- Collators — table of all collator classes
- Predictor — decoding loop, stopping rule, output dict
- Metrics — update_fn callables
- Configs / experiments — pointer to
xlm-models/<family>/configs/ - Testing — pointer to
tests/models/<family>/ - API reference — mkdocstrings links
Read external-models.md for how to scaffold a new family that conforms to this contract.