External Language Models for XLM
XLM supports external language models that can be developed and maintained separately from the core framework. This allows researchers to keep their model code clean and self-contained, and share models without including the entire XLM codebase. The code follows a modular design with four main components that work together to provide a complete language modeling solution. You need to implement all of them in order to add a new working model.
Quick Start
1. Scaffold a New Model
Use the scaffolding script to create a complete model structure:
xlm-scaffold my_awesome_model
This creates:
- A complete Python package with skeleton implementations
- All necessary Hydra configuration files
- A registration entry for the model in xlm_models.json
2. Implement Your Model
The scaffolded files contain detailed TODOs and docstrings. Key files to implement:
- my_awesome_model/types_my_awesome_model.py - Type definitions
- my_awesome_model/model_my_awesome_model.py - Neural network architecture
- my_awesome_model/loss_my_awesome_model.py - Loss computation
- my_awesome_model/predictor_my_awesome_model.py - Inference/generation logic
- my_awesome_model/datamodule_my_awesome_model.py - Data preprocessing
- my_awesome_model/metrics_my_awesome_model.py - Metrics computation
3. Test Your Model
xlm job_type=train \
job_name=my_model_test \
experiment=star_easy_my_awesome_model \
debug=overfit
Main Components
1. LossFunction
The LossFunction computes the loss during training, validation, and optionally testing.
Key Responsibilities:
- Compute the loss between model predictions and ground truth targets.
- Return a dictionary with a "loss" key, plus any additional values you want to track.
Interface:
class LossFunction(Generic[T_in, T_out], Protocol):
    model: Any
    tokenizer: Tokenizer

    def loss_fn(self, batch: T_in, ...) -> T_out: ...

    def configure(self, pl_module: "Harness"):
        """Converts scalars to tensors so that loss_fn becomes compile-friendly.

        If you don't want to compile, you don't need to implement this.
        """
Examples: xlm-models/arlm/loss_arlm.py
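The contract above (take a batch, return a dictionary containing "loss") can be illustrated with a framework-free sketch. Nothing here is taken from XLM: ToyModel, ToyLoss, the batch keys input_ids / target_ids, and the pad_id convention are all hypothetical, and plain Python lists stand in for tensors. A real implementation would most likely operate on tensors, as in the arlm example.

```python
import math
from typing import Any, Dict, List


class ToyModel:
    """Hypothetical stand-in for a network: uniform logits over the vocabulary."""

    def __init__(self, vocab_size: int) -> None:
        self.vocab_size = vocab_size

    def __call__(self, input_ids: List[List[int]]) -> List[List[List[float]]]:
        # Shape [batch][position][vocab]; all-zero logits mean a uniform distribution.
        return [[[0.0] * self.vocab_size for _ in seq] for seq in input_ids]


def log_softmax(logits: List[float]) -> List[float]:
    # Numerically stable log-softmax for a single position.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


class ToyLoss:
    """Token-level cross-entropy that skips padded target positions."""

    def __init__(self, model: Any, tokenizer: Any = None, pad_id: int = 0) -> None:
        self.model = model
        self.tokenizer = tokenizer  # unused here; kept to mirror the interface
        self.pad_id = pad_id        # assumption: id 0 marks padding

    def loss_fn(self, batch: Dict[str, Any]) -> Dict[str, Any]:
        logits = self.model(batch["input_ids"])
        total, count = 0.0, 0
        for seq_logits, targets in zip(logits, batch["target_ids"]):
            for pos_logits, target in zip(seq_logits, targets):
                if target == self.pad_id:
                    continue
                total -= log_softmax(pos_logits)[target]
                count += 1
        # "loss" is the required key; "num_tokens" is an extra value to track.
        return {"loss": total / max(count, 1), "num_tokens": count}


batch = {"input_ids": [[1, 2, 3], [1, 2, 0]], "target_ids": [[2, 3, 1], [2, 0, 0]]}
out = ToyLoss(ToyModel(vocab_size=4)).loss_fn(batch)
```

With uniform logits over four tokens, the loss equals log(4) averaged over the four non-padded targets, which makes the sketch easy to sanity-check by hand.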
2. Predictor
The Predictor handles generating output sequences from the model.
Key Responsibilities:
- Run the model (typically in a loop) to produce a sequence of tokens.
- Convert generated token_ids to text.
Interface:
class Predictor(Generic[T_in, T_out_pred], Protocol):
    tokenizer: Tokenizer
    noise_schedule: NoiseSchedule
    model: Any

    def predict(self, batch: T_in, ...) -> T_out_pred: ...

    def to_dict(self, batch: T_in, preds: T_out_pred, ...) -> List[Dict[str, Any]]: ...
Examples: xlm-models/arlm/predictor_arlm.py
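As a rough illustration of the two responsibilities (run the model in a loop, then turn token ids into text), here is a greedy-decoding sketch in plain Python. ToyTokenizer, GreedyPredictor, next_token_scores, and the "ids" / "prompt" / "text" keys are hypothetical names chosen for this example, not XLM APIs. The noise_schedule attribute from the interface is left out because this toy left-to-right decoder has no use for one.

```python
from typing import Any, Callable, Dict, List


class ToyTokenizer:
    """Hypothetical tokenizer over a fixed vocabulary."""

    def __init__(self, vocab: List[str]) -> None:
        self.vocab = vocab
        self.eos_id = vocab.index("<eos>")

    def decode(self, ids: List[int]) -> str:
        return " ".join(self.vocab[i] for i in ids if i != self.eos_id)


class GreedyPredictor:
    """Appends the highest-scoring token until EOS or max_new_tokens is reached."""

    def __init__(
        self,
        model: Callable[[List[int]], List[float]],
        tokenizer: ToyTokenizer,
        max_new_tokens: int = 8,
    ) -> None:
        self.model = model
        self.tokenizer = tokenizer
        self.max_new_tokens = max_new_tokens

    def predict(self, batch: Dict[str, Any]) -> Dict[str, Any]:
        outputs = []
        for prompt in batch["input_ids"]:
            ids = list(prompt)
            for _ in range(self.max_new_tokens):
                scores = self.model(ids)
                next_id = max(range(len(scores)), key=scores.__getitem__)
                ids.append(next_id)
                if next_id == self.tokenizer.eos_id:
                    break
            outputs.append(ids)
        return {"ids": outputs}

    def to_dict(self, batch: Dict[str, Any], preds: Dict[str, Any]) -> List[Dict[str, Any]]:
        # One dictionary per example, with token ids converted back to text.
        return [
            {"prompt": self.tokenizer.decode(p), "text": self.tokenizer.decode(o)}
            for p, o in zip(batch["input_ids"], preds["ids"])
        ]


def next_token_scores(ids: List[int]) -> List[float]:
    # Toy "model": always prefers the token that follows the last one, cyclically.
    scores = [0.0] * 4
    scores[(ids[-1] + 1) % 4] = 1.0
    return scores


predictor = GreedyPredictor(next_token_scores, ToyTokenizer(["<eos>", "a", "b", "c"]))
batch = {"input_ids": [[1], [3]]}
results = predictor.to_dict(batch, predictor.predict(batch))
```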
3. Collator
The Collator prepares batches of data for training and inference. It handles data preprocessing, padding, and batching.
Key Responsibilities:
- Receive raw token_ids and convert them to a dict that is passed to the model as a batch.
- Handle padding and truncation.
Interface:
class Collator(Protocol):
    """For pre-training the model on language modeling."""
    def __call__(self, examples: List[Dict[str, Any]]) -> Dict[str, Any]: ...

class Seq2SeqCollator(Collator):
    """For training the model on seq2seq tasks."""
    def __call__(self, examples: List[Dict[str, Any]]) -> Dict[str, Any]: ...

class Seq2SeqCollatorPrediction(Collator):
    """For generating predictions for seq2seq tasks."""
    def __call__(self, examples: List[Dict[str, Any]]) -> Dict[str, Any]: ...
Examples: xlm-models/arlm/datamodule_arlm.py
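A minimal sketch of the padding and truncation step, assuming each example is a dict with an input_ids list. PaddingCollator, pad_id, max_length, and the attention_mask output key are illustrative choices rather than XLM's actual collator; lists are used instead of tensors to keep the example dependency-free.

```python
from typing import Any, Dict, List


class PaddingCollator:
    """Truncates each example to max_length, then right-pads to the batch maximum."""

    def __init__(self, pad_id: int, max_length: int) -> None:
        self.pad_id = pad_id
        self.max_length = max_length

    def __call__(self, examples: List[Dict[str, Any]]) -> Dict[str, Any]:
        truncated = [ex["input_ids"][: self.max_length] for ex in examples]
        width = max((len(ids) for ids in truncated), default=0)
        input_ids, attention_mask = [], []
        for ids in truncated:
            pad = width - len(ids)
            input_ids.append(ids + [self.pad_id] * pad)
            # 1 marks a real token, 0 marks padding.
            attention_mask.append([1] * len(ids) + [0] * pad)
        return {"input_ids": input_ids, "attention_mask": attention_mask}


collate = PaddingCollator(pad_id=0, max_length=4)
batch = collate([{"input_ids": [5, 6, 7, 8, 9]}, {"input_ids": [5, 6]}])
```

Padding to the longest sequence in the batch, rather than always to max_length, is one possible design; a fixed width may be preferable when the loss function is compiled.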
4. Model
The Model is the bare neural network architecture for your LM. It defines the forward pass and model parameters.
Key Responsibilities:
- Define the neural network architecture.
- Implement the forward pass.
Interface:
class Model:
    def forward(self, input_ids: Tensor, attention_mask: Optional[Tensor] = None, ...) -> Tensor: ...
Examples: xlm-models/arlm/model_arlm.py
All four components are designed to be aware of each other. They are only expected to run together for the same LM, not with components from any other LM. This is a key design choice: it lets you implement highly unusual models without having to abstract them so that their dataflow becomes compatible with other LMs.
Model Structure
Each external model follows this structure:
my_awesome_model/ # Model root directory
├── __init__.py
├── types_my_awesome_model.py
├── model_my_awesome_model.py
├── loss_my_awesome_model.py
├── predictor_my_awesome_model.py
├── datamodule_my_awesome_model.py
├── metrics_my_awesome_model.py
├── configs/ # Hydra configurations
│ ├── model/
│ ├── model_type/
│ ├── collator/
│ ├── datamodule/
│ ├── experiment/
│ └── commands.yaml # Optional: see [Custom Commands](custom-commands.md)
├── setup.py # Package installation (optional)
└── README.md # Documentation
Configuration
External models integrate with Hydra's configuration system. Use them in your experiments:
# configs/experiment/my_experiment.yaml
defaults:
- override /model: my_awesome_model
- override /model_type: my_awesome_model
- override /datamodule: star_easy_my_awesome_model
Or via command line:
xlm job_type=train model=my_awesome_model model_type=my_awesome_model
Key config locations (paths may vary for external models in xlm-models/):
- configs/lightning_train/collator/ – Collator configs
- configs/lightning_train/datamodule/ – Datamodule configs
- configs/lightning_train/model/ – Model (neural network) configs
- configs/lightning_train/model_type/ – Loss, predictor, metrics
- configs/lightning_train/experiment/ – Experiment configs
Discovery Methods
XLM discovers external models through two approaches:
1. Directory-Based Discovery
Place your model directory in one of these locations:
- Current directory (.)
- xlm-models/ directory
- Directory specified by the XLM_MODELS_PATH environment variable
Create an xlm_models.json file in the search directory:
{
"my_awesome_model": "my_awesome_model",
"another_model": "path/to/another_model"
}
The paths are relative to the directory containing xlm_models.json.
Example:
# Project structure
.
├── xlm_models.json # {"my_model": "my_model"}
├── my_model/
│ ├── my_model/ # Python package
│ └── configs/
└── ...
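The lookup described in this section can be sketched as follows. This is not XLM's discovery code: search_dirs and load_registry are hypothetical helpers, and the sketch assumes that xlm-models/ is looked up relative to the current directory and that XLM_MODELS_PATH holds a single directory, neither of which is stated above.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict, List, Mapping, Optional


def search_dirs(cwd: Path, env: Optional[Mapping[str, str]] = None) -> List[Path]:
    """Candidate directories, in the order they are listed above."""
    env = os.environ if env is None else env
    dirs = [cwd, cwd / "xlm-models"]
    extra = env.get("XLM_MODELS_PATH")  # assumption: a single directory
    if extra:
        dirs.append(Path(extra))
    return dirs


def load_registry(directory: Path) -> Dict[str, Path]:
    """Read xlm_models.json in `directory` and resolve each path relative to it."""
    registry_file = directory / "xlm_models.json"
    if not registry_file.is_file():
        return {}
    entries = json.loads(registry_file.read_text())
    return {name: (directory / rel).resolve() for name, rel in entries.items()}


# Recreate the example project layout in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "my_model").mkdir()
    (root / "xlm_models.json").write_text(json.dumps({"my_model": "my_model"}))
    found: Dict[str, Path] = {}
    for directory in search_dirs(root, env={}):
        found.update(load_registry(directory))
    resolved_ok = found == {"my_model": (root / "my_model").resolve()}
```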
2. Package-Based Discovery
Install your model as a Python package and register it via the XLM_MODELS_PACKAGES environment variable (colon-separated list of installed package names).
Package structure requirements:
Each model needs its own setup.py that packages the configs:
# setup.py
from setuptools import setup

setup(
    name="my_awesome_model",
    packages=["my_awesome_model"],
    package_data={
        "my_awesome_model": ["configs/**/*.yaml", "configs/**/*.yml"],
    },
    include_package_data=True,
)
Installation and registration:
Each model must be installed independently:
# Install first model
pip install -e ./my_awesome_model
# Install second model (separate setup.py)
pip install -e ./another_model
# Register both installed packages
export XLM_MODELS_PACKAGES="my_awesome_model:another_model"
Core models (arlm, mlm, ilm, mdlm) are automatically discovered and don't need to be added to XLM_MODELS_PACKAGES.
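Before launching a job, it can help to confirm that every name in XLM_MODELS_PACKAGES is actually importable. The helper names below (registered_packages, missing_packages) are hypothetical and only mirror the colon-separated format described above; they are not part of XLM. The standard-library json module stands in for an installed model package.

```python
import importlib.util
import os
from typing import List, Mapping, Optional


def registered_packages(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Split XLM_MODELS_PACKAGES on ':' and drop empty entries."""
    env = os.environ if env is None else env
    raw = env.get("XLM_MODELS_PACKAGES", "")
    return [name.strip() for name in raw.split(":") if name.strip()]


def missing_packages(names: List[str]) -> List[str]:
    """Return the top-level package names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "xlm_hypothetical_missing_pkg" is a made-up name that should not be installed.
env = {"XLM_MODELS_PACKAGES": "json:xlm_hypothetical_missing_pkg"}
names = registered_packages(env)
missing = missing_packages(names)
```

Calling registered_packages() with no argument reads the real environment, so the same check can be run in the shell session used to start training.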
Troubleshooting
Model Not Found
Error: No module named 'my_model'
Check:
- Model is listed in xlm_models.json (directory-based)
- Model is installed and listed in XLM_MODELS_PACKAGES (package-based)
- Directory structure is correct
- __init__.py exists in the Python package
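The last two checks can be automated with a small script. This sketch assumes the scaffolded file naming shown earlier (types_<name>.py, model_<name>.py, and so on) and takes the Python package directory as its argument; missing_package_files is a hypothetical helper, not an XLM command.

```python
import tempfile
from pathlib import Path
from typing import List

# File prefixes taken from the scaffolded layout described in this document.
COMPONENTS = ["types", "model", "loss", "predictor", "datamodule", "metrics"]


def missing_package_files(package_dir: Path) -> List[str]:
    """List expected files that are absent from the Python package directory."""
    name = package_dir.name
    expected = ["__init__.py"] + [f"{prefix}_{name}.py" for prefix in COMPONENTS]
    return [f for f in expected if not (package_dir / f).is_file()]


# Demo on a throwaway package that only has __init__.py and the model file.
with tempfile.TemporaryDirectory() as tmp:
    package_dir = Path(tmp) / "my_model"
    package_dir.mkdir()
    (package_dir / "__init__.py").touch()
    (package_dir / "model_my_model.py").touch()
    missing = missing_package_files(package_dir)
```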
Duplicate Model Names
ExternalModelConflictError: Duplicate model name: my_model
Solution: Each model must have a unique name. Check for duplicate entries in xlm_models.json files or conflicts with core XLM models.
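To see where a clash comes from, the registries can be merged by hand. The sketch below is only a guess at the rule described above: it defines its own ExternalModelConflictError as a stand-in for the real exception and treats the four core model names as reserved. merge_registries is a hypothetical function.

```python
from typing import Dict, Iterable


class ExternalModelConflictError(Exception):
    """Local stand-in for the error shown above; XLM's own class may differ."""


# Core model names listed in this document.
CORE_MODELS = {"arlm", "mlm", "ilm", "mdlm"}


def merge_registries(registries: Iterable[Dict[str, str]]) -> Dict[str, str]:
    """Merge name -> path mappings, rejecting repeated or reserved names."""
    merged: Dict[str, str] = {}
    for registry in registries:
        for name, path in registry.items():
            if name in merged or name in CORE_MODELS:
                raise ExternalModelConflictError(f"Duplicate model name: {name}")
            merged[name] = path
    return merged


merged = merge_registries(
    [{"my_model": "my_model"}, {"another_model": "path/to/another_model"}]
)

try:
    merge_registries([{"my_model": "a"}, {"my_model": "b"}])
except ExternalModelConflictError as err:
    message = str(err)
```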
Config Not Found
hydra.errors.ConfigCompositionException: Could not find 'model/my_model'
Check:
- configs/model/my_model.yaml exists
- configs/model_type/my_model.yaml exists
- Config files have valid YAML syntax
- _target_ paths point to correct classes
Hydra Errors
If you see Unable to find or instantiate abc.xyz.MyClass, try importing manually: python -c "from abc.xyz import MyClass".
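The same manual import can be wrapped in a helper that reports which half of a _target_ path is at fault, the module or the attribute. resolve_target is a hypothetical name and uses only the standard library; it does not reproduce Hydra's own resolution logic.

```python
import importlib
from typing import Any


def resolve_target(target: str) -> Any:
    """Import the object named by a dotted path such as 'package.module.ClassName'."""
    module_path, _, attr = target.rpartition(".")
    if not module_path:
        raise ValueError(f"{target!r} is not a dotted path")
    # Raises ModuleNotFoundError (or another ImportError) if the module is the problem.
    module = importlib.import_module(module_path)
    try:
        return getattr(module, attr)
    except AttributeError:
        # The module imported fine, so the class or function name is the problem.
        raise AttributeError(
            f"module {module_path!r} imported, but has no attribute {attr!r}"
        ) from None


# A standard-library class stands in for a model class here.
cls = resolve_target("collections.OrderedDict")
```

A ModuleNotFoundError usually points to an installation or path problem, while an AttributeError suggests a typo in the class name inside the config.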
Unable to implement a component
Check existing models in xlm-models/ (arlm, mlm, ilm, mdlm) for reference.
Examples
See the core models in the xlm-models package for complete examples:
- arlm - Auto-regressive language model
- mlm - Masked language model
- ilm - Infilling language model
- mdlm - Masked diffusion language model
We maintain a list of external models in the external-model-list.