Adding a new task or dataset
Adding a task has two phases:
| Phase | What | Where (maintained tasks) |
|---|---|---|
| A | Preprocess function + dataset YAMLs | src/xlm/tasks/<task>/, src/xlm/configs/lightning_train/datasets/ |
| B | Datamodule + experiment configs | xlm-models/<family>/configs/ (see Running your model on your task) |
For tasks shipped with xlm-core, preprocessing lives in src/xlm/tasks/. For external tasks, Hydra resolves any importable dotted path (e.g. my_package.my_task.preprocess_fn) — your code does not need to live inside this repo.
Reference example: STAR-easy + ARLM — task code, base YAML, split YAMLs star_easy_{train,val,val_pred,test,test_pred}.yaml, datamodule, experiment.
Prerequisites: Data Pipeline, Metrics, Evaluate.
1. Pick a wiring pattern
| Pattern | When | Start from |
|---|---|---|
| Standard LM / map dataset | Hub dataset, preprocess + cache | default_text.yaml |
| Streaming / sequence packing | Huge corpus, packed blocks | on_the_fly_group_processor: xlm.datamodule.pack_sequences |
| Eval-only / benchmarks | Small splits, no disk cache | default_eval.yaml |
| Heavy optional deps | Chemistry stacks, etc. | Lazy facade (safe_molgen), pip install "xlm-core[...]" |
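To make the "sequence packing" row concrete, here is a minimal sketch of the general technique: concatenate token sequences, separate them with an end-of-sequence ID, and cut the stream into fixed-size blocks. The helper name `pack_into_blocks`, its parameters, and the choice to drop the trailing partial block are assumptions made for illustration. This is not the code behind `xlm.datamodule.pack_sequences`, which may behave differently.

```python
from typing import Iterable, Iterator, List


def pack_into_blocks(
    sequences: Iterable[List[int]],
    block_size: int,
    eos_id: int,
) -> Iterator[List[int]]:
    """Concatenate token sequences (each followed by eos_id) and yield fixed-size blocks.

    Hypothetical helper for illustration only.
    """
    buffer: List[int] = []
    for seq in sequences:
        buffer.extend(seq)
        buffer.append(eos_id)
        # Emit as many full blocks as the buffer currently holds
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # Tokens left in the buffer that do not fill a block are dropped in this sketch


blocks = list(
    pack_into_blocks([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4, eos_id=0)
)
print(blocks)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Because the function is a generator, it works on a streamed corpus without holding more than one block plus one sequence in memory.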
2. Task code
Maintained tasks: one directory per task under src/xlm/tasks/, each with an __init__.py.
External tasks: same pattern in your own package — set preprocess_function: my_package.my_task.preprocess_fn in your dataset YAML.
Hydra resolves callables by dotted path; no need to re-export from a parent __init__.py.
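As a rough illustration of what resolving a callable by dotted path involves, the sketch below splits the path into a module and an attribute and imports it with the standard library. The helper `resolve_dotted_path` is hypothetical. The resolver that Hydra and xlm-core actually use may handle more cases, such as nested attributes.

```python
import importlib
from typing import Any, Callable


def resolve_dotted_path(path: str) -> Callable[..., Any]:
    """Import 'package.module.attr' and return the attribute (hypothetical helper)."""
    module_name, _, attr_name = path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected 'module.attr', got {path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Any importable path works; a standard-library function stands in for
# something like my_package.my_task.preprocess_fn here.
fn = resolve_dotted_path("json.dumps")
print(fn({"task": "star_easy"}))  # {"task": "star_easy"}
```

The practical consequence is that the package only has to be importable in the environment that runs the job, for example after installing it with pip.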
Preprocess function
Signature: (example: dict, tokenizer, **kwargs) -> dict. Set it as preprocess_function on the dataset manager; it is passed to datasets.Dataset.map at runtime.
Drop unneeded columns via columns_to_remove or columns_to_keep.
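A minimal sketch of a function with this signature is shown below. The `ToyTokenizer` class, the `text_column` keyword argument, and the `input_ids` output key are assumptions chosen so the example runs on its own. A real task would receive the project's tokenizer and should follow the column names its collator expects.

```python
from typing import Any, Dict, List


class ToyTokenizer:
    """Stand-in tokenizer for this sketch only."""

    def encode(self, text: str) -> List[int]:
        # Map each whitespace-separated token to a fake integer ID (its length)
        return [len(tok) for tok in text.split()]


def preprocess_fn(
    example: Dict[str, Any], tokenizer: Any, **kwargs: Any
) -> Dict[str, Any]:
    """Turn one raw example into token IDs: (example, tokenizer, **kwargs) -> dict."""
    text_column = kwargs.get("text_column", "text")  # hypothetical kwarg
    return {"input_ids": tokenizer.encode(example[text_column])}


row = preprocess_fn({"text": "hello packed world"}, ToyTokenizer())
print(row)  # {'input_ids': [5, 6, 5]}
```

With `datasets.Dataset.map`, the returned keys are added to the row and the original columns are kept unless they are removed, which is why the raw text column is usually listed in columns_to_remove.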
By example:
| Pattern | Code | Config |
|---|---|---|
| Minimal LM (text → token IDs) | owt | owt_train.yaml |
| Structured seq2seq (STAR) | star | star_easy_train.yaml |
| Row filters | sudoku_extreme | set filter_fn + filter_suffix |
| Post-hoc eval (Math500) | math500 | math500_test.yaml |
| Code-execution eval (GSM8K) | gsm8k | TinyGSM runbook |
| Heavy optional imports | safe_molgen → _impl | defer imports |
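For the "Row filters" row above, a filter is sketched here as a per-row predicate that returns True for rows to keep, in the style of `datasets.Dataset.filter`. This shape is an assumption. The exact signature xlm-core expects for filter_fn, and how filter_suffix is used, should be checked against the sudoku_extreme task. The function name, the `max_len` parameter, and the `input_ids` column are hypothetical.

```python
from typing import Any, Dict


def keep_short_examples(example: Dict[str, Any], max_len: int = 8) -> bool:
    """Return True for rows to keep (hypothetical filter over an 'input_ids' column)."""
    return len(example["input_ids"]) <= max_len


rows = [{"input_ids": [1, 2, 3]}, {"input_ids": list(range(20))}]
kept = [r for r in rows if keep_short_examples(r)]
print(len(kept))  # 1
```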
3. Dataset YAMLs (src/xlm/configs/lightning_train/datasets/)
Compose from existing base configs using the Hydra defaults list.
For training (default_text.yaml):
- full_name: Hub path (namespace/dataset_name/split)
- full_name_debug: smaller split for DEBUG_OVERFIT
- preprocess_function, preprocess_function_kwargs
- on_the_fly_processor / on_the_fly_group_processor (mutually exclusive)
- columns_to_remove or columns_to_keep
- stages: which Lightning stages use this manager (fit, validate, test, predict)
- collator: ??? (resolved by the model datamodule)
For eval-only (default_eval.yaml): no disk cache, typical stages: [validate, test].
4. Wire to a model
See Running your model on your task. In short: extend a family skeleton, swap /datasets@… pointers, add an experiment config.
# xlm-models/arlm/configs/datamodule/star_easy_arlm.yaml
defaults:
  - star_arlm  # inherits collators
  - /datasets@datamodule.dataset_managers.train.lm: star_easy_train
  - /datasets@datamodule.dataset_managers.val.lm: star_easy_val
  # … remaining splits
tags:
  dataset: star_easy
5. Metrics and post-hoc evaluation
reported_metrics.<stage>.<dataloader_name> keys must match datamodule.dataset_managers keys. See Metrics.
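The key-matching rule can be checked mechanically before launching a run. The sketch below works on plain nested dictionaries and assumes both mappings use the same stage keys (for example `train` and `val`). The helper `find_unmatched_metric_keys` and the sample values are hypothetical and not part of xlm-core.

```python
from typing import Dict, List


def find_unmatched_metric_keys(
    reported_metrics: Dict[str, Dict[str, object]],
    dataset_managers: Dict[str, Dict[str, object]],
) -> List[str]:
    """List '<stage>.<dataloader_name>' metric keys with no matching dataset manager."""
    missing = []
    for stage, loaders in reported_metrics.items():
        for name in loaders:
            if name not in dataset_managers.get(stage, {}):
                missing.append(f"{stage}.{name}")
    return missing


# Illustrative values only
dataset_managers = {"train": {"lm": {}}, "val": {"lm": {}}}
reported_metrics = {
    "train": {"lm": ["loss"]},
    "val": {"lm": ["loss"], "prediction": ["accuracy"]},
}
unmatched = find_unmatched_metric_keys(reported_metrics, dataset_managers)
print(unmatched)  # ['val.prediction']
```

An empty result means every reported metric points at a dataloader that exists in the datamodule.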
Dataloaders named *prediction* trigger generative-perplexity and post-hoc evaluator hooks at epoch end.
post_hoc_evaluator:
  _target_: xlm.tasks.math500.Math500Eval
See Post-hoc evaluation for details.
6. Dependencies
If your task needs extra packages, document them in the module docstring and add an optional extra (e.g. "xlm-core[safe]") under requirements/.
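One way to keep a heavy dependency optional is to defer its import until the task is actually used and to fail with an install hint. The sketch below shows that pattern with the standard library. The helper `require_optional` and the module name `hypothetical_chem_toolkit` are made up for illustration, and the safe_molgen facade in the repository may be organised differently.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency on first use, with an install hint on failure."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"{module_name!r} is required for this task; "
            f'install it with: pip install "xlm-core[{extra}]"'
        ) from err


# A standard-library module stands in for an installed optional dependency
json_mod = require_optional("json", extra="safe")

# A missing package produces an actionable error instead of a bare ImportError
try:
    require_optional("hypothetical_chem_toolkit", extra="safe")
except ImportError as err:
    message = str(err)
    print(message)
```

Calling the helper inside the preprocess function, rather than at module import time, means importing the task package does not fail for users who never run that task.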
7. Verify
xlm job_type=prepare_data job_name=prepare_star_easy_arlm experiment=star_easy_arlm debug=overfit
xlm job_type=train job_name=star_easy_arlm_debug experiment=star_easy_arlm debug=overfit
Both commands should exit with status 0. Stop the training run once the loss is decreasing and validation batches are loading.
| Problem | Fix |
|---|---|
| cannot import name 'no_init_weights' | Pin transformers<5 |
| collator: ??? unresolved | Wire collators in model datamodule, not dataset YAMLs |
| Duplicate model warnings | Harmless when xlm-models/ is both on disk and editable-installed |
Quick reference
| Artifact | Location |
|---|---|
| Task code | src/xlm/tasks/<task>/ (maintained) or your own package |
| Dataset YAMLs | src/xlm/configs/lightning_train/datasets/ |
| Datamodule composition | src/xlm/configs/lightning_train/datamodule/ + xlm-models/ |
| Metric snippets | src/xlm/configs/lightning_train/metrics/ |
| Pipeline implementation | src/xlm/datamodule.py |