Adding a new task or dataset

Adding a task has two phases:

| Phase | What | Where (maintained tasks) |
|---|---|---|
| A | Preprocess function + dataset YAMLs | src/xlm/tasks/<task>/, src/xlm/configs/lightning_train/datasets/ |
| B | Datamodule + experiment configs | xlm-models/<family>/configs/ (see Running your model on your task) |

For tasks shipped with xlm-core, preprocessing lives in src/xlm/tasks/. For external tasks, Hydra resolves any importable dotted path (e.g. my_package.my_task.preprocess_fn), so your code does not need to live inside this repo.

Reference example: STAR-easy + ARLM — task code, base YAML, split YAMLs star_easy_{train,val,val_pred,test,test_pred}.yaml, datamodule, experiment.

Prerequisites: Data Pipeline, Metrics, Evaluate.


1. Pick a wiring pattern

| Pattern | When | Start from |
|---|---|---|
| Standard LM / map dataset | Hub dataset, preprocess + cache | default_text.yaml |
| Streaming / sequence packing | Huge corpus, packed blocks | on_the_fly_group_processor: xlm.datamodule.pack_sequences |
| Eval-only / benchmarks | Small splits, no disk cache | default_eval.yaml |
| Heavy optional deps | Chemistry stacks, etc. | Lazy facade (safe_molgen), pip install "xlm-core[...]" |
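
For readers unfamiliar with sequence packing, the idea behind the streaming pattern can be sketched as follows. This is a generic illustration of the technique, not the implementation of xlm.datamodule.pack_sequences; the function name, the eos_id separator, and the choice to drop the trailing partial block are all assumptions made for the example.

```python
from typing import Iterable, Iterator, List


def pack_into_blocks(
    sequences: Iterable[List[int]],
    block_size: int,
    eos_id: int,
) -> Iterator[List[int]]:
    """Yield fixed-length blocks from a stream of token-ID sequences.

    Hypothetical helper: sequences are concatenated with an EOS separator
    and cut into blocks of exactly `block_size` tokens.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    buffer: List[int] = []
    for seq in sequences:
        buffer.extend(seq)
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # Trailing tokens that do not fill a block are dropped in this sketch.


blocks = list(pack_into_blocks([[1, 2, 3], [4, 5], [6]], block_size=4, eos_id=0))
```

Because the function is a generator, it never holds more than one partial block in memory, which is the property that makes packing suitable for very large corpora.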

2. Task code

Maintained tasks: one directory per task under src/xlm/tasks/, each with an __init__.py.

External tasks: use the same layout in your own package and set preprocess_function: my_package.my_task.preprocess_fn in your dataset YAML.

Hydra resolves callables by dotted path; no need to re-export from a parent __init__.py.
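
To make the dotted-path idea concrete, here is a minimal sketch of what resolving such a path amounts to. It uses only importlib and is an approximation for illustration; Hydra's own resolver handles more cases, and resolve_dotted_path is a hypothetical name, not part of Hydra or xlm-core.

```python
import importlib
from typing import Any


def resolve_dotted_path(path: str) -> Any:
    """Import `module.attr` and return the attribute (hypothetical helper)."""
    module_name, _, attr = path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.attr', got {path!r}")
    # Only the target module has to be importable; parent packages do not
    # need to re-export the attribute.
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# A stdlib path stands in for my_package.my_task.preprocess_fn.
dumps = resolve_dotted_path("json.dumps")
```

The practical consequence is the one stated above: anything on the Python path can be referenced from a dataset YAML, as long as the full module path is spelled out.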

Preprocess function

Signature: (example: dict, tokenizer, **kwargs) -> dict. Set it as preprocess_function on the dataset manager; it is passed to datasets.Dataset.map at runtime.
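
A minimal sketch of a function with this signature is shown below. The toy tokenizer, the text_column keyword, and the input_ids output key are assumptions made so the example runs on its own; check an existing task such as owt for the column names the collators actually expect.

```python
from typing import Any, Dict, List


class WhitespaceTokenizer:
    """Toy stand-in so the sketch runs without downloading a real tokenizer."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def encode(self, text: str) -> List[int]:
        # Assign the next free ID to each unseen whitespace-separated token.
        return [self.vocab.setdefault(tok, len(self.vocab)) for tok in text.split()]


def preprocess_fn(example: Dict[str, Any], tokenizer: Any, **kwargs: Any) -> Dict[str, Any]:
    """Turn one raw example into token IDs (hypothetical minimal LM task)."""
    text_column = kwargs.get("text_column", "text")  # hypothetical kwarg
    return {"input_ids": tokenizer.encode(example[text_column])}


tokenizer = WhitespaceTokenizer()
row = preprocess_fn({"text": "a b a"}, tokenizer)
```

When a function like this is applied through datasets.Dataset.map, the returned dict is merged into the example, so the original text column is still present unless it is dropped separately.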

Drop unneeded columns via columns_to_remove or columns_to_keep.

By example:

| Pattern | Code | Config |
|---|---|---|
| Minimal LM (text → token IDs) | owt | owt_train.yaml |
| Structured seq2seq (STAR) | star | star_easy_train.yaml |
| Row filters | sudoku_extreme | set filter_fn + filter_suffix |
| Post-hoc eval (Math500) | math500 | math500_test.yaml |
| Code-execution eval (GSM8K) | gsm8k | TinyGSM runbook |
| Heavy optional imports | safe_molgen_impl | defer imports |
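
As an illustration of the row-filter pattern, a filter can be a plain predicate over one example. The signature below (one example in, a bool out, in the style of datasets.Dataset.filter) is an assumption, as are the function name and the max_tokens parameter; consult sudoku_extreme for the signature that filter_fn actually receives.

```python
from typing import Any, Dict, List


def keep_short_examples(example: Dict[str, Any], max_tokens: int = 128) -> bool:
    """Return True for rows to keep (hypothetical filter_fn)."""
    return len(example["input_ids"]) <= max_tokens


rows: List[Dict[str, Any]] = [
    {"input_ids": [1, 2]},
    {"input_ids": [1, 2, 3, 4]},
    {"input_ids": []},
]
kept = [r for r in rows if keep_short_examples(r, max_tokens=3)]
```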

3. Dataset YAMLs (src/xlm/configs/lightning_train/datasets/)

Compose from existing base configs using Hydra's defaults: list.

For training (default_text.yaml):

  • full_name — Hub path: namespace/dataset_name/split
  • full_name_debug — smaller split for DEBUG_OVERFIT
  • preprocess_function, preprocess_function_kwargs
  • on_the_fly_processor / on_the_fly_group_processor (mutually exclusive)
  • columns_to_remove or columns_to_keep
  • stages — which Lightning stages use this manager (fit, validate, test, predict)
  • collator: ??? — resolved by the model datamodule

For eval-only (default_eval.yaml): no disk cache, typical stages: [validate, test].


4. Wire to a model

See Running your model on your task. In short: extend a family skeleton, swap /datasets@… pointers, add an experiment config.

# xlm-models/arlm/configs/datamodule/star_easy_arlm.yaml
defaults:
  - star_arlm                                        # inherits collators
  - /datasets@datamodule.dataset_managers.train.lm: star_easy_train
  - /datasets@datamodule.dataset_managers.val.lm: star_easy_val
  # … remaining splits
tags:
  dataset: star_easy

5. Metrics and post-hoc evaluation

reported_metrics.<stage>.<dataloader_name> keys must match datamodule.dataset_managers keys. See Metrics.
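
A quick way to catch a mismatch before launching a run is to compare the two key trees directly. The sketch below works on plain nested dicts and assumes both trees use the same names at the first level; the helper name and the example keys are hypothetical, and the real configs may organise stages differently.

```python
from typing import List, Mapping


def unmatched_metric_keys(
    reported_metrics: Mapping[str, Mapping[str, object]],
    dataset_managers: Mapping[str, Mapping[str, object]],
) -> List[str]:
    """List `<stage>.<dataloader_name>` metric keys with no matching manager."""
    missing: List[str] = []
    for stage, per_loader in reported_metrics.items():
        known = dataset_managers.get(stage, {})
        for name in per_loader:
            if name not in known:
                missing.append(f"{stage}.{name}")
    return sorted(missing)


# Hypothetical config fragments, already loaded into plain dicts.
reported = {"val": {"lm": [], "prediction": []}, "test": {"lm": []}}
managers = {"train": {"lm": {}}, "val": {"lm": {}}}
problems = unmatched_metric_keys(reported, managers)
```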

Dataloaders named *prediction* trigger generative-perplexity and post-hoc evaluator hooks at epoch end.

post_hoc_evaluator:
  _target_: xlm.tasks.math500.Math500Eval

See Post-hoc evaluation for details.


6. Dependencies

If your task needs extra packages, document them in the module docstring and add an optional extra (e.g. "xlm-core[safe]") under requirements/.
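
One common way to keep such dependencies optional is to defer the import until the task is actually used and to fail with an actionable message. The helper below is a generic sketch of that pattern, not the safe_molgen facade itself; require_optional is a hypothetical name, and the stdlib json module stands in for a heavy package.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency on demand (hypothetical helper)."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        # Point the user at the extra that provides the missing package.
        raise ImportError(
            f"{module_name!r} is required for this task; "
            f'install it with: pip install "xlm-core[{extra}]"'
        ) from err


def encode_record(record: dict) -> str:
    # The import happens at call time, so importing this module stays cheap.
    json = require_optional("json", extra="safe")
    return json.dumps(record)
```

With this layout, importing the task module never fails on a machine without the extra installed; only calling into the task does.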


7. Verify

xlm job_type=prepare_data job_name=prepare_star_easy_arlm experiment=star_easy_arlm debug=overfit
xlm job_type=train job_name=star_easy_arlm_debug experiment=star_easy_arlm debug=overfit

Both commands should exit with status 0. Stop the training run once you see the loss decreasing and validation batches loading.

| Problem | Fix |
|---|---|
| cannot import name 'no_init_weights' | Pin transformers<5 |
| collator: ??? unresolved | Wire collators in model datamodule, not dataset YAMLs |
| Duplicate model warnings | Harmless when xlm-models/ is both on disk and editable-installed |

Quick reference

| Artifact | Location |
|---|---|
| Task code | src/xlm/tasks/<task>/ (maintained) or your own package |
| Dataset YAMLs | src/xlm/configs/lightning_train/datasets/ |
| Datamodule composition | src/xlm/configs/lightning_train/datamodule/ + xlm-models/ |
| Metric snippets | src/xlm/configs/lightning_train/metrics/ |
| Pipeline implementation | src/xlm/datamodule.py |

More examples: ARLM, TinyGSM (FlexMDM, MLM, ARLM).