Skip to content

Adding a new task or dataset

This guide walks through wiring a Hugging Face dataset (or similar source) into xlm-core: Python preprocessing in xlm.tasks, Hydra configs under configs/lightning_train/, and alignment between dataloader names and metrics.

Read these guides first if you have not worked with the pipeline before:

  • Data Pipeline — how TextDataModule, DatasetManager, and dataloader names connect to the Harness.
  • MetricsMetricWrapper, reported_metrics, and prediction/post-hoc behavior.
  • Evaluate a model — loading checkpoints for evaluation jobs.

Cursor users may also invoke the xlm-create-task Agent Skill (~/.cursor/skills/xlm-create-task/) for a condensed checklist and copied excerpts.


1. Pick a wiring pattern

Pattern Use when Typical config
Standard LM / map dataset Hub dataset; preprocess once and cache via DatasetManager datasets/default_text.yaml defaults → DatasetManager
Streaming / sequence packing Huge corpus; packed blocks without per-example padding iterable_dataset_shards, on_the_fly_group_processor: xlm.datamodule.pack_sequences (see Data Pipeline)
Eval-only / benchmarks Smaller splits; avoid manual save_to_disk cache semantics datasets/default_eval.yamlEvalDatasetManager
Heavy optional dependencies Chemistry stacks, etc. Lazy facade package that imports implementation on demand (safe_molgen/__init__.py); document pip install "xlm-core[...]"

2. Implement task code (src/xlm/tasks/<your_task>/__init__.py)

Put each task in its own directory under tasks/ with an __init__.py that defines preprocess functions, filters, evaluators, etc.

Hydra resolves callables by dotted path (for example xlm.tasks.owt.preprocess_fn). You do not need to list every symbol in the parent tasks/__init__.py—that file is only a package marker, not a barrel re-export of all tasks.

Preprocess function

Configured as preprocess_function on the dataset manager. At runtime it is loaded with get_function and passed to datasets.Dataset.map. The effective signature is:

(example: dict, tokenizer, **preprocess_function_kwargs) -> dict

Return only the columns your downstream collator and model need (drop raw text via columns_to_remove, or restrict with columns_to_keep on eval configs).

Minimal LM-style preprocessing — encode raw text to token IDs; combine with on_the_fly_processor: xlm.datamodule.token_ids_to_input_ids:

Structured fields + different processors per split — build task-specific columns (for example input_token_ids, prompt_token_ids), then choose on_the_fly_processor per dataset YAML:

Optional row filters — set filter_fn to a dotted path and filter_suffix (required when filter_fn is set). Example filter implementations live beside preprocessing in sudoku_extreme/__init__.py.

Benchmark-style eval + post-hoc scoring — preprocessing builds prompt/target columns and carries gold answers through the batch for logging; a post_hoc_evaluator class scores saved predictions:

Large optional imports — keep the default import surface light and defer heavy imports (see safe_molgen/__init__.py delegating to _safe_molgen_impl.py).

For larger tasks you may instead split logic into additional modules inside tasks/<your_task>/ and re-export public symbols from __init__.py; Hydra paths stay xlm.tasks.<your_task>.<symbol>.


3. Add dataset YAML (src/xlm/configs/lightning_train/datasets/)

Create a new file or compose existing ones with Hydra defaults:.

  1. Training / general DatasetManager — start from default_text.yaml:

  2. full_name: Hub path in the form namespace/dataset_name/split (last segment is the split passed to load_dataset).

  3. full_name_debug: optional smaller split used when DEBUG_OVERFIT is enabled.
  4. preprocess_function, preprocess_function_kwargs.
  5. on_the_fly_processor / on_the_fly_group_processor (mutually exclusive with group processors per DatasetManager).
  6. columns_to_remove or columns_to_keep.
  7. stages: which Lightning stages use this manager (fit, validate, test, predict).
  8. collator: ??? — resolved by experiment/model package via /collator@... overrides.

  9. Eval-only EvalDatasetManager — start from default_eval.yaml:

  10. No manual disk cache path; HF map cache optional via load_from_cache_file.

  11. Typical stages: [validate, test].

Reuse patterns from math500_test.yaml for columns_to_keep and preprocess kwargs.


4. Compose a datamodule config

Datamodule skeleton: datamodule/default.yaml.

Wire each (split, dataloader_name) to your dataset group using Hydra defaults, for example:

defaults:
  - default
  - /datasets@datamodule.dataset_managers.train.lm: owt_train
  - /datasets@datamodule.dataset_managers.val.lm: owt_val

Full example: datamodule/owt.yaml.

If you introduce a new logical evaluation stream (for example math500_prediction), pick a new dataloader name and use it consistently across val/test and metrics:

Set tags.dataset in the experiment/datamodule root when you want W&B or logging tags (same file).


5. Wire collators

Dataset YAMLs leave collator: ???. Resolve it from the model/experiment package:

- /collator@datamodule.dataset_managers.val.math500_prediction.collator: math500_pred_dream

The collator must produce batch tensors your model and metric update functions expect.


6. Wire metrics and optional post-hoc evaluation

Under reported_metrics.<train|val|test>.<dataloader_name>, each nested dict entry is a configured metric (typically MetricWrapper). The <dataloader_name> keys must match the keys under datamodule.dataset_managers for that stage — see Metrics.

Hydra composition often looks like:

defaults:
  - /metrics@reported_metrics.val.lm.accumulated_loss: accumulated_loss

For dataloaders whose name contains prediction, epoch-end hooks run extra aggregates (generative perplexity and configured post-hoc evaluators); see Metrics.

Post-hoc evaluator _target_ example:

post_hoc_evaluator:
  _target_: xlm.tasks.math500.Math500Eval

Experiment reference: xlm-models/dream/configs/experiment/math500_dream_eval.yaml.


7. Dependencies and extras

If your task needs packages beyond core xlm-core, document them in the task module docstring and add or reference an optional extra in project requirements (for example "xlm-core[safe]", files under requirements/).


8. Verify

  1. Run job_type=prepare_data (rank 0 downloads and preprocesses/caches where applicable).
  2. Smoke train or eval with the smallest experiment config you can.
  3. Use DEBUG_OVERFIT and full_name_debug to force train-shaped data into val/test during debugging (DatasetManager).
  4. Add or extend integration coverage under tests/integration/datamodule/ when behavior is easy to regress.

Quick reference table

Artifact Location
Task preprocessing / filters / evaluators src/xlm/tasks/<task>/ (one directory per task)
Dataset group YAML src/xlm/configs/lightning_train/datasets/
Datamodule composition src/xlm/configs/lightning_train/datamodule/ (+ model repos under xlm-models/)
Metric snippets src/xlm/configs/lightning_train/metrics/
Pipeline implementation src/xlm/datamodule.py