TinyGSM

The TinyGSM/TinyGSM dataset on Hugging Face provides roughly 11.8M training examples, each pairing a math word problem (`question`) with a Python solution (`code`). In xlm-core, each example is set up as a seq2seq MDM task: the prefix is `question + "\n"` and the suffix is `code`, following the field layout in PUMA's `tiny_gsm.py`.
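
The prefix/suffix layout can be illustrated with a minimal sketch. It is not the repository's preprocessing function: `split_example` and `toy_encode` are hypothetical names, and the character-level encoder stands in for whatever tokenizer the model actually uses. Only the `question` and `code` fields, the `"\n"` separator, and the two output keys are taken from this page.

```python
from typing import Callable, Dict, List


def split_example(
    example: Dict[str, str],
    encode: Callable[[str], List[int]],
) -> Dict[str, List[int]]:
    """Hypothetical helper: turn one TinyGSM row into prefix and suffix ids."""
    prefix = example["question"] + "\n"  # question plus newline separator
    suffix = example["code"]             # Python solution to be generated
    return {
        "prompt_token_ids": encode(prefix),
        "input_token_ids": encode(suffix),
    }


def toy_encode(text: str) -> List[int]:
    # Stand-in tokenizer (assumption): one id per character, its code point.
    return [ord(ch) for ch in text]


example = {
    "question": "What is 2 + 3?",
    "code": "def solution():\n    return 2 + 3",
}
out = split_example(example, toy_encode)
```

With this layout the separator is always the last prompt token, so the model conditions on the full question before producing any code tokens.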

Memmap pretokenization (`labels.bin`, `prompt_mask.bin`, and related offline paths) is not supported. Data flows only through `DatasetManager`, `job_type=prepare_data`, and iterable shards at train time.

See also: Adding a task or dataset.

Preprocessing

| Step | Detail |
| --- | --- |
| Task module | `xlm.tasks.tinygsm.tinygsm_preprocess_fn` |
| Outputs | `prompt_token_ids` (question + separator), `input_token_ids` (code) |
| On-the-fly processor | `xlm.datamodule.token_ids_to_input_ids_and_prompt_ids` |
| Val split | 5% holdout via `train_test_split` with `seed: 2025`, `size: 0.05` on the HF `train` split |
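
To make the validation holdout concrete, here is a plain-Python sketch of a seeded 5% split over row indices. It illustrates the idea only: `holdout_split` is a hypothetical helper, it does not call Hugging Face's `train_test_split`, and the indices it selects will not match the ones the real pipeline produces for the same seed.

```python
import random
from typing import List, Tuple


def holdout_split(
    n_rows: int,
    size: float = 0.05,
    seed: int = 2025,
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: deterministic shuffle, then carve off a holdout."""
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the split reproducible and
    # independent of any global random state.
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_rows * size))
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = holdout_split(1000)
```

Fixing the seed means every run, and every model trained on TinyGSM, sees the same validation rows, which keeps validation losses comparable across experiments.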

Hydra configs (`src/xlm`)

| Config | Path |
| --- | --- |
| Base dataset | `datasets/tinygsm.yaml` |
| Train | `datasets/tinygsm_train.yaml` |
| Val (loss) | `datasets/tinygsm_val.yaml` |
| Val (prediction) | `datasets/tinygsm_val_pred.yaml` (uses `tinygsm_pred_preprocess_fn`; code execution runs post hoc) |
| GSM8K test | `datasets/gsm8k_test_pred.yaml` |
| Datamodule skeleton | `datamodule/tinygsm.yaml` |

For GSM8K and code-execution evaluation, see tinygsm_gsm8k.md.

Model experiments

Training settings, prepare/train commands, and experiment YAMLs live in the per-model docs:

| Model | Experiment | Doc |
| --- | --- | --- |
| FlexMDM | `experiment=tinygsm_flexmdm` | FlexMDM — TinyGSM (debug: `debug=overfit_tinygsm_flexmdm`) |
| MLM | `experiment=tinygsm_mlm` | MLM — TinyGSM |
| ARLM | `experiment=tinygsm_arlm` | ARLM — TinyGSM |