TinyGSM

The TinyGSM/TinyGSM dataset on Hugging Face provides roughly 11.8M training examples, each pairing a math word problem (question) with a Python solution (code). In xlm-core each example is wired as a seq2seq MDM task: the prefix is question + "\n" and the suffix is code, following the field layout in PUMA tiny_gsm.py.
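The prefix/suffix layout described above can be sketched in a few lines of Python. This is an illustration only: `Seq2SeqExample` and `to_seq2seq` are hypothetical names, not part of xlm-core, and the sample row is made up to match the `question`/`code` field names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seq2SeqExample:
    """Hypothetical container for illustration; not an xlm-core class."""
    prefix: str  # conditioning text (the prompt)
    suffix: str  # text the model is trained to generate


def to_seq2seq(row: dict) -> Seq2SeqExample:
    # The prefix is the question followed by a single newline separator;
    # the suffix is the Python solution, left unchanged.
    return Seq2SeqExample(prefix=row["question"] + "\n", suffix=row["code"])


# Made-up row using the dataset's two field names.
row = {
    "question": "Tom has 3 apples and buys 4 more. How many apples does he have?",
    "code": "def solution():\n    return 3 + 4",
}
example = to_seq2seq(row)
print(repr(example.prefix))
print(example.suffix)
```

Keeping the separator inside the prefix means the suffix starts exactly at the first character of the solution code.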

Memmap pretokenization (labels.bin, prompt_mask.bin, and related offline paths) is not supported. Data flows only through DatasetManager, using job_type=prepare_data for preparation and iterable shards at train time.

See also: Adding a task or dataset.

Preprocessing

| Step | Detail |
| --- | --- |
| Task module | xlm.tasks.tinygsm.tinygsm_preprocess_fn |
| Outputs | prompt_token_ids (question + separator), input_token_ids (code) |
| On-the-fly processor | xlm.datamodule.token_ids_to_input_ids_and_prompt_ids |
| Val split | 5% holdout via train_test_split with seed: 2025, size: 0.05 on the HF train split |
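As a rough sketch of the two preprocessing ideas listed here, the block below builds the two output fields with a toy tokenizer and carves out a seeded 5% holdout. Everything in it is an assumption for illustration: `toy_encode`, `toy_preprocess`, and `holdout_split` are hypothetical helpers, the character-level tokenizer stands in for whatever tokenizer the task actually uses, and the split is a plain standard-library analogue of a seeded holdout, so it will not select the same rows as the Hugging Face `train_test_split` call.

```python
import random


def toy_encode(text: str, vocab: dict) -> list:
    # Hypothetical character-level tokenizer: each new character gets the
    # next free integer id. A real run would use the model's tokenizer.
    return [vocab.setdefault(ch, len(vocab)) for ch in text]


def toy_preprocess(row: dict, vocab: dict) -> dict:
    # Mirrors the documented output fields: the prompt is the question plus
    # the newline separator, the input is the solution code.
    return {
        "prompt_token_ids": toy_encode(row["question"] + "\n", vocab),
        "input_token_ids": toy_encode(row["code"], vocab),
    }


def holdout_split(n_rows: int, size: float = 0.05, seed: int = 2025):
    # Seeded shuffle of row indices, then the first `size` fraction becomes
    # the validation set. Same seed and n_rows always give the same split.
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_rows * size))
    return sorted(indices[n_val:]), sorted(indices[:n_val])


vocab = {}
features = toy_preprocess({"question": "1+1?", "code": "print(1+1)"}, vocab)
print(features)

train_idx, val_idx = holdout_split(200)
print(len(train_idx), len(val_idx))
```

Fixing the seed is what makes the validation set reproducible across runs, which matters when validation loss is compared between experiments.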

Hydra configs (src/xlm)

| Config | Path |
| --- | --- |
| Base dataset | datasets/tinygsm.yaml |
| Train | datasets/tinygsm_train.yaml |
| Val (loss) | datasets/tinygsm_val.yaml |
| Val (prediction) | datasets/tinygsm_val_pred.yaml (tinygsm_pred_preprocess_fn, code-exec post-hoc) |
| GSM8K test | datasets/gsm8k_test_pred.yaml |
| Datamodule skeleton | datamodule/tinygsm.yaml |

GSM8K and code-execution eval: tinygsm_gsm8k.md.

Model experiments

Training settings, prepare/train commands, and experiment YAMLs live in the per-model docs:

| Model | Experiment | Doc |
| --- | --- | --- |
| FlexMDM | experiment=tinygsm_flexmdm | FlexMDM — TinyGSM (debug: debug=overfit_tinygsm_flexmdm) |
| MLM | experiment=tinygsm_mlm | MLM — TinyGSM |
| ARLM | experiment=tinygsm_arlm | ARLM — TinyGSM |