TinyGSM
TinyGSM/TinyGSM on Hugging Face provides roughly 11.8M training examples, each pairing a math word problem (question) with a Python solution (code). In xlm-core each example is wired as a seq2seq MDM task: the prefix is question + "\n" and the suffix is code, following the field layout in PUMA tiny_gsm.py.
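As a minimal sketch of that prefix/suffix layout, assuming a hypothetical helper `split_example` (not part of xlm-core) and an invented sample record:

```python
def split_example(example: dict) -> tuple[str, str]:
    """Hypothetical helper: return (prefix, suffix) for one TinyGSM record."""
    prefix = example["question"] + "\n"  # prompt text plus newline separator
    suffix = example["code"]             # Python solution to be generated
    return prefix, suffix


# Invented record for illustration; real records come from TinyGSM/TinyGSM.
sample = {
    "question": "Tom has 3 apples and buys 2 more. How many apples does he have?",
    "code": "def solution():\n    return 3 + 2",
}

prefix, suffix = split_example(sample)
```

Only the `question` and `code` field names and the `"\n"` separator come from the description above; tokenization is left out of this sketch.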
Memmap pretokenization (labels.bin, prompt_mask.bin, and related offline paths) is not supported. Data flows only through DatasetManager, the job_type=prepare_data step, and iterable shards at train time.
See also: Adding a task or dataset.
Preprocessing
| Step | Detail |
|---|---|
| Task module | xlm.tasks.tinygsm.tinygsm_preprocess_fn |
| Outputs | prompt_token_ids (question + separator), input_token_ids (code) |
| On-the-fly processor | xlm.datamodule.token_ids_to_input_ids_and_prompt_ids |
| Val split | 5% holdout via train_test_split with seed: 2025, size: 0.05 on the HF train split |
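The validation holdout in the last row can be sketched with the standard library alone. This is a stand-in for illustration, not the Hugging Face `train_test_split` implementation: the helper name `holdout_split` is hypothetical, and the indices it selects will not match those of the real split for the same seed.

```python
import random


def holdout_split(n_examples: int, size: float = 0.05, seed: int = 2025):
    """Hypothetical helper: seeded shuffle, then hold out a fraction for validation."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)   # deterministic for a fixed seed
    n_val = int(round(n_examples * size))  # 5% of the examples by default
    val = sorted(indices[:n_val])
    train = sorted(indices[n_val:])
    return train, val


train_idx, val_idx = holdout_split(1000)
print(len(train_idx), len(val_idx))  # 950 50
```

Fixing the seed keeps the train/validation boundary identical across runs, so validation loss stays comparable between experiments.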
Hydra configs (src/xlm)
| Config | Path |
|---|---|
| Base dataset | datasets/tinygsm.yaml |
| Train | datasets/tinygsm_train.yaml |
| Val (loss) | datasets/tinygsm_val.yaml |
| Val (prediction) | datasets/tinygsm_val_pred.yaml (uses tinygsm_pred_preprocess_fn; code execution runs post hoc) |
| GSM8K test | datasets/gsm8k_test_pred.yaml |
| Datamodule skeleton | datamodule/tinygsm.yaml |
GSM8K and code-execution eval: tinygsm_gsm8k.md.
Model experiments
Training settings, prepare/train commands, and experiment YAMLs live in the per-model docs:
| Model | Experiment | Doc |
|---|---|---|
| FlexMDM | experiment=tinygsm_flexmdm | FlexMDM — TinyGSM (debug: debug=overfit_tinygsm_flexmdm) |
| MLM | experiment=tinygsm_mlm | MLM — TinyGSM |
| ARLM | experiment=tinygsm_arlm | ARLM — TinyGSM |