`xlm.tasks.tinygsm`

Preprocessing for TinyGSM/TinyGSM (math word problems + Python solutions).

Field layout and train/val split semantics follow PUMA's tiny_gsm.py: https://github.com/JaeyeonKim01/PUMA/blob/main/data/tiny_gsm.py

Each example is split into a prefix (question + separator) and a suffix (code) for seq2seq MDM training. The preprocessing functions write prompt_token_ids / input_token_ids, and an on-the-fly processor maps them to prompt_ids / input_ids. Wire a seq2seq collator (e.g. MLMSeq2SeqTrainCollator) into the model experiment.

GSM8K test evaluation (code-execution scoring) lives in :mod:`gsm8k`; see gsm8k_preprocess_fn and Gsm8kCodeEval (PUMA gsm8k_eval.py).

Memmap pretokenization (pretokenize_tinygsm, labels.bin, prompt_mask.bin, TinyGSMDataset) is not supported and will not be added. Data flows only through DatasetManager + prepare_data + iterable shards.

Padding/truncation to a fixed block size is handled by the collator, not here. PUMA pads with EOS; xlm collators use pad_token_id unless the experiment sets loss_on_padding or pad=eos on the tokenizer.
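The two padding conventions differ only in which id fills the tail of a short sequence. The sketch below illustrates this with a hypothetical helper, hypothetical token ids, and right-padding assumed; it is not the xlm collator code.

```python
from typing import List


def pad_or_truncate_sketch(ids: List[int], block_size: int, pad_id: int) -> List[int]:
    """Cut `ids` to `block_size`, then right-pad with `pad_id` (illustrative only)."""
    ids = ids[:block_size]
    return ids + [pad_id] * (block_size - len(ids))


# Hypothetical ids chosen for the example; real values come from the tokenizer.
EOS_ID, PAD_ID = 2, 0
tokens = [5, 6, 7]
eos_padded = pad_or_truncate_sketch(tokens, 5, EOS_ID)  # PUMA-style tail
pad_padded = pad_or_truncate_sketch(tokens, 5, PAD_ID)  # pad_token_id tail
```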

`Gsm8kCodeEval`

Post-hoc evaluator: execute generated code and compare to GSM8K gold.

Expects prediction rows with generated_text (suffix-only decode; preferred) or text (full sequence), plus answer or truth (numeric gold).

Hydra::

    post_hoc_evaluator:
      _target_: xlm.tasks.tinygsm.Gsm8kCodeEval

`evaluate_samples(sample, answer, timeout_s=1.0)`

Return True if executing sample yields the gold numeric answer.
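One way such a check could work is sketched below. It assumes the sample defines a zero-argument entry point named simple_math_problem (an assumption about the code convention, not something this page states), runs it in a child interpreter so that a timeout can be enforced, and compares the printed result to the gold value numerically. The helper names and the 1e-6 tolerance are hypothetical; the actual evaluate_samples may differ.

```python
import math
import subprocess
import sys


def run_solution_sketch(code: str, timeout_s: float = 1.0) -> str:
    """Run `code` in a child interpreter; return the last printed line or ''."""
    # Assumption: the code defines simple_math_problem() and returns the answer.
    program = code + "\nprint(simple_math_problem())\n"
    try:
        done = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return ""
    if done.returncode != 0:
        return ""
    lines = done.stdout.strip().splitlines()
    return lines[-1] if lines else ""


def evaluate_sample_sketch(sample: str, answer, timeout_s: float = 1.0) -> bool:
    """True if the executed sample matches the gold answer numerically."""
    out = run_solution_sketch(sample, timeout_s)
    try:
        return math.isclose(float(out), float(answer), rel_tol=0.0, abs_tol=1e-6)
    except (TypeError, ValueError):
        # Non-numeric output, failed execution, or missing gold counts as wrong.
        return False
```

A child process gives a hard timeout but is not a security sandbox; untrusted generated code still needs isolation at the system level.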

`extract_gsm8k_final_answer(ans_text)`

Extract the numeric final answer from a GSM8K answer field.

GSM8K answers end with #### 72. Falls back to the last number in the string if the marker is missing.
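The marker-then-fallback behaviour described above can be sketched with two regular expressions. The function name, the comma stripping, and the exact number pattern are assumptions made for illustration, not the library's implementation.

```python
import re

# Optional sign, digits with optional thousands separators, optional decimals.
_NUMBER = r"-?\d[\d,]*(?:\.\d+)?"


def extract_final_answer_sketch(ans_text: str) -> str:
    """Return the number after '####', else the last number in the text."""
    marker = re.search(r"####\s*(" + _NUMBER + r")", ans_text)
    if marker:
        return marker.group(1).replace(",", "")
    numbers = re.findall(_NUMBER, ans_text)
    return numbers[-1].replace(",", "") if numbers else ""
```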

`gold_answer_from_tinygsm_code(code, timeout_s=1.0)`

Return the numeric gold string for post-hoc evaluation, obtained by executing the reference code. Returns an empty string if the reference code fails.

`gsm8k_preprocess_fn(example, tokenizer, *, sep='\n')`

Tokenize GSM8K test rows for seq2seq MDM prediction.

Parameters:

Name	Type	Description	Default
`example`	`Dict[str, Any]`	HF row with `question` and `answer` fields.	required
`tokenizer`	`PreTrainedTokenizerBase`	Hugging Face tokenizer (`encode`, no special tokens).	required
`sep`	`str`	String between question and generated code region (PUMA default).	`'\n'`

Returns:

Type	Description
`Dict[str, Any]`	Updated example with `prompt_token_ids`, empty `input_token_ids`, and `answer` set to the numeric gold string.

`tinygsm_pred_preprocess_fn(example, tokenizer, *, sep='\n', gold_timeout_s=1.0)`

TinyGSM rows for seq2seq prediction: question prefix, empty suffix, numeric gold.

Gold is computed once by executing the reference code (PUMA/TinyGSM convention).

`reset_tinygsm_debug_first_example_filter_fn()`

Reset :func:`tinygsm_debug_first_example_filter_fn` state (for tests).

`tinygsm_debug_first_example_filter_fn(example)`

Keep only the first TinyGSM row when building debug manual caches.

Used with filter_suffix: debug_one in flexmdm debug dataset configs. Run prepare_data with num_dataset_workers=1 so Dataset.filter is single-process; multiprocessing can drop or duplicate rows.
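A minimal sketch of a stateful "keep the first row" filter shows why a single process matters. It assumes the state is a module-level flag (hypothetical; the real implementation may store it differently). With several workers, each process would hold its own copy of the flag, so the filter can no longer guarantee exactly one surviving row.

```python
_seen_first = False  # module-level state; each worker process has its own copy


def keep_first_example_sketch(example: dict) -> bool:
    """Return True for the first call only, False afterwards."""
    global _seen_first
    if _seen_first:
        return False
    _seen_first = True
    return True


def reset_keep_first_sketch() -> None:
    """Clear the flag so the next call is treated as the first (for tests)."""
    global _seen_first
    _seen_first = False


rows = [{"question": "a"}, {"question": "b"}, {"question": "c"}]
kept = [row for row in rows if keep_first_example_sketch(row)]
```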

`tinygsm_preprocess_fn(example, tokenizer, *, sep='\n')`

Tokenize TinyGSM rows into prefix/suffix token id lists.

Parameters:

Name	Type	Description	Default
`example`	`Dict[str, Any]`	HF row with `question` and `code` fields.	required
`tokenizer`	`PreTrainedTokenizerBase`	Hugging Face tokenizer (`encode`, no special tokens).	required
`sep`	`str`	String between question and code (PUMA default: newline).	`'\n'`
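The prefix/suffix split can be sketched as follows. CharTokenizer is a hypothetical stand-in so the example runs without downloading a model; with a real Hugging Face tokenizer the same `encode(..., add_special_tokens=False)` call applies. The function below only illustrates the field layout described on this page and is not the library's code.

```python
from typing import Any, Dict, List


class CharTokenizer:
    """Hypothetical stand-in for a Hugging Face tokenizer: one id per character."""

    def encode(self, text: str, add_special_tokens: bool = False) -> List[int]:
        return [ord(ch) for ch in text]


def tinygsm_preprocess_sketch(
    example: Dict[str, Any], tokenizer: Any, *, sep: str = "\n"
) -> Dict[str, Any]:
    """Add prefix ids (question + sep) and suffix ids (code) to a copy of the row."""
    out = dict(example)
    prefix = example["question"] + sep
    out["prompt_token_ids"] = tokenizer.encode(prefix, add_special_tokens=False)
    out["input_token_ids"] = tokenizer.encode(example["code"], add_special_tokens=False)
    return out


row = {"question": "2+2?", "code": "x"}
processed = tinygsm_preprocess_sketch(row, CharTokenizer())
```

Keeping the separator inside the prefix means the suffix contains only code, so the loss-bearing region is exactly the solution.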