
xlm.tasks.tinygsm

Preprocessing for TinyGSM/TinyGSM (math word problems + Python solutions).

Field layout and train/val split semantics follow PUMA's tiny_gsm.py: https://github.com/JaeyeonKim01/PUMA/blob/main/data/tiny_gsm.py

Each example is split into a prefix (question + separator) and a suffix (code) for seq2seq MDM training. Preprocessing stores the two parts as prompt_token_ids and input_token_ids, and an on-the-fly processor maps them to prompt_ids and input_ids. Wire a seq2seq collator (e.g. MLMSeq2SeqTrainCollator) in the model experiment.

GSM8K test evaluation (code execution scoring) lives in the gsm8k module; see gsm8k_preprocess_fn and Gsm8kCodeEval (cf. PUMA's gsm8k_eval.py).

Memmap pretokenization (pretokenize_tinygsm, labels.bin, prompt_mask.bin, TinyGSMDataset) is not supported and will not be added. Data flows only through DatasetManager + prepare_data + iterable shards.

Padding/truncation to a fixed block size is handled by the collator, not here. PUMA pads with EOS; xlm collators use pad_token_id unless the experiment sets loss_on_padding or pad=eos on the tokenizer.
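The difference between the two padding conventions can be illustrated with a minimal sketch. `pad_to_block` is a hypothetical helper, not an xlm or PUMA function; the token ids are made up, and right-side truncation is an assumption about what a collator might do.

```python
from typing import List


def pad_to_block(ids: List[int], block_size: int, pad_id: int) -> List[int]:
    """Truncate on the right, then pad on the right to exactly block_size."""
    ids = ids[:block_size]
    return ids + [pad_id] * (block_size - len(ids))


# Hypothetical token ids, for illustration only.
EOS_ID = 2
PAD_ID = 0
tokens = [11, 12, 13]

puma_style = pad_to_block(tokens, block_size=6, pad_id=EOS_ID)  # pad with EOS
xlm_default = pad_to_block(tokens, block_size=6, pad_id=PAD_ID)  # pad with pad_token_id
```

With EOS padding the model sees the end-of-sequence token in every padded position; with a dedicated pad token those positions can be distinguished from a real EOS.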

Gsm8kCodeEval

Post-hoc evaluator: execute generated code and compare to GSM8K gold.

Expects prediction rows with generated_text (suffix-only decode; preferred) or text (full sequence), plus answer or truth (numeric gold).

Hydra configuration:

post_hoc_evaluator:
  _target_: xlm.tasks.tinygsm.Gsm8kCodeEval

evaluate_samples(sample, answer, timeout_s=1.0)

Return True if executing sample yields the gold numeric answer.

extract_gsm8k_final_answer(ans_text)

Extract the numeric final answer from a GSM8K answer field.

GSM8K answers end with #### 72. Falls back to the last number in the string if the marker is missing.
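The marker-then-fallback behaviour described above can be sketched as follows. This is an illustrative stand-in, not the xlm source: the function name `extract_final_answer_sketch`, the number regex, and the stripping of thousands separators are all assumptions.

```python
import re

# Integers and decimals with an optional sign and thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_answer_sketch(ans_text: str) -> str:
    """Return the number after the last '####', else the last number found."""
    marker = "####"
    if marker in ans_text:
        tail = ans_text.rsplit(marker, 1)[1]
        match = _NUMBER.search(tail)
        if match:
            return match.group(0).replace(",", "")
    # Fallback: the marker is missing (or nothing numeric follows it).
    numbers = _NUMBER.findall(ans_text)
    return numbers[-1].replace(",", "") if numbers else ""
```

For example, `extract_final_answer_sketch("48 + 24 = 72 clips.\n#### 72")` returns `"72"`, and a string without the marker falls back to its last number.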

gold_answer_from_tinygsm_code(code, timeout_s=1.0)

Numeric gold string for post-hoc eval (empty if reference code fails).
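Both evaluate_samples and the gold derivation here rest on the same idea: run a piece of solution code under a timeout and read back a number. A minimal sketch of that idea follows. It assumes the solution defines a zero-argument function named `simple_math_problem` (an assumption about the TinyGSM code convention), and the helper names `run_solution` and `answers_match` and the tolerance are hypothetical. It is not the xlm or PUMA implementation.

```python
import subprocess
import sys

# Child-process script: exec the solution read from stdin, call the assumed
# entry point, and print its return value.
_RUNNER = """
import sys
namespace = {}
exec(sys.stdin.read(), namespace)
print(namespace["simple_math_problem"]())
"""


def run_solution(code: str, timeout_s: float = 1.0) -> str:
    """Return the printed result of the solution, or '' on error or timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", _RUNNER],
            input=code,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return ""
    if proc.returncode != 0:
        return ""
    lines = proc.stdout.strip().splitlines()
    return lines[-1] if lines else ""


def answers_match(sample: str, answer: str, timeout_s: float = 1.0,
                  tol: float = 1e-6) -> bool:
    """True if the executed sample agrees numerically with the gold answer."""
    try:
        return abs(float(run_solution(sample, timeout_s)) - float(answer)) <= tol
    except (TypeError, ValueError):
        return False
```

A subprocess with a timeout bounds runtime but is not a sandbox; generated code still runs with the caller's permissions.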

gsm8k_preprocess_fn(example, tokenizer, *, sep='\n')

Tokenize GSM8K test rows for seq2seq MDM prediction.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| example | Dict[str, Any] | HF row with question and answer fields. | required |
| tokenizer | PreTrainedTokenizerBase | Hugging Face tokenizer (encode, no special tokens). | required |
| sep | str | String between question and generated code region (PUMA default). | '\n' |

Returns:

| Type | Description |
|---|---|
| Dict[str, Any] | Updated example with prompt_token_ids, empty input_token_ids, and answer set to the numeric gold string. |
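The input and output shape described above can be sketched with a toy tokenizer. Everything here is illustrative: `ToyTokenizer` is a hypothetical character-level stand-in for a Hugging Face tokenizer, `gsm8k_preprocess_sketch` is not the xlm function, and returning a copy of the row and tokenizing question and separator together are assumptions.

```python
from typing import Any, Dict, List


class ToyTokenizer:
    """Hypothetical stand-in exposing only the encode() call used below."""

    def encode(self, text: str, add_special_tokens: bool = False) -> List[int]:
        return [ord(ch) for ch in text]


def gsm8k_preprocess_sketch(example: Dict[str, Any], tokenizer: Any, *,
                            sep: str = "\n") -> Dict[str, Any]:
    out = dict(example)  # assumption: return a copy rather than mutate the row
    # Prefix = question + separator, encoded without special tokens.
    out["prompt_token_ids"] = tokenizer.encode(
        example["question"] + sep, add_special_tokens=False
    )
    # Prediction rows carry no suffix; the model generates the code.
    out["input_token_ids"] = []
    # Simplified gold extraction: keep what follows the '####' marker.
    out["answer"] = example["answer"].split("####")[-1].strip().replace(",", "")
    return out


row = {"question": "2+2?", "answer": "2 + 2 = 4\n#### 4"}
processed = gsm8k_preprocess_sketch(row, ToyTokenizer())
```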

tinygsm_pred_preprocess_fn(example, tokenizer, *, sep='\n', gold_timeout_s=1.0)

TinyGSM rows for seq2seq prediction: question prefix, empty suffix, numeric gold.

Gold is computed once by executing the reference code (PUMA/TinyGSM convention).

reset_tinygsm_debug_first_example_filter_fn()

Reset the state of tinygsm_debug_first_example_filter_fn (for tests).

tinygsm_debug_first_example_filter_fn(example)

Keep only the first TinyGSM row when building debug manual caches.

Used with filter_suffix: debug_one in flexmdm debug dataset configs. Run prepare_data with num_dataset_workers=1 so Dataset.filter is single-process; multiprocessing can drop or duplicate rows.
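A stateful keep-first filter and its reset hook can be sketched as below. The names `keep_first_example` and `reset_keep_first_example` are hypothetical and the real functions may track state differently. The sketch also suggests one plausible reason for the single-worker requirement: with multiple processes, each worker holds its own copy of the flag.

```python
from typing import Any, Dict

# Module-level state shared by the filter and its reset hook.
_STATE = {"seen_first": False}


def keep_first_example(example: Dict[str, Any]) -> bool:
    """Return True for the first row seen in this process, False afterwards."""
    if _STATE["seen_first"]:
        return False
    _STATE["seen_first"] = True
    return True


def reset_keep_first_example() -> None:
    """Forget that a row was seen, so the next call keeps a row again."""
    _STATE["seen_first"] = False


rows = [{"question": "q1"}, {"question": "q2"}, {"question": "q3"}]
kept = [r for r in rows if keep_first_example(r)]
```

Because the flag persists across calls, a test that runs the filter twice needs the reset in between; otherwise the second run keeps nothing.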

tinygsm_preprocess_fn(example, tokenizer, *, sep='\n')

Tokenize TinyGSM rows into prefix/suffix token id lists.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| example | Dict[str, Any] | HF row with question and code fields. | required |
| tokenizer | PreTrainedTokenizerBase | Hugging Face tokenizer (encode, no special tokens). | required |
| sep | str | String between question and code (PUMA default: newline). | '\n' |
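For training rows, the prefix/suffix split might look like the sketch below. `ToyTokenizer` is a hypothetical character-level stand-in for a Hugging Face tokenizer, and `tinygsm_preprocess_sketch` is illustrative rather than the xlm function. Whether the real function appends an EOS token to the suffix is not stated here, so the sketch leaves it out.

```python
from typing import Any, Dict, List


class ToyTokenizer:
    """Hypothetical stand-in exposing only the encode() call used below."""

    def encode(self, text: str, add_special_tokens: bool = False) -> List[int]:
        return [ord(ch) for ch in text]


def tinygsm_preprocess_sketch(example: Dict[str, Any], tokenizer: Any, *,
                              sep: str = "\n") -> Dict[str, Any]:
    out = dict(example)  # assumption: return a copy rather than mutate the row
    # Prefix: question followed by the separator.
    out["prompt_token_ids"] = tokenizer.encode(
        example["question"] + sep, add_special_tokens=False
    )
    # Suffix: the Python solution the model learns to generate.
    out["input_token_ids"] = tokenizer.encode(
        example["code"], add_special_tokens=False
    )
    return out


row = {"question": "2+2?", "code": "def simple_math_problem():\n    return 2 + 2\n"}
processed = tinygsm_preprocess_sketch(row, ToyTokenizer())
```

Keeping the two id lists separate lets a seq2seq collator decide later how to concatenate, mask, and pad them.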