xlm.tasks.tinygsm
Preprocessing for TinyGSM/TinyGSM (math word problems + Python solutions).
Field layout and train/val split semantics follow PUMA's tiny_gsm.py: https://github.com/JaeyeonKim01/PUMA/blob/main/data/tiny_gsm.py
Each example is split into a prefix (question + separator) and a suffix (code)
for seq2seq MDM training. The split is stored as prompt_token_ids /
input_token_ids, and an on-the-fly processor maps these to prompt_ids /
input_ids. Wire a seq2seq collator (e.g. MLMSeq2SeqTrainCollator) in the
model experiment.
GSM8K test evaluation (code execution scoring) lives in :mod:gsm8k — see
gsm8k_preprocess_fn and Gsm8kCodeEval (PUMA gsm8k_eval.py).
Memmap pretokenization (pretokenize_tinygsm, labels.bin,
prompt_mask.bin, TinyGSMDataset) is not supported and will not be added.
Data flows only through DatasetManager + prepare_data + iterable shards.
Padding/truncation to a fixed block size is handled by the collator, not here.
PUMA pads with EOS; xlm collators use pad_token_id unless the experiment
sets loss_on_padding or pad=eos on the tokenizer.
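The two padding conventions can be illustrated with a small sketch. This is not the xlm collator: `pad_to_block`, the right-padding choice, and the token ids are hypothetical and only show how the fill value differs.

```python
from typing import List


def pad_to_block(ids: List[int], block_size: int, pad_id: int) -> List[int]:
    """Truncate to block_size, then right-pad with pad_id (hypothetical helper)."""
    ids = ids[:block_size]
    return ids + [pad_id] * (block_size - len(ids))


# Illustrative ids only; real values come from the tokenizer.
EOS_ID, PAD_ID = 2, 0
tokens = [11, 12, 13, EOS_ID]

puma_style = pad_to_block(tokens, 8, EOS_ID)  # PUMA convention: fill with EOS
xlm_style = pad_to_block(tokens, 8, PAD_ID)   # xlm default: fill with pad_token_id
```

With EOS fill, the padded positions are indistinguishable from a real end-of-sequence token, which is why the choice interacts with whether loss is computed on padding.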
Gsm8kCodeEval
Post-hoc evaluator: execute generated code and compare to GSM8K gold.
Expects prediction rows with generated_text (suffix-only decode; preferred)
or text (full sequence), plus answer or truth (numeric gold).
Hydra::

    post_hoc_evaluator:
      _target_: xlm.tasks.tinygsm.Gsm8kCodeEval
evaluate_samples(sample, answer, timeout_s=1.0)
Return True if executing sample yields the gold numeric answer.
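A minimal sketch of execution-based scoring is shown below. It is not the module's implementation: `run_and_compare` is a hypothetical name, and it assumes the generated code defines a function called `simple_math_problem()` whose return value is the answer. The subprocess, the tolerance, and the way failures map to `False` are illustrative choices.

```python
import subprocess
import sys


def run_and_compare(code: str, gold: str, timeout_s: float = 1.0, tol: float = 1e-6) -> bool:
    """Hypothetical stand-in: run code in a child process, compare stdout to gold."""
    # Assumption: the solution defines simple_math_problem() returning the answer.
    program = code + "\nprint(simple_math_problem())\n"
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # infinite loops and slow code count as wrong
    if proc.returncode != 0:
        return False  # syntax errors and exceptions count as wrong
    try:
        return abs(float(proc.stdout.strip()) - float(gold)) <= tol
    except ValueError:
        return False  # non-numeric output or gold


sample = "def simple_math_problem():\n    return 6 * 12\n"
```

Calling `run_and_compare(sample, "72")` is expected to succeed when the interpreter starts within the timeout. A child process isolates crashes and enforces the timeout, but it is not a security sandbox for untrusted code.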
extract_gsm8k_final_answer(ans_text)
Extract the numeric final answer from a GSM8K answer field.
GSM8K answers end with a marker line such as #### 72. If the marker is
missing, the function falls back to the last number in the string.
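The marker-then-fallback behaviour can be sketched with a regular expression. `extract_final_answer_sketch` is a hypothetical name; stripping thousands separators and returning an empty string when no number is found are assumptions, not documented behaviour.

```python
import re

# Optional sign, digits with optional thousands separators, optional decimals.
_NUMBER = r"-?\d[\d,]*(?:\.\d+)?"


def extract_final_answer_sketch(ans_text: str) -> str:
    """Return the number after '####', else the last number in the text, else ''."""
    marked = re.search(r"####\s*(" + _NUMBER + r")", ans_text)
    if marked:
        return marked.group(1).replace(",", "")
    numbers = re.findall(_NUMBER, ans_text)
    return numbers[-1].replace(",", "") if numbers else ""


with_marker = extract_final_answer_sketch("She sold 48 + 24 = 72 clips.\n#### 72")
without_marker = extract_final_answer_sketch("The total is 1,250 dollars")
```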
gold_answer_from_tinygsm_code(code, timeout_s=1.0)
Numeric gold string for post-hoc eval (empty if reference code fails).
gsm8k_preprocess_fn(example, tokenizer, *, sep='\n')
Tokenize GSM8K test rows for seq2seq MDM prediction.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `example` | `Dict[str, Any]` | HF row with | required |
| `tokenizer` | `PreTrainedTokenizerBase` | Hugging Face tokenizer ( | required |
| `sep` | `str` | String between question and generated code region (PUMA default). | `'\n'` |
Returns:
| Type | Description |
|---|---|
| `Dict[str, Any]` | Updated example with |
| `Dict[str, Any]` | and |
tinygsm_pred_preprocess_fn(example, tokenizer, *, sep='\n', gold_timeout_s=1.0)
TinyGSM rows for seq2seq prediction: question prefix, empty suffix, numeric gold.
Gold is computed once by executing the reference code (PUMA/TinyGSM convention).
reset_tinygsm_debug_first_example_filter_fn()
Reset :func:tinygsm_debug_first_example_filter_fn state (for tests).
tinygsm_debug_first_example_filter_fn(example)
Keep only the first TinyGSM row when building debug manual caches.
Used with filter_suffix: debug_one in flexmdm debug dataset configs.
Run prepare_data with num_dataset_workers=1 so Dataset.filter is
single-process; multiprocessing can drop or duplicate rows.
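The single-process requirement is easier to see with a sketch of a stateful filter. This assumes the filter keeps a module-level flag, which the existence of a reset function suggests but the source does not confirm; `keep_first_only` and `reset_keep_first_only` are hypothetical names.

```python
_seen_first = False  # module-level state, private to each process


def keep_first_only(example: dict) -> bool:
    """Return True for the first call only (hypothetical stand-in)."""
    global _seen_first
    if _seen_first:
        return False
    _seen_first = True
    return True


def reset_keep_first_only() -> None:
    """Clear the flag so the next call is treated as the first again."""
    global _seen_first
    _seen_first = False


rows = [{"id": 0}, {"id": 1}, {"id": 2}]
kept = [r for r in rows if keep_first_only(r)]
reset_keep_first_only()
kept_again = [r for r in rows if keep_first_only(r)]
```

Under this assumption, each worker process would hold its own copy of the flag, so several workers could each keep their own "first" row. Running with one worker keeps the result deterministic.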
tinygsm_preprocess_fn(example, tokenizer, *, sep='\n')
Tokenize TinyGSM rows into prefix/suffix token id lists.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `example` | `Dict[str, Any]` | HF row with | required |
| `tokenizer` | `PreTrainedTokenizerBase` | Hugging Face tokenizer ( | required |
| `sep` | `str` | String between question and code (PUMA default: newline). | `'\n'` |
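A rough sketch of the prefix/suffix split is given below. It is not the module's code: `split_prefix_suffix` and `CharTokenizer` are hypothetical, the row is assumed to carry `question` and `code` fields, and the mapping of the prefix to `prompt_token_ids` and the suffix to `input_token_ids` is inferred from the module description. The toy tokenizer keeps the example runnable without downloading a model; a real Hugging Face tokenizer exposes the same `encode(text, add_special_tokens=...)` call.

```python
from typing import Any, Dict, List


class CharTokenizer:
    """Toy stand-in for a Hugging Face tokenizer: one id per character."""

    def encode(self, text: str, add_special_tokens: bool = False) -> List[int]:
        return [ord(ch) for ch in text]


def split_prefix_suffix(
    example: Dict[str, Any], tokenizer: Any, *, sep: str = "\n"
) -> Dict[str, Any]:
    """Hypothetical sketch: tokenize question + sep as prefix, code as suffix."""
    # Assumption: the row has "question" and "code" fields.
    prefix = tokenizer.encode(example["question"] + sep, add_special_tokens=False)
    suffix = tokenizer.encode(example["code"], add_special_tokens=False)
    return {**example, "prompt_token_ids": prefix, "input_token_ids": suffix}


row = {"question": "2+2?", "code": "def simple_math_problem():\n    return 4"}
out = split_prefix_suffix(row, CharTokenizer())
```

The lists are left unpadded here, consistent with padding and truncation being handled by the collator.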