`xlm.tasks.tinygsm.gsm8k`

GSM8K test evaluation for TinyGSM-trained code generators.

Preprocessing mirrors the TinyGSM seq2seq layout but uses the GSM8K fields (`question`, `answer`) and leaves the suffix empty at inference. Scoring executes the generated Python and compares numeric results, following PUMA's `gsm8k_eval.py`: https://github.com/JaeyeonKim01/PUMA/blob/main/eval/gsm8k_eval.py

`Gsm8kCodeEval`

Post-hoc evaluator: execute generated code and compare to GSM8K gold.

Expects prediction rows with `generated_text` (suffix-only decode; preferred) or `text` (full sequence), plus `answer` or `truth` (numeric gold).

Hydra::

    post_hoc_evaluator:
      _target_: xlm.tasks.tinygsm.Gsm8kCodeEval

`extract_gsm8k_final_answer(ans_text)`

Extract the numeric final answer from a GSM8K answer field.

GSM8K answers end with a marker line such as `#### 72`. If the marker is missing, the function falls back to the last number in the string.
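
The marker-then-fallback rule can be sketched as follows. This is a minimal illustration, not the module's implementation: the name `sketch_extract_final_answer`, the regular expressions, the comma stripping, and the string return type are all assumptions.

```python
import re

# Hypothetical number pattern: optional sign, digits with optional
# thousands separators, optional decimal part.
_NUMBER = r"-?\d[\d,]*(?:\.\d+)?"


def sketch_extract_final_answer(ans_text: str) -> str:
    """Return the number after '####', else the last number, else ''."""
    marker = re.search(r"####\s*(" + _NUMBER + r")", ans_text)
    if marker:
        # Preferred path: the GSM8K final-answer marker is present.
        return marker.group(1).replace(",", "")
    # Fallback path: take the last number that appears anywhere.
    numbers = re.findall(_NUMBER, ans_text)
    return numbers[-1].replace(",", "") if numbers else ""
```

For example, `sketch_extract_final_answer("48 + 24 = 72 clips.\n#### 72")` takes the marker path, while a string without `####` takes the fallback path.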

`gsm8k_preprocess_fn(example, tokenizer, *, sep='\n')`

Tokenize GSM8K test rows for seq2seq MDM prediction.

Parameters:

Name	Type	Description	Default
`example`	`Dict[str, Any]`	HF row with `question` and `answer` fields.	required
`tokenizer`	`PreTrainedTokenizerBase`	Hugging Face tokenizer; `encode` is called without adding special tokens.	required
`sep`	`str`	String between question and generated code region (PUMA default).	`'\n'`

Returns:

Type	Description
`Dict[str, Any]`	Updated example with `prompt_token_ids`, empty `input_token_ids`, and `answer` set to the numeric gold string.
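
A rough sketch of the output shape described above, under stated assumptions: `ToyTokenizer` is a hypothetical stand-in for a Hugging Face tokenizer, the prompt is assumed to be the question followed by `sep`, and the gold answer is assumed to be whatever follows the `####` marker. The real function may differ on each of these points.

```python
from typing import Any, Dict, List


class ToyTokenizer:
    """Hypothetical stand-in: one token id per character."""

    def encode(self, text: str, add_special_tokens: bool = True) -> List[int]:
        return [ord(ch) for ch in text]


def sketch_preprocess(example: Dict[str, Any], tokenizer, *, sep: str = "\n") -> Dict[str, Any]:
    out = dict(example)  # leave the caller's row untouched
    # Assumption: the prompt is the question followed by the separator.
    prompt = example["question"] + sep
    out["prompt_token_ids"] = tokenizer.encode(prompt, add_special_tokens=False)
    # The suffix is left empty at inference; the model generates it.
    out["input_token_ids"] = []
    # Assumption: the gold is the text after the '####' marker.
    out["answer"] = example["answer"].split("####")[-1].strip()
    return out


row = sketch_preprocess({"question": "2+2?", "answer": "2+2=4\n#### 4"}, ToyTokenizer())
```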

`execute_tinygsm_code(code, timeout_s=1.0)`

Run reference TinyGSM code and return the value of `simple_math_problem()`.
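
One way to run such code under a time limit is a fresh interpreter process. The sketch below is an assumption about how this could be done, not a description of this module: the name `sketch_execute_code`, the use of `subprocess`, and the choice to return `None` on any failure are all hypothetical.

```python
import subprocess
import sys
from typing import Optional


def sketch_execute_code(code: str, timeout_s: float = 1.0) -> Optional[float]:
    """Run `code` in a child interpreter and return simple_math_problem() as a float."""
    # Append a call that prints the result on the last line of stdout.
    runner = code + "\nprint(repr(simple_math_problem()))\n"
    try:
        proc = subprocess.run(
            [sys.executable, "-c", runner],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None  # the child is killed when the timeout expires
    if proc.returncode != 0:
        return None  # syntax error, exception, or missing function
    lines = proc.stdout.strip().splitlines()
    if not lines:
        return None
    try:
        return float(lines[-1])
    except ValueError:
        return None  # the function returned something non-numeric
```

A separate process keeps a runaway loop from blocking the evaluator, at the cost of interpreter start-up time on every call, so a very small `timeout_s` may need to be raised on slow machines.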

`gold_answer_from_tinygsm_code(code, timeout_s=1.0)`

Numeric gold string for post-hoc eval (empty if reference code fails).

`tinygsm_pred_preprocess_fn(example, tokenizer, *, sep='\n', gold_timeout_s=1.0)`

TinyGSM rows for seq2seq prediction: question prefix, empty suffix, numeric gold.

Gold is computed once by executing the reference code (PUMA/TinyGSM convention).

`evaluate_samples(sample, answer, timeout_s=1.0)`

Return True if executing sample yields the gold numeric answer.
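
The comparison step might look like the sketch below. The helper name `sketch_answers_match`, the absolute tolerance, and the comma stripping are assumptions for illustration; the source does not state how equality is decided.

```python
import math
from typing import Optional


def sketch_answers_match(pred_value: Optional[float], gold: str, tol: float = 1e-6) -> bool:
    """Compare an executed result against a numeric gold string."""
    if pred_value is None:
        return False  # execution failed or produced no number
    try:
        gold_value = float(gold.replace(",", ""))
    except ValueError:
        return False  # empty or non-numeric gold string
    # Assumed tolerance-based comparison so 72 and 72.0 agree.
    return math.isclose(float(pred_value), gold_value, rel_tol=0.0, abs_tol=tol)
```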

`prediction_code_text(pred)`

Return model output text to execute for code-exec scoring.

Prefers `generated_text` (suffix-only decode from FlexMDM). Falls back to `text` (full sequence including the question prefix) for older logs.
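
The preference order can be illustrated with a short sketch. The name `sketch_prediction_text` is hypothetical, and treating a missing or `None` value as "absent" (rather than, say, an empty string) is an assumption.

```python
from typing import Any, Mapping


def sketch_prediction_text(pred: Mapping[str, Any]) -> str:
    """Pick the text to execute from one prediction row."""
    generated = pred.get("generated_text")
    if generated is not None:
        # Preferred: suffix-only decode, no question prefix.
        return str(generated)
    # Older logs: full sequence, including the question prefix.
    return str(pred.get("text") or "")
```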

`evaluate_sample_with_details(sample, answer, timeout_s=1.0)`

Like `evaluate_samples`, but returns `(correct, pred_value, error)`.
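
To show the shape of the three-element result, here is a simplified sketch. It is not this module's code: the name `sketch_evaluate_with_details` is hypothetical, the error format is assumed, and the sample runs in-process with no timeout, so it should only be used on trusted input.

```python
from typing import Optional, Tuple


def sketch_evaluate_with_details(
    sample: str, answer: str
) -> Tuple[bool, Optional[float], Optional[str]]:
    """Return (correct, pred_value, error) for one generated sample."""
    namespace: dict = {}
    try:
        # Simplification: in-process exec, no timeout, no sandbox.
        exec(sample, namespace)
        pred_value = float(namespace["simple_math_problem"]())
    except Exception as exc:
        return False, None, f"{type(exc).__name__}: {exc}"
    try:
        gold = float(answer)
    except ValueError as exc:
        return False, pred_value, f"bad gold: {exc}"
    return abs(pred_value - gold) < 1e-6, pred_value, None
```

Returning the error string alongside the verdict makes it possible to separate wrong answers from samples that failed to run when inspecting logs.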