xlm.tasks.tinygsm.gsm8k

GSM8K test evaluation for TinyGSM-trained code generators.

Preprocessing mirrors the TinyGSM seq2seq layout, but uses the GSM8K fields (question, answer) and leaves the suffix empty at inference. Scoring executes the generated Python and compares numeric results, following PUMA's gsm8k_eval.py: https://github.com/JaeyeonKim01/PUMA/blob/main/eval/gsm8k_eval.py

Gsm8kCodeEval

Post-hoc evaluator: execute generated code and compare to GSM8K gold.

Expects prediction rows with generated_text (suffix-only decode; preferred) or text (full sequence), plus answer or truth (numeric gold).

Hydra::

post_hoc_evaluator:
  _target_: xlm.tasks.tinygsm.Gsm8kCodeEval

extract_gsm8k_final_answer(ans_text)

Extract the numeric final answer from a GSM8K answer field.

GSM8K answers end with a final-answer marker, for example #### 72. If the marker is missing, the function falls back to the last number in the string.
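The marker-then-fallback behavior can be sketched as follows. This is a minimal illustration and not the library's code: the function name, the regular expressions, the comma stripping, and the empty-string return when no number is found are all assumptions.

```python
import re

# Hypothetical pattern: optional sign, digits with optional thousands
# separators, optional decimal part.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_answer_sketch(ans_text: str) -> str:
    """Return the number after '####', else the last number in the text."""
    marker = re.search(r"####\s*(-?\d[\d,]*(?:\.\d+)?)", ans_text)
    if marker:
        # Assumption: thousands separators are dropped from the gold string.
        return marker.group(1).replace(",", "")
    numbers = _NUMBER.findall(ans_text)
    # Assumption: an empty string signals that no number was found.
    return numbers[-1].replace(",", "") if numbers else ""


print(extract_final_answer_sketch("48 + 24 = 72 clips.\n#### 72"))
```

The fallback branch only matters for rows that lack the marker, so a stricter evaluator might prefer to reject such rows instead.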

gsm8k_preprocess_fn(example, tokenizer, *, sep='\n')

Tokenize GSM8K test rows for seq2seq MDM prediction.

Parameters:

example (Dict[str, Any], required): HF row with question and answer fields.

tokenizer (PreTrainedTokenizerBase, required): Hugging Face tokenizer (encode, no special tokens).

sep (str, default '\n'): String between question and generated code region (PUMA default).

Returns:

Dict[str, Any]: Updated example with prompt_token_ids, empty input_token_ids, and answer set to the numeric gold string.
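A rough sketch of the row layout described above, under stated assumptions: the prompt is taken to be the question followed by sep, the example is copied instead of mutated, and CharTokenizer is a hypothetical stand-in so the snippet runs without the transformers package. The real function may build the prompt differently.

```python
from typing import Any, Dict, List


class CharTokenizer:
    """Hypothetical stand-in for a Hugging Face tokenizer (code points as ids)."""

    def encode(self, text: str, add_special_tokens: bool = False) -> List[int]:
        return [ord(ch) for ch in text]


def preprocess_sketch(example: Dict[str, Any], tokenizer, *, sep: str = "\n") -> Dict[str, Any]:
    """Build a prediction row: question prefix, empty suffix, numeric gold."""
    out = dict(example)  # assumption: return a copy, leave the input untouched
    prompt = example["question"] + sep  # assumption: prompt = question + sep
    out["prompt_token_ids"] = tokenizer.encode(prompt, add_special_tokens=False)
    out["input_token_ids"] = []  # suffix is left empty at inference
    # Simplified gold extraction: text after the '####' marker, commas removed.
    out["answer"] = example["answer"].split("####")[-1].strip().replace(",", "")
    return out


row = {"question": "What is 2 + 2?", "answer": "2 + 2 = 4\n#### 4"}
print(preprocess_sketch(row, CharTokenizer())["answer"])
```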

execute_tinygsm_code(code, timeout_s=1.0)

Run reference TinyGSM code and return the value of simple_math_problem().
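One way to run such code under a time limit is to hand it to a child interpreter, sketched below. This is a swapped-in technique chosen for illustration, not necessarily how execute_tinygsm_code works: the function name is hypothetical, and this version returns the printed result as a string (or None on failure) instead of the Python value.

```python
import subprocess
import sys
from typing import Optional


def run_code_sketch(code: str, timeout_s: float = 1.0) -> Optional[str]:
    """Run `code` in a child process and return what simple_math_problem() prints."""
    program = code + "\nprint(simple_math_problem())\n"
    try:
        done = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed when the limit is exceeded
        )
    except subprocess.TimeoutExpired:
        return None
    if done.returncode != 0:  # syntax error, exception, missing function, ...
        return None
    lines = done.stdout.strip().splitlines()
    return lines[-1] if lines else None


print(run_code_sketch("def simple_math_problem():\n    return 6 * 7", timeout_s=10.0))
```

A separate process keeps runaway or crashing code away from the evaluator, at the cost of interpreter start-up time on every call. It is not a security sandbox.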

gold_answer_from_tinygsm_code(code, timeout_s=1.0)

Numeric gold string for post-hoc eval (empty if reference code fails).

tinygsm_pred_preprocess_fn(example, tokenizer, *, sep='\n', gold_timeout_s=1.0)

TinyGSM rows for seq2seq prediction: question prefix, empty suffix, numeric gold.

Gold is computed once by executing the reference code (PUMA/TinyGSM convention).

evaluate_samples(sample, answer, timeout_s=1.0)

Return True if executing sample yields the gold numeric answer.

prediction_code_text(pred)

Return model output text to execute for code-exec scoring.

Prefers generated_text (suffix-only decode from FlexMDM). Falls back to text (full sequence including the question prefix) for older logs.
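The preference order can be sketched as a simple dictionary lookup. The function name is hypothetical, and treating an empty generated_text the same as a missing one is an assumption; the real function may fall back only when the key is absent.

```python
from typing import Any, Dict


def prediction_code_text_sketch(pred: Dict[str, Any]) -> str:
    """Prefer the suffix-only decode; fall back to the full sequence."""
    generated = pred.get("generated_text")
    if generated:  # assumption: empty or missing both trigger the fallback
        return generated
    return pred.get("text") or ""


new_row = {"generated_text": "def simple_math_problem():\n    return 72", "text": "Q ..."}
old_row = {"text": "Q: ...\ndef simple_math_problem():\n    return 72"}
print(prediction_code_text_sketch(new_row) != prediction_code_text_sketch(old_row))
```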

evaluate_sample_with_details(sample, answer, timeout_s=1.0)

Like evaluate_samples but returns (correct, pred_value, error).
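The (correct, pred_value, error) shape can be illustrated with the comparison step alone, taking the already-executed output as input. This is a sketch under assumptions: the function name, the float conversion, and the tolerance values are hypothetical, and the actual evaluator may compare results differently.

```python
import math
from typing import Any, Optional, Tuple


def compare_with_details_sketch(
    pred_text: Any, answer: Any, tol: float = 1e-6
) -> Tuple[bool, Optional[float], Optional[str]]:
    """Compare an executed result with the gold answer; report why it failed."""
    try:
        gold = float(answer)
    except (TypeError, ValueError):
        return False, None, f"bad gold answer: {answer!r}"
    try:
        pred = float(pred_text)
    except (TypeError, ValueError):
        return False, None, f"non-numeric prediction: {pred_text!r}"
    # Assumption: a small tolerance absorbs float formatting such as 72.0 vs 72.
    correct = math.isclose(pred, gold, rel_tol=tol, abs_tol=tol)
    return correct, pred, None


print(compare_with_details_sketch("72.0", "72"))
```

Returning the parsed value and an error string alongside the boolean makes it easier to separate wrong answers from code that failed to run.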