xlm.tasks.tinygsm.gsm8k
GSM8K test evaluation for TinyGSM-trained code generators.
Preprocessing mirrors the TinyGSM seq2seq layout but uses the GSM8K fields (question,
answer) and leaves the suffix empty at inference. Scoring executes the generated
Python and compares numeric results, following PUMA's gsm8k_eval.py:
https://github.com/JaeyeonKim01/PUMA/blob/main/eval/gsm8k_eval.py
Gsm8kCodeEval
Post-hoc evaluator: execute generated code and compare to GSM8K gold.
Expects prediction rows with generated_text (suffix-only decode; preferred)
or text (full sequence), plus answer or truth (numeric gold).
Hydra::

    post_hoc_evaluator:
      _target_: xlm.tasks.tinygsm.Gsm8kCodeEval
extract_gsm8k_final_answer(ans_text)
Extract the numeric final answer from a GSM8K answer field.
GSM8K answers end with a marker line such as #### 72. Falls back to the last
number in the string if the marker is missing.
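
The marker-then-fallback rule described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the name `extract_final_answer_sketch`, the regular expressions, the comma stripping, and the empty-string return when no number is found are all assumptions.

```python
import re

# Assumed number syntax: optional sign, digits with optional thousands
# separators, optional decimal part.
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")
MARKER_RE = re.compile(r"####\s*(-?\d[\d,]*(?:\.\d+)?)")


def extract_final_answer_sketch(ans_text: str) -> str:
    """Return the number after '####', else the last number in the text."""
    match = MARKER_RE.search(ans_text)
    if match:
        return match.group(1).replace(",", "")
    # Fallback: no marker, so take the last number that appears anywhere.
    numbers = NUMBER_RE.findall(ans_text)
    return numbers[-1].replace(",", "") if numbers else ""
```

For example, an answer ending in `#### 1,000` would yield `1000` under these assumptions, and a marker-free string would yield its last number.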
gsm8k_preprocess_fn(example, tokenizer, *, sep='\n')
Tokenize GSM8K test rows for seq2seq MDM prediction.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| example | Dict[str, Any] | HF row with | required |
| tokenizer | PreTrainedTokenizerBase | Hugging Face tokenizer ( | required |
| sep | str | String between question and generated code region (PUMA default). | '\n' |
Returns:
| Type | Description |
|---|---|
| Dict[str, Any] | Updated example with |
| Dict[str, Any] | and |
execute_tinygsm_code(code, timeout_s=1.0)
Run reference TinyGSM code and return the value of simple_math_problem().
gold_answer_from_tinygsm_code(code, timeout_s=1.0)
Return the numeric gold string for post-hoc eval (empty if the reference code fails).
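
One way to run a code string under a time limit and turn failures into an empty gold string is sketched below. It is an assumption-laden illustration rather than the module's method: the helper names `run_simple_math_problem` and `gold_string_sketch`, the use of a child interpreter via `subprocess`, and returning the value as its `repr` are all hypothetical choices.

```python
import subprocess
import sys
from typing import Optional

# Small driver executed in a child interpreter: it reads the code from stdin,
# defines it, calls simple_math_problem(), and prints the result.
_RUNNER = """
import sys
namespace = {}
exec(sys.stdin.read(), namespace)
print(repr(namespace["simple_math_problem"]()))
"""


def run_simple_math_problem(code: str, timeout_s: float = 1.0) -> Optional[str]:
    """Return repr of simple_math_problem()'s value, or None on any failure."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", _RUNNER],
            input=code,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None  # the child is killed once the timeout expires
    if proc.returncode != 0:
        return None  # the code raised or did not define the function
    lines = proc.stdout.strip().splitlines()
    return lines[-1] if lines else None


def gold_string_sketch(code: str, timeout_s: float = 1.0) -> str:
    """Gold string for evaluation; empty when the reference code fails."""
    value = run_simple_math_problem(code, timeout_s)
    return value if value is not None else ""
```

Running the code in a separate process keeps a runaway loop from hanging the evaluator, at the cost of interpreter start-up time per sample. It is not a security sandbox.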
tinygsm_pred_preprocess_fn(example, tokenizer, *, sep='\n', gold_timeout_s=1.0)
Prepare TinyGSM rows for seq2seq prediction: question prefix, empty suffix, numeric gold.
Gold is computed once by executing the reference code (PUMA/TinyGSM convention).
evaluate_samples(sample, answer, timeout_s=1.0)
Return True if executing sample yields the gold numeric answer.
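
The comparison step can be illustrated on its own, separate from code execution. The sketch below assumes that both the predicted value and the gold answer arrive as strings and that a small absolute tolerance is acceptable. The helper name `numeric_match` and the tolerance value are hypothetical and are not taken from the module.

```python
import math
from typing import Optional


def numeric_match(pred: Optional[str], gold: str, tol: float = 1e-4) -> bool:
    """True when pred and gold parse as numbers that agree within tol."""
    if pred is None or gold == "":
        return False  # execution failed or no gold answer is available
    try:
        return math.isclose(float(pred), float(gold), rel_tol=0.0, abs_tol=tol)
    except (TypeError, ValueError):
        return False  # a non-numeric value counts as incorrect
```

Comparing parsed floats rather than raw strings lets `72` and `72.0` count as the same answer.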
prediction_code_text(pred)
Return model output text to execute for code-exec scoring.
Prefers generated_text (suffix-only decode from FlexMDM). Falls back to
text (full sequence including the question prefix) for older logs.
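
The preference order described above can be sketched as a small lookup. The function name `pick_code_text` is hypothetical, and treating an empty `generated_text` as missing is an assumption, not confirmed behavior.

```python
from typing import Any, Mapping


def pick_code_text(pred: Mapping[str, Any]) -> str:
    """Prefer 'generated_text'; fall back to 'text'; else an empty string."""
    generated = pred.get("generated_text")
    if generated:  # assumption: an empty string also triggers the fallback
        return str(generated)
    return str(pred.get("text") or "")
```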
evaluate_sample_with_details(sample, answer, timeout_s=1.0)
Like evaluate_samples but returns (correct, pred_value, error).
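
A rough sketch of what a `(correct, pred_value, error)` result might look like is shown below. For brevity it executes the sample in the current process without any timeout, which the real function's `timeout_s` parameter suggests it does not do. The name `evaluate_with_details_sketch`, the error-string format, and the tolerance are assumptions. It also assumes the generated code defines `simple_math_problem()`, as the TinyGSM reference code does.

```python
import math
from typing import Any, Optional, Tuple


def evaluate_with_details_sketch(
    sample: str, answer: str, tol: float = 1e-4
) -> Tuple[bool, Optional[Any], Optional[str]]:
    """Return (correct, pred_value, error) for one generated code sample."""
    try:
        namespace: dict = {}
        exec(sample, namespace)  # no timeout or isolation in this sketch
        pred_value = namespace["simple_math_problem"]()
    except Exception as exc:
        return False, None, f"{type(exc).__name__}: {exc}"
    try:
        correct = math.isclose(
            float(pred_value), float(answer), rel_tol=0.0, abs_tol=tol
        )
    except (TypeError, ValueError) as exc:
        return False, pred_value, f"{type(exc).__name__}: {exc}"
    return correct, pred_value, None
```

Returning the error text alongside the boolean makes it possible to separate wrong answers from crashes when inspecting evaluation logs.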