
xlm.tasks.math500

MATH-500 evaluation task.

Provides:

- `Math500Eval`: post-hoc evaluator for the MATH-500 benchmark.
- `math500_preprocess_fn`: dataset map function that constructs fewshot prompts and tokenizes them for use as a prediction dataloader.

Dependencies (install when using this module): pip install math_verify

Math500Eval

Post-hoc evaluator for the MATH-500 benchmark.

Reads predictions with `text` (model generation) and `truth` (gold answer), extracts mathematical expressions from both, and checks equivalence using the `math_verify` library.

Uses the same verification logic as prd2's `math_verify_utils.process_results`: `math_verify.parse` to extract a structured answer, then `math_verify.verify` to check equivalence.
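
In isolation, that verification step looks like this (a minimal sketch; the input strings are illustrative)::

from math_verify import parse, verify

# Extract structured expressions from the gold answer and the model output.
gold = parse("$42$")
pred = parse("The answer is \\boxed{42}.")

# True when the two expressions are mathematically equivalent.
print(verify(gold, pred))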

Hydra config example::

post_hoc_evaluator:
  _target_: xlm.tasks.math500.Math500Eval

Or inside a `CompositePostHocEvaluator`::

post_hoc_evaluator:
  _target_: xlm.tasks.composite_eval.CompositePostHocEvaluator
  evaluators:
    math500_prediction:
      _target_: xlm.tasks.math500.Math500Eval
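
Either config resolves to an evaluator instance via Hydra's standard instantiation utility (a minimal sketch of what the framework does with the config node above)::

from hydra.utils import instantiate

# Instantiate the evaluator from a _target_ config node.
evaluator = instantiate({"_target_": "xlm.tasks.math500.Math500Eval"})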

eval(predictions, tokenizer=None, **kwargs)

Score each prediction against the gold answer.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `predictions` | `List[Dict[str, Any]]` | List of dicts, each containing at least `text` (raw model output string) and `truth` (gold answer string, e.g. `"$42$"`). | required |
| `tokenizer` | `Any` | Unused, kept for interface consistency. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `Tuple[List[Dict[str, Any]], Dict[str, Any]]` | `(predictions, aggregated_metrics)`; the prediction dicts are updated in-place with `parsed_answer`, `parsed_gold`, and `correct` fields. |
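
Usage sketch (the prediction dicts below are illustrative)::

from xlm.tasks.math500 import Math500Eval

predictions = [
    {"text": "... so the answer is \\boxed{42}.", "truth": "$42$"},
    {"text": "The result is \\boxed{7}.", "truth": "$8$"},
]

# Predictions are scored in-place; aggregated metrics are returned alongside.
preds, metrics = Math500Eval().eval(predictions)
print(preds[0]["parsed_answer"], preds[0]["correct"])
print(metrics)  # aggregated metrics over all predictions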

math500_preprocess_fn(example, tokenizer, *, num_fewshot=4, dataset_path='HuggingFaceH4/MATH-500', fewshot_split='test', fewshot_seed=42, block_size=None)

Build a fewshot prompt for a MATH-500 example and tokenize it.

Intended for use as a HuggingFace `datasets.Dataset.map` function via the xlm-core datamodule `preprocess_function` / `on_the_fly_processor` config knob.

The prompt format matches the lm-evaluation-harness yaml for MATH-500 (discrete-diffusion/src/dd/tasks/math500/math500.yaml)::

Problem: {problem_1}
Answer:{solution_1}

...

Problem: {problem_N}
Answer:{solution_N}

Problem: {current_problem}
Answer:
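
As a sketch (a hypothetical helper, not the module's exact code), the prompt above is assembled roughly like::

def build_prompt(fewshot_examples, current_problem):
    # One "Problem: ... / Answer:..." pair per fewshot example.
    parts = [
        f"Problem: {ex['problem']}\nAnswer:{ex['solution']}"
        for ex in fewshot_examples
    ]
    # The current problem ends with a bare "Answer:" for the model to complete.
    parts.append(f"Problem: {current_problem}\nAnswer:")
    return "\n\n".join(parts)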

The gold answer is stored in `answer` so that `LogPredictions` can carry it through to the predictions JSONL via `additional_fields_from_batch`, where `Math500Eval` picks it up as `truth`.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `example` | `Dict[str, Any]` | A single dataset row with `problem`, `solution`, and `answer` fields. | required |
| `tokenizer` | `Any` | Tokenizer with an `encode` method. | required |
| `num_fewshot` | `int` | Number of fewshot examples to prepend. | `4` |
| `dataset_path` | `str` | HuggingFace dataset path for fewshot examples. | `'HuggingFaceH4/MATH-500'` |
| `fewshot_split` | `str` | Split to draw fewshot examples from. | `'test'` |
| `fewshot_seed` | `int` | Seed for shuffling the fewshot pool. | `42` |
| `block_size` | `Optional[int]` | If set, truncate `prompt_ids` to this length. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `Dict[str, Any]` | Dict with `prompt_ids` (prompt), `target_ids` (suffix; empty for prompt-only prediction), and `answer` (str). |
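
Usage sketch with `datasets.Dataset.map` (the tokenizer choice and `remove_columns` handling are illustrative)::

from functools import partial

from datasets import load_dataset
from transformers import AutoTokenizer

from xlm.tasks.math500 import math500_preprocess_fn

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer with .encode
ds = load_dataset("HuggingFaceH4/MATH-500", split="test")
ds = ds.map(
    partial(math500_preprocess_fn, tokenizer=tokenizer, num_fewshot=4),
    remove_columns=ds.column_names,  # keep only prompt_ids / target_ids / answer
)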

Ref: https://github.com/dhruvdcoder/discrete-diffusion/src/dd/tasks/math500/math_verify_utils.py