
xlm.tasks.math500

MATH-500 evaluation task.

Provides:

- `Math500Eval`: post-hoc evaluator for the MATH-500 benchmark.
- `math500_preprocess_fn`: dataset map function that constructs fewshot prompts and tokenizes them for use as a prediction dataloader.

Dependencies (install when using this module): pip install math_verify

Math500Eval

Post-hoc evaluator for the MATH-500 benchmark.

Reads predictions with `text` (model generation) and `truth` (gold answer), extracts mathematical expressions from both, and checks equivalence using the `math_verify` library.

Uses the same verification logic as prd2's `math_verify_utils.process_results`: `math_verify.parse` to extract a structured answer, then `math_verify.verify` to check equivalence.
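
In isolation, that verification step looks like this (a minimal sketch; the input strings are illustrative)::

from math_verify import parse, verify

# Extract structured expressions from the gold answer and the model output.
gold = parse("$42$")
pred = parse("The answer is \\boxed{42}.")

# True when the two expressions are mathematically equivalent.
print(verify(gold, pred))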

Hydra config example::

post_hoc_evaluator:
  _target_: xlm.tasks.math500.Math500Eval

Or inside a `CompositePostHocEvaluator`::

post_hoc_evaluator:
  _target_: xlm.tasks.composite_eval.CompositePostHocEvaluator
  evaluators:
    math500_prediction:
      _target_: xlm.tasks.math500.Math500Eval
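
Either config resolves to an evaluator instance via Hydra's standard instantiation utility (a minimal sketch of what the framework does with the config node above)::

from hydra.utils import instantiate

# Instantiate the evaluator from a _target_ config node.
evaluator = instantiate({"_target_": "xlm.tasks.math500.Math500Eval"})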

eval(predictions, tokenizer=None, **kwargs)

Score each prediction against the gold answer.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `predictions` | `List[Dict[str, Any]]` | List of dicts, each containing at least `text` (raw model output string) and `truth` (gold answer string, e.g. `"$42$"`). | required |
| `tokenizer` | `Any` | Unused, kept for interface consistency. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `Tuple[List[Dict[str, Any]], Dict[str, Any]]` | `(predictions, aggregated_metrics)`; the prediction dicts are updated in-place with `parsed_answer`, `parsed_gold`, and `correct` fields. |
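
Usage sketch (the prediction dicts below are illustrative)::

from xlm.tasks.math500 import Math500Eval

predictions = [
    {"text": "... so the answer is \\boxed{42}.", "truth": "$42$"},
    {"text": "The result is \\boxed{7}.", "truth": "$8$"},
]

# Predictions are scored in-place; aggregated metrics are returned alongside.
preds, metrics = Math500Eval().eval(predictions)
print(preds[0]["parsed_answer"], preds[0]["correct"])
print(metrics)  # aggregated metrics over all predictions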

math500_preprocess_fn(example, tokenizer, *, num_fewshot=4, dataset_path='HuggingFaceH4/MATH-500', fewshot_split='test', fewshot_seed=42, block_size=None)

Build a fewshot prompt for a MATH-500 example and tokenize it.

Intended for use as a HuggingFace `datasets.Dataset.map` function via the xlm-core datamodule `preprocess_function` / `on_the_fly_processor` config knob.

The prompt format matches the lm-evaluation-harness yaml for MATH-500 (discrete-diffusion/src/dd/tasks/math500/math500.yaml)::

Problem: {problem_1}
Answer:{solution_1}

...

Problem: {problem_N}
Answer:{solution_N}

Problem: {current_problem}
Answer:
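
As a sketch (a hypothetical helper, not the module's exact code), the prompt above is assembled roughly like::

def build_prompt(fewshot_examples, current_problem):
    # One "Problem: ... / Answer:..." pair per fewshot example.
    parts = [
        f"Problem: {ex['problem']}\nAnswer:{ex['solution']}"
        for ex in fewshot_examples
    ]
    # The current problem ends with a bare "Answer:" for the model to complete.
    parts.append(f"Problem: {current_problem}\nAnswer:")
    return "\n\n".join(parts)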

The gold answer is stored in `answer` so that `LogPredictions` can carry it through to the predictions JSONL via `additional_fields_from_batch`, where `Math500Eval` picks it up as `truth`.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `example` | `Dict[str, Any]` | A single dataset row with `problem`, `solution`, and `answer` fields. | required |
| `tokenizer` | `Any` | Tokenizer with an `encode` method. | required |
| `num_fewshot` | `int` | Number of fewshot examples to prepend. | `4` |
| `dataset_path` | `str` | HuggingFace dataset path for fewshot examples. | `'HuggingFaceH4/MATH-500'` |
| `fewshot_split` | `str` | Split to draw fewshot examples from. | `'test'` |
| `fewshot_seed` | `int` | Seed for shuffling the fewshot pool. | `42` |
| `block_size` | `Optional[int]` | If set, truncate `prompt_ids` to this length. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `Dict[str, Any]` | Dict with `prompt_ids` (prompt), `target_ids` (suffix; empty for prompt-only prediction), and `answer` (str). |
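
Usage sketch with `datasets.Dataset.map` (the tokenizer choice and `remove_columns` handling are illustrative)::

from functools import partial

from datasets import load_dataset
from transformers import AutoTokenizer

from xlm.tasks.math500 import math500_preprocess_fn

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer with .encode
ds = load_dataset("HuggingFaceH4/MATH-500", split="test")
ds = ds.map(
    partial(math500_preprocess_fn, tokenizer=tokenizer, num_fewshot=4),
    remove_columns=ds.column_names,  # keep only prompt_ids / target_ids / answer
)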

Ref: https://github.com/dhruvdcoder/discrete-diffusion/src/dd/tasks/math500/math_verify_utils.py