# xlm.tasks.math500

MATH-500 evaluation task.

Provides:

- `Math500Eval`: post-hoc evaluator for the MATH-500 benchmark.
- `math500_preprocess_fn`: dataset map function that constructs fewshot prompts and tokenizes them for use as a prediction dataloader.

Dependencies (install when using this module): `pip install math_verify`
## Math500Eval

Post-hoc evaluator for the MATH-500 benchmark.

Reads predictions with `text` (the model generation) and `truth` (the gold answer), extracts mathematical expressions from both, and checks equivalence using the `math_verify` library.

Uses the same verification logic as prd2's `math_verify_utils.process_results`: `math_verify.parse` to extract a structured answer, then `math_verify.verify` to check equivalence.
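A minimal sketch of that parse-then-verify step, assuming the public `math_verify` API named above; the specific strings are illustrative:

```python
# Minimal sketch of the verification step used by Math500Eval,
# assuming the public math_verify API (pip install math_verify).
from math_verify import parse, verify

gold = parse("$\\frac{1}{2}$")          # structured gold answer
answer = parse("The answer is $0.5$.")  # extract expression from model text

# verify() checks mathematical equivalence rather than string equality,
# so numerically equal forms such as 0.5 and 1/2 should match.
print(verify(gold, answer))  # True
```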
Hydra config example:

```yaml
post_hoc_evaluator:
  _target_: xlm.tasks.math500.Math500Eval
```

Or inside a `CompositePostHocEvaluator`:

```yaml
post_hoc_evaluator:
  _target_: xlm.tasks.composite_eval.CompositePostHocEvaluator
  evaluators:
    math500_prediction:
      _target_: xlm.tasks.math500.Math500Eval
```
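Either config resolves to an evaluator instance through Hydra's standard instantiation. A sketch, with the inline config mirroring the first example above:

```python
# Sketch: building the configured evaluator with Hydra utilities.
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {"post_hoc_evaluator": {"_target_": "xlm.tasks.math500.Math500Eval"}}
)
evaluator = instantiate(cfg.post_hoc_evaluator)  # -> Math500Eval instance
```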
### `eval(predictions, tokenizer=None, **kwargs)`

Score each prediction against the gold answer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `predictions` | `List[Dict[str, Any]]` | List of dicts, each containing at least `text` (model generation) and `truth` (gold answer). | required |
| `tokenizer` | `Any` | Unused, kept for interface consistency. | `None` |
Returns:

| Type | Description |
|---|---|
| `Tuple[List[Dict[str, Any]], Dict[str, Any]]` | The predictions list, updated in-place with per-example score fields, and a dict of aggregate metrics. |
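A usage sketch, assuming a no-argument constructor (as the Hydra config above suggests); the file name is illustrative and the metric keys in the returned dict are not specified here:

```python
import json

from xlm.tasks.math500 import Math500Eval

# Each JSONL row carries at least "text" (the model generation) and
# "truth" (the gold answer); extra fields pass through untouched.
with open("predictions.jsonl") as f:
    predictions = [json.loads(line) for line in f]

evaluator = Math500Eval()
predictions, metrics = evaluator.eval(predictions)  # scores added in-place
print(metrics)
```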
## `math500_preprocess_fn(example, tokenizer, *, num_fewshot=4, dataset_path='HuggingFaceH4/MATH-500', fewshot_split='test', fewshot_seed=42, block_size=None)`

Build a fewshot prompt for a MATH-500 example and tokenize it.

Intended for use as a HuggingFace `datasets.Dataset.map` function via the xlm-core datamodule `preprocess_function` / `on_the_fly_processor` config knob.
The prompt format matches the lm-evaluation-harness yaml for MATH-500 (`discrete-diffusion/src/dd/tasks/math500/math500.yaml`):

```text
Problem: {problem_1}
Answer:{solution_1}
...
Problem: {problem_N}
Answer:{solution_N}
Problem: {current_problem}
Answer:
```
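A sketch of the assembly step implied by this template; `build_prompt` is a hypothetical helper, and the blank-line separator between shots is an assumption (the harness yaml referenced above is authoritative):

```python
from typing import Dict, List

def build_prompt(fewshot_examples: List[Dict[str, str]], current_problem: str) -> str:
    # Hypothetical helper: each shot renders as "Problem: ...\nAnswer:{solution}",
    # with no space after "Answer:" to match the template exactly.
    parts = [
        f"Problem: {ex['problem']}\nAnswer:{ex['solution']}"
        for ex in fewshot_examples
    ]
    parts.append(f"Problem: {current_problem}\nAnswer:")
    # Assumed separator between shots; adjust to match the harness yaml.
    return "\n\n".join(parts)
```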
The gold answer is stored in `answer` so that `LogPredictions` can carry it through to the predictions JSONL via `additional_fields_from_batch`, where `Math500Eval` picks it up as `truth`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `example` | `Dict[str, Any]` | A single dataset row with `problem` and `solution` fields. | required |
| `tokenizer` | `Any` | Tokenizer used to encode the constructed prompt. | required |
| `num_fewshot` | `int` | Number of fewshot examples to prepend. | `4` |
| `dataset_path` | `str` | HuggingFace dataset path for fewshot examples. | `'HuggingFaceH4/MATH-500'` |
| `fewshot_split` | `str` | Split to draw fewshot examples from. | `'test'` |
| `fewshot_seed` | `int` | Seed for shuffling the fewshot pool. | `42` |
| `block_size` | `Optional[int]` | If set, truncate the tokenized prompt to this length. | `None` |
Returns:

| Type | Description |
|---|---|
| `Dict[str, Any]` | Dict with the tokenized prompt fields (a prompt-only prediction input) and `answer` holding the gold answer. |
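A usage sketch wiring the function into `datasets.Dataset.map`; the tokenizer choice and `block_size` value are illustrative:

```python
from functools import partial

from datasets import load_dataset
from transformers import AutoTokenizer

from xlm.tasks.math500 import math500_preprocess_fn

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative choice
ds = load_dataset("HuggingFaceH4/MATH-500", split="test")

# Bind the keyword-only options, then map row by row; each row gains
# the tokenized fewshot prompt and the gold answer field.
ds = ds.map(
    partial(
        math500_preprocess_fn,
        tokenizer=tokenizer,
        num_fewshot=4,
        block_size=1024,
    )
)
```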
Ref: https://github.com/dhruvdcoder/discrete-diffusion/src/dd/tasks/math500/math_verify_utils.py