xlm.tasks.composite_eval

Composite post-hoc evaluator that routes to task-specific evaluators.

Usage in Hydra config::

post_hoc_evaluator:
  _target_: xlm.tasks.composite_eval.CompositePostHocEvaluator
  evaluators:
    math500_prediction:
      _target_: xlm.tasks.math500.Math500Eval
    denovo_prediction:
      _target_: xlm.tasks.safe_molgen.DeNovoEval
      use_bracket_safe: true

A single dataloader pattern may also map to a dict of named sub-evaluators, which is convenient to compose in YAML; sub-evaluators run in key order::

evaluators:
  prediction:
    mauve:
      _target_: xlm.tasks.owt.mauve_text_eval.MauveTextEval
    gen_ppl:
      _target_: xlm.tasks.owt.generative_perplexity_post_hoc.GenerativePerplexityPostHocEval
      ...

A pattern may still map to a list of evaluators, and mapping a pattern to a single evaluator instance behaves as before. When several sub-evaluators run for one pattern, their returned aggregated_metrics dicts are merged; on a duplicate key the later sub-evaluator's value wins and a warning is emitted.
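The duplicate-key rule can be illustrated with a short sketch. This is not the library's code: `merge_aggregated_metrics` is a hypothetical helper, the metric names and values are invented, and emitting the warning through the standard `warnings` module is an assumption (the real class may log instead).

```python
import warnings
from typing import Any, Dict, Iterable


def merge_aggregated_metrics(results: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge metric dicts in run order; on a duplicate key the later value wins."""
    merged: Dict[str, Any] = {}
    for metrics in results:
        for key, value in metrics.items():
            if key in merged:
                # Assumption: the real evaluator may use a logger rather than warnings.
                warnings.warn(f"Duplicate metric key {key!r}; keeping the later value.")
            merged[key] = value
    return merged


# Hypothetical metric names and values, for illustration only.
merged = merge_aggregated_metrics([
    {"mauve": 0.5, "n_samples": 100},
    {"gen_ppl": 20.0, "n_samples": 200},
])
```

Here `n_samples` appears in both dicts, so the value from the second sub-evaluator is kept and one warning is raised. Giving each sub-evaluator distinct metric keys avoids the overwrite altogether.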

CompositePostHocEvaluator

Routes eval() calls to task-specific evaluator(s) chosen by dataloader name.

The evaluators dict maps a pattern (substring) to one evaluator instance, a list of instances, or a dict of name → instance (names are for structure and ordering only; run order follows dict / YAML key order).

When eval() is called with a dataloader_name, the first pattern that is a substring of the name is selected. If nothing matches, the predictions are returned unchanged with empty metrics.
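The selection rule described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the class's actual implementation: `select_evaluators` is a hypothetical helper, patterns are tried in dict order, and every entry is normalized to a list before use.

```python
from typing import Any, Dict, List


def select_evaluators(evaluators: Dict[str, Any], dataloader_name: str) -> List[Any]:
    """Return the evaluators of the first pattern contained in dataloader_name."""
    for pattern, entry in evaluators.items():
        if pattern in dataloader_name:
            if isinstance(entry, dict):
                return list(entry.values())  # named sub-evaluators, in key order
            if isinstance(entry, (list, tuple)):
                return list(entry)
            return [entry]  # single evaluator instance
    return []  # no match: the caller returns predictions unchanged, with empty metrics


# Strings stand in for evaluator instances; the sub-evaluator names are hypothetical.
routes = {
    "math500_prediction": "math_eval",
    "denovo_prediction": {"validity": "validity_eval", "novelty": "novelty_eval"},
}
```

Because matching is by substring and the first match wins, a broad pattern listed early can shadow a more specific one listed later, so ordering the more specific patterns first is the safer choice.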

This is a drop-in replacement for a single evaluator: the existing Harness.compute_post_hoc_metrics passes dataloader_name through, and evaluators that don't use it simply ignore the kwarg.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| evaluators | Dict[str, Any] | Mapping from dataloader-name substring to one evaluator, a list of evaluators, or a dict of evaluators. Each must implement eval(predictions, tokenizer=..., **kwargs). | required |
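As an illustration of the required interface, here is a minimal sketch of an evaluator that could be placed in the evaluators mapping. Everything specific in it is an assumption made for the example: `ExactMatchEval` is hypothetical and not part of xlm, the prediction fields `text` and `truth` are invented, and the `(predictions, metrics)` return shape is a guess that should be checked against the real evaluators.

```python
from typing import Any, Dict, List, Optional, Tuple


class ExactMatchEval:
    """Hypothetical evaluator used only to illustrate the eval() signature."""

    def eval(
        self,
        predictions: List[Dict[str, Any]],
        tokenizer: Optional[Any] = None,
        **kwargs: Any,  # absorbs dataloader_name and any other extra keyword arguments
    ) -> Tuple[List[Dict[str, Any]], Dict[str, float]]:
        if not predictions:
            return predictions, {}
        # Assumed fields: "text" is the model output, "truth" the reference answer.
        correct = sum(p["text"] == p["truth"] for p in predictions)
        return predictions, {"exact_match": correct / len(predictions)}


preds = [{"text": "4", "truth": "4"}, {"text": "5", "truth": "6"}]
_, metrics = ExactMatchEval().eval(
    preds, tokenizer=None, dataloader_name="math500_prediction"
)
```

Accepting `**kwargs` is what lets an evaluator ignore `dataloader_name` when it has no use for it.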