xlm.tasks.composite_eval
Composite post-hoc evaluator that routes to task-specific evaluators.
Usage in Hydra config::
    post_hoc_evaluator:
      _target_: xlm.tasks.composite_eval.CompositePostHocEvaluator
      evaluators:
        math500_prediction:
          _target_: xlm.tasks.math500.Math500Eval
        denovo_prediction:
          _target_: xlm.tasks.safe_molgen.DeNovoEval
          use_bracket_safe: true
For a single dataloader pattern, you may supply a dict of named sub-evaluators (composed in YAML; they run in key order)::
    evaluators:
      prediction:
        mauve:
          _target_: xlm.tasks.owt.mauve_text_eval.MauveTextEval
        gen_ppl:
          _target_: xlm.tasks.owt.generative_perplexity_post_hoc.GenerativePerplexityPostHocEval
          ...
A list of evaluators is still supported for the same pattern, and a single
evaluator instance behaves as before. The aggregated_metrics dicts returned by the
sub-evaluators are merged; on duplicate keys the later sub-evaluator wins and a
warning is issued.
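The merge rule can be illustrated with a minimal sketch. The helper name `merge_metrics` and the metric values are hypothetical and used only for illustration; the actual implementation in `xlm` may differ in how it warns or logs.

```python
import warnings
from typing import Any, Dict, Iterable


def merge_metrics(results: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge metric dicts in run order (hypothetical helper, not the xlm API).

    On a duplicate key the later dict wins and a warning is emitted.
    """
    merged: Dict[str, Any] = {}
    for metrics in results:
        for key, value in metrics.items():
            if key in merged:
                warnings.warn(
                    f"Duplicate metric key {key!r}: later sub-evaluator wins."
                )
            merged[key] = value
    return merged


# Made-up metric values from two sub-evaluators that share one key.
merged = merge_metrics(
    [
        {"mauve": 0.5, "num_samples": 10},
        {"gen_ppl": 20.0, "num_samples": 12},
    ]
)
print(merged)
```

Here `num_samples` appears in both dicts, so the value from the second sub-evaluator is kept and one warning is raised. Giving sub-evaluators distinct metric keys avoids the overwrite.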
CompositePostHocEvaluator
Routes eval() calls to task-specific evaluator(s) chosen by dataloader name.
The evaluators dict maps a pattern (substring) to one evaluator instance, a
list of instances, or a dict of name → instance (names are for structure and
ordering only; run order follows dict / YAML key order).
When eval() is called with a dataloader_name, the first pattern that is a
substring of the name is selected. If nothing matches, the predictions are returned
unchanged with empty metrics.
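The selection rule described above might look roughly like the following sketch. The function names `normalize` and `select_evaluators`, the string stand-ins for evaluator instances, and the dataloader names are assumptions made for illustration; they are not taken from the `xlm` source.

```python
from typing import Any, Dict, List


def normalize(entry: Any) -> List[Any]:
    """Turn one evaluator, a list, or a name -> evaluator dict into a list."""
    if isinstance(entry, dict):
        # Names only give structure; run order follows key order.
        return list(entry.values())
    if isinstance(entry, (list, tuple)):
        return list(entry)
    return [entry]


def select_evaluators(evaluators: Dict[str, Any], dataloader_name: str) -> List[Any]:
    """Return the evaluators for the first pattern contained in the name."""
    for pattern, entry in evaluators.items():
        if pattern in dataloader_name:
            return normalize(entry)
    # No match: the caller would return predictions unchanged with empty metrics.
    return []


# Strings stand in for evaluator instances in this sketch.
evaluators = {
    "math500_prediction": "math_eval",
    "prediction": {"mauve": "mauve_eval", "gen_ppl": "ppl_eval"},
}

print(select_evaluators(evaluators, "val/math500_prediction"))
print(select_evaluators(evaluators, "val/prediction"))
print(select_evaluators(evaluators, "train/lm"))
```

Because the first matching pattern wins, the order of keys matters when one pattern is a substring of another: here `"prediction"` would also match `"val/math500_prediction"`, but `"math500_prediction"` is listed first and is therefore selected.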
This is a drop-in replacement for a single evaluator: the existing
Harness.compute_post_hoc_metrics passes dataloader_name through,
and evaluators that don't use it simply ignore the kwarg.
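As a sketch of what "ignoring the kwarg" can mean in practice, a sub-evaluator may accept arbitrary keyword arguments and never read `dataloader_name`. The class `LengthEval`, its `eval` signature, and the `(predictions, metrics)` return shape are assumptions for illustration only; the documentation above does not specify the exact evaluator interface.

```python
from typing import Any, Dict, List, Tuple


class LengthEval:
    """Hypothetical evaluator that does not depend on the dataloader name."""

    def eval(
        self, predictions: List[Dict[str, Any]], **kwargs: Any
    ) -> Tuple[List[Dict[str, Any]], Dict[str, float]]:
        # dataloader_name, if passed, ends up in kwargs and is simply unused.
        lengths = [len(p["text"]) for p in predictions]
        mean_length = sum(lengths) / len(lengths) if lengths else 0.0
        return predictions, {"mean_length": mean_length}


preds = [{"text": "abc"}, {"text": "hello"}]
_, metrics = LengthEval().eval(preds, dataloader_name="val/prediction")
print(metrics)
```

The same evaluator can be called with or without `dataloader_name`, which is what lets it be used both standalone and inside a composite.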
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `evaluators` | `Dict[str, Any]` | Mapping from dataloader-name substring to one evaluator, a list of evaluators, or a dict of evaluators. Each must implement | required |