xlm.tasks.owt.generative_perplexity_post_hoc

Post-hoc generative perplexity (judge LM) evaluation for Harness predictions.

Reads logged prediction rows (same JSONL as other post-hoc evaluators), scores text with one or more xlm.generative_perplexity.GenerativePerplexityEvaluator instances, and returns per-row fields plus aggregated metrics.
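The read, score, and aggregate flow described above could look roughly like the following sketch. It is illustrative only: the row field name `text`, the helper names (`read_rows`, `score_rows`, `toy_judge`), the output keys, and the token-weighted perplexity aggregation are all assumptions, and the toy judge stands in for a real GenerativePerplexityEvaluator, whose interface is not documented here.

```python
import json
import math
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical judge interface: text -> per-token natural-log probabilities.
Judge = Callable[[str], List[float]]


def toy_judge(text: str) -> List[float]:
    """Stand-in scorer: gives every whitespace token probability 0.5."""
    return [math.log(0.5) for _ in text.split()]


def read_rows(lines: Iterable[str]) -> List[dict]:
    """Parse JSONL content, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def score_rows(
    rows: List[dict],
    judges: Dict[str, Judge],
    text_key: str = "text",  # assumed field name, not the real schema
) -> Tuple[List[dict], Dict[str, float]]:
    """Return (per-row fields, aggregated metrics) for each named judge."""
    per_row: List[dict] = []
    total_nll = {name: 0.0 for name in judges}
    total_tokens = {name: 0 for name in judges}

    for row in rows:
        out = dict(row)  # keep the logged fields, add judge fields
        for name, judge in judges.items():
            logprobs = judge(row[text_key])
            nll = -sum(logprobs)
            total_nll[name] += nll
            total_tokens[name] += len(logprobs)
            out[f"{name}/nll"] = nll
            out[f"{name}/num_tokens"] = len(logprobs)
        per_row.append(out)

    # Perplexity = exp(total negative log-likelihood / total tokens).
    aggregated = {
        f"{name}/perplexity": math.exp(total_nll[name] / total_tokens[name])
        for name in judges
        if total_tokens[name] > 0
    }
    return per_row, aggregated
```

With a file on disk, `read_rows(open(path))` would feed the same pipeline, since iterating a file object yields its lines.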

Judge LMs run on default_judge_device, or on the per-evaluator overrides in judge_devices, as set in the GenerativePerplexityPostHocEval config. They do not use the training module's device.

GenerativePerplexityPostHocEval

Score generated text with external causal LMs (generative perplexity judges).

__init__(evaluators, default_judge_device='cuda', judge_devices=None, metric_prefix='')

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| evaluators | Dict[str, GenerativePerplexityEvaluator] | Mapping from evaluator name to an instantiated GenerativePerplexityEvaluator object. | required |
| default_judge_device | str | Device string for judges (e.g. "cuda:1", "cpu"). Use "auto" to select CUDA when available and fall back to CPU otherwise. | 'cuda' |
| judge_devices | Optional[Dict[str, str]] | Optional per-evaluator device overrides (name -> device string). | None |
| metric_prefix | str | Optional prefix for every key in aggregated_metrics. | '' |
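To make the interaction between default_judge_device, judge_devices, and metric_prefix concrete, here is a minimal sketch of how they might be resolved. The helper names `resolve_judge_device` and `apply_metric_prefix`, the evaluator name "gpt2", the handling of "auto" inside an override, and the plain string concatenation of the prefix are assumptions for illustration, not the class's actual implementation.

```python
from typing import Callable, Dict, Optional


def resolve_judge_device(
    name: str,
    default_judge_device: str = "cuda",
    judge_devices: Optional[Dict[str, str]] = None,
    # Injected so the sketch runs without PyTorch installed; in practice
    # torch.cuda.is_available could be passed here.
    cuda_available: Callable[[], bool] = lambda: False,
) -> str:
    """Pick the device for one named judge (hypothetical helper)."""
    # A per-evaluator override wins over the default.
    device = (judge_devices or {}).get(name, default_judge_device)
    # Assumption: "auto" is resolved the same way for overrides and default.
    if device == "auto":
        return "cuda" if cuda_available() else "cpu"
    return device


def apply_metric_prefix(
    metrics: Dict[str, float], metric_prefix: str = ""
) -> Dict[str, float]:
    """Prepend metric_prefix to every aggregated metric key.

    Assumption: the prefix is concatenated as-is, with no separator added.
    """
    return {f"{metric_prefix}{key}": value for key, value in metrics.items()}


if __name__ == "__main__":
    overrides = {"gpt2": "cpu"}  # hypothetical evaluator name
    print(resolve_judge_device("gpt2", "cuda:1", overrides))
    print(resolve_judge_device("other", "cuda:1", overrides))
    print(apply_metric_prefix({"perplexity": 2.0}, "posthoc/"))
```

Under these assumptions, a judge named in judge_devices runs on its override, and every other judge falls back to default_judge_device.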