xlm.tasks.owt.generative_perplexity_post_hoc
Post-hoc generative perplexity (judge LM) evaluation for Harness predictions.
Reads logged prediction rows (same JSONL as other post-hoc evaluators), scores
text with one or more xlm.generative_perplexity.GenerativePerplexityEvaluator
instances, and returns per-row fields plus aggregated metrics.
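The read, score, and aggregate flow described above can be sketched as follows. This is a minimal illustration, not the module's implementation: the `text` field name, the `<name>/perplexity` key format, the `Judge` callable signature, and the token-weighted aggregation are all assumptions, and `uniform_judge` is a toy stand-in for a real causal LM.

```python
import json
import math
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical judge signature: text -> per-token natural-log probabilities.
Judge = Callable[[str], List[float]]


def uniform_judge(text: str) -> List[float]:
    """Toy stand-in for a causal LM: each whitespace token gets probability 1/4."""
    return [math.log(0.25)] * len(text.split())


def score_rows(
    jsonl_lines: Iterable[str],
    judges: Dict[str, Judge],
    text_key: str = "text",  # assumed field name, not taken from the library
) -> Tuple[List[dict], Dict[str, float]]:
    """Return (rows with per-row perplexity fields, aggregated metrics)."""
    rows: List[dict] = []
    # Per judge: [summed negative log-likelihood, token count].
    totals = {name: [0.0, 0] for name in judges}
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines in the JSONL input
        row = json.loads(line)
        for name, judge in judges.items():
            logps = judge(row[text_key])
            if not logps:
                continue  # empty text has no defined perplexity
            nll = -sum(logps)
            # Perplexity is exp of the mean negative log-likelihood per token.
            row[f"{name}/perplexity"] = math.exp(nll / len(logps))
            totals[name][0] += nll
            totals[name][1] += len(logps)
        rows.append(row)
    # Assumed aggregation: token-weighted perplexity over all rows.
    metrics = {
        f"{name}/perplexity": math.exp(nll / count)
        for name, (nll, count) in totals.items()
        if count > 0
    }
    return rows, metrics
```

With the toy judge every row scores a perplexity of about 4, since each token is assigned probability 1/4. A real judge would replace `uniform_judge` with log-probabilities from a pretrained causal LM.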
Judge LMs run on default_judge_device, or on a per-evaluator override from
judge_devices, as set in this class's config. They do not inherit the training
module's device.
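One plausible way to read the device rule is a per-name lookup with a fallback. The helper below, `resolve_judge_device`, is hypothetical and only illustrates that reading; the class may resolve devices differently.

```python
from typing import Dict, Optional


def resolve_judge_device(
    name: str,
    default_judge_device: str = "cuda",
    judge_devices: Optional[Dict[str, str]] = None,
) -> str:
    """Pick the device for one judge: per-evaluator override first, then the default."""
    overrides = judge_devices or {}  # None means "no overrides"
    return overrides.get(name, default_judge_device)
```

For example, with `judge_devices={"big_judge": "cuda:1"}`, the judge named `big_judge` would be placed on `cuda:1` while every other judge stays on the default device.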
GenerativePerplexityPostHocEval
Score generated text with external causal LMs (generative perplexity judges).
__init__(evaluators, default_judge_device='cuda', judge_devices=None, metric_prefix='')
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| evaluators | Dict[str, GenerativePerplexityEvaluator] | Names -> instantiated GenerativePerplexityEvaluator objects. | required |
| default_judge_device | str | Device string used for judges unless overridden. | 'cuda' |
| judge_devices | Optional[Dict[str, str]] | Optional per-evaluator device overrides (name -> device string). | None |
| metric_prefix | str | Optional prefix for every key in the returned metrics. | '' |
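As a rough illustration of the `metric_prefix` parameter, the sketch below prepends the prefix to each metric key. The helper name `apply_metric_prefix` is hypothetical, and plain string concatenation with no added separator is an assumption; the class may join the prefix differently.

```python
from typing import Dict


def apply_metric_prefix(
    metrics: Dict[str, float], metric_prefix: str = ""
) -> Dict[str, float]:
    """Return a new dict with metric_prefix prepended to every key."""
    # Assumption: the prefix is concatenated as-is, so include any separator in it.
    return {f"{metric_prefix}{key}": value for key, value in metrics.items()}
```

Under this assumption, a prefix such as `"owt/"` turns a key `perplexity` into `owt/perplexity`, and the default empty prefix leaves keys unchanged.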