xlm.tasks.owt.mauve_text_eval

MAUVE post-hoc text evaluation for xlm-core Harness / LogPredictions.

Computes `MAUVE <https://arxiv.org/abs/2102.01454>`_ between human/reference strings and model generations using `mauve-text`_ (import mauve).
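As a rough sketch of what this amounts to, the snippet below assembles keyword arguments for ``mauve.compute_mauve`` from two lists of strings. The helper names ``build_mauve_kwargs`` and ``run_mauve`` are hypothetical and not part of xlm-core; the defaults mirror the ``MauveTextEval`` parameters documented below, and the evaluator's actual internals may differ.

```python
from typing import Any, Dict, List


def build_mauve_kwargs(
    references: List[str],
    generations: List[str],
    swap_p_q: bool = False,
    featurize_model_name: str = "gpt2-large",
    device_id: int = 0,
    max_text_length: int = 256,
    batch_size: int = 8,
    num_buckets: Any = "auto",
    seed: int = 25,
    verbose: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: map evaluator-style options onto compute_mauve kwargs."""
    # mauve-text convention: p is the human side, q is the machine side.
    p_text, q_text = references, generations
    if swap_p_q:
        # Optionally treat generations as p and references as q.
        p_text, q_text = q_text, p_text
    return {
        "p_text": p_text,
        "q_text": q_text,
        "featurize_model_name": featurize_model_name,
        "device_id": device_id,
        "max_text_length": max_text_length,
        "batch_size": batch_size,
        "num_buckets": num_buckets,
        "seed": seed,
        "verbose": verbose,
    }


def run_mauve(references: List[str], generations: List[str], **options: Any) -> float:
    """Hypothetical wrapper: compute the MAUVE score for two lists of strings."""
    # Imported lazily so build_mauve_kwargs works without mauve-text installed.
    import mauve

    out = mauve.compute_mauve(**build_mauve_kwargs(references, generations, **options))
    return float(out.mauve)
```

Keeping the argument mapping separate from the call makes the p/q orientation easy to inspect before paying for featurization, which downloads the featurizer model on first use.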

Human / reference text can come from:

  1. Each prediction row (truth, reference, …, or reference_field), including decoded target_ids when the tokenizer is passed; or
  2. human_text_source: hf_streaming, which streams strings from a HuggingFace dataset split (default: OWT validation). This mirrors the standalone Proseco eval, where the human side comes from the validation loader rather than the JSONL.
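The per-row lookup in option 1 can be pictured roughly as follows. This is a minimal sketch, not the evaluator's code: ``resolve_reference``, ``CANDIDATE_KEYS``, and the ``decode`` callable are hypothetical names, and the candidate order is an assumption, since the documented key list is open-ended.

```python
from typing import Any, Callable, Mapping, Optional, Sequence

# Assumed candidate order. The documented list is open-ended, so the real
# evaluator may check more keys or use a different order.
CANDIDATE_KEYS = ("truth", "reference", "target_text", "ground_truth_middle")


def resolve_reference(
    row: Mapping[str, Any],
    reference_field: Optional[str] = None,
    decode: Optional[Callable[[Sequence[int]], str]] = None,
) -> Optional[str]:
    """Hypothetical helper: pick the human/reference string for one prediction row."""
    if reference_field is not None:
        # An explicit key wins; no fallback search is attempted.
        value = row.get(reference_field)
        return value if isinstance(value, str) and value.strip() else None
    for key in CANDIDATE_KEYS:
        value = row.get(key)
        if isinstance(value, str) and value.strip():
            return value  # first non-empty candidate
    ids = row.get("target_ids")
    if decode is not None and ids:
        # Only possible when a tokenizer (here: any decode callable) is supplied.
        return decode(ids)
    return None
```

With a Hugging Face tokenizer, ``decode`` could be something like ``lambda ids: tokenizer.decode(ids, skip_special_tokens=True)``; whether xlm-core decodes this way is not stated in the source.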

Example Hydra defaults::

    defaults:
      - your_experiment
      - /post_hoc_evaluator@post_hoc_evaluator.evaluators.prediction.mauve: mauve_text

Or instantiate explicitly (no composite)::

    post_hoc_evaluator:
      _target_: xlm.tasks.owt.mauve_text_eval.MauveTextEval

Install::

    pip install "xlm-core[mauve]"

(or pip install mauve-text).

.. _mauve-text: https://pypi.org/project/mauve-text/

MauveTextEval

Post-hoc evaluator: MAUVE between references and model text.

Parameters:

.. list-table::
   :header-rows: 1

   * - Name
     - Type
     - Description
     - Default
   * - reference_field
     - Optional[str]
     - Batch / prediction key for human text. If None, the first non-empty among truth, reference, target_text, ground_truth_middle, etc., or decoded target_ids when tokenizer is passed.
     - None
   * - generated_field
     - str
     - Key for model output (default text).
     - 'text'
   * - featurize_model_name
     - str
     - HF model name for MAUVE features (see mauve-text).
     - 'gpt2-large'
   * - device_id
     - int
     - GPU id for featurization, or -1 for CPU.
     - 0
   * - max_text_length
     - int
     - Max tokens per string for the featurizer.
     - 256
   * - batch_size
     - int
     - Featurization batch size.
     - 8
   * - verbose
     - bool
     - Forwarded to mauve.compute_mauve.
     - False
   * - num_buckets
     - Any
     - Histogram size ("auto" or int).
     - 'auto'
   * - seed
     - int
     - RNG seed for k-means.
     - 25
   * - swap_p_q
     - bool
     - If True, treat generations as p_text and references as q_text (library default is human p, machine q).
     - False
   * - human_text_source
     - Optional[str]
     - If "hf_streaming", build p_text from a HF dataset split instead of per-row references (Proseco-style). If None, use truth / reference_field / etc. on each row.
     - None
   * - hf_dataset_path
     - str
     - Dataset id for streaming (OWT default).
     - 'dhruveshpatel/owt-gpt2-1024-split'
   * - hf_split
     - str
     - Split name, e.g. validation.
     - 'validation'
   * - hf_text_column
     - str
     - Column with raw text.
     - 'text'
   * - hf_shuffle_seed
     - int
     - Seed for streaming shuffle.
     - 42
   * - hf_shuffle_buffer_size
     - int
     - Shuffle buffer for streaming.
     - 10000
   * - hf_min_chars
     - int
     - Skip shorter snippets.
     - 8
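
The hf_* parameters above suggest a streaming pipeline along these lines. This is a sketch under assumptions rather than the evaluator's implementation: ``iter_human_texts`` and ``stream_human_texts`` are hypothetical names, ``num_texts`` is an illustrative argument, and comparing the stripped length against the minimum is a guess at how short snippets are skipped.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List, Mapping


def iter_human_texts(
    rows: Iterable[Mapping[str, Any]],
    text_column: str = "text",
    min_chars: int = 8,
) -> Iterator[str]:
    """Hypothetical helper: yield usable strings from dataset rows."""
    for row in rows:
        text = row.get(text_column)
        # Assumption: snippets are measured after stripping whitespace.
        if isinstance(text, str) and len(text.strip()) >= min_chars:
            yield text


def stream_human_texts(
    num_texts: int,
    path: str = "dhruveshpatel/owt-gpt2-1024-split",
    split: str = "validation",
    text_column: str = "text",
    shuffle_seed: int = 42,
    shuffle_buffer_size: int = 10000,
    min_chars: int = 8,
) -> List[str]:
    """Hypothetical helper: collect num_texts human strings from a streamed split."""
    # Imported lazily so the filter above is usable without `datasets` installed.
    from datasets import load_dataset

    stream = load_dataset(path, split=split, streaming=True)
    # Streaming shuffle is approximate: it samples from a rolling buffer.
    stream = stream.shuffle(seed=shuffle_seed, buffer_size=shuffle_buffer_size)
    return list(islice(iter_human_texts(stream, text_column, min_chars), num_texts))
```

Taking as many human strings as there are generations would keep the two sides balanced; the source does not say how many strings the evaluator actually draws.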