OWT MLM

Dataset

Experiment config: experiment=owt_mlm (datamodule: owt_mlm).

Training and validation use OpenWebText, pre-tokenized with the GPT-2 tokenizer and filtered to sequences of at most 1,024 tokens. The processed split is hosted on Hugging Face as dhruveshpatel/owt-gpt2-1024-split: 10k examples are held out for validation (split seed 2357), and the remainder is used for training.

Setting	Value
Tokenizer	GPT-2 (`gpt2`)
Block size	1,024
Batching	Per-device batch size 32; global batch size 512
Train split	`dhruveshpatel/owt-gpt2-1024-split/train`
Val split	`dhruveshpatel/owt-gpt2-1024-split/validation`
Train collator	`DefaultMLMCollator` (random MLM masking, truncate to block, EOS appended)
Unconditional eval	`MLMEmptyDataset` (`unconditional_prediction` dataloader; empty prompts, max length 1,024)

Training runs for up to 1M steps, with validation every 50k steps. Checkpoints are saved every 2,500 steps, and the checkpoint at every 100k steps is kept permanently.
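As a quick sanity check on the schedule above, the intervals translate into the following event counts for a run that reaches the full 1M steps. This is plain shell arithmetic on the numbers stated in this section; the variable names are illustrative and are not xlm config keys.

```shell
# Illustrative variable names (not xlm config keys); values come from the text above.
TRAIN_STEPS=1000000   # upper bound on training steps
VAL_EVERY=50000       # validation interval
CKPT_EVERY=2500       # checkpoint interval
KEEP_EVERY=100000     # interval of checkpoints kept permanently

NUM_VAL=$((TRAIN_STEPS / VAL_EVERY))
NUM_CKPT=$((TRAIN_STEPS / CKPT_EVERY))
NUM_KEPT=$((TRAIN_STEPS / KEEP_EVERY))

echo "validation runs:       ${NUM_VAL}"    # 20
echo "checkpoints written:   ${NUM_CKPT}"   # 400
echo "checkpoints kept:      ${NUM_KEPT}"   # 10
```

A run that stops early produces proportionally fewer of each.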

Training

W&B run: owt_mlm (owt_mlm_v1)

xlm job_name=owt_mlm job_type=train experiment=owt_mlm \
  per_device_batch_size=32 trainer_strategy=ddp trainer.devices=8 trainer.num_nodes=1 \
  ++trainer.precision=bf16-mixed compile=true loggers=wandb \
  +loggers.wandb.resume=allow +loggers.wandb.id=owt_mlm_v1
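
The command above processes 32 sequences on each of 8 devices per forward/backward pass, while the Dataset table lists a global batch size of 512. The sketch below works out the gradient accumulation factor this would imply, assuming the global batch is reached by accumulating gradients over several passes; how xlm actually derives this value is not stated here, so treat the formula as an assumption.

```shell
# Values from the training command and the Dataset table.
PER_DEVICE_BATCH_SIZE=32
DEVICES=8
NUM_NODES=1
GLOBAL_BATCH_SIZE=512

# Sequences processed in one forward/backward pass across all devices.
SAMPLES_PER_PASS=$((PER_DEVICE_BATCH_SIZE * DEVICES * NUM_NODES))

# Assumption: global batch = samples per pass x accumulation steps.
ACCUM_STEPS=$((GLOBAL_BATCH_SIZE / SAMPLES_PER_PASS))

echo "samples per pass:            ${SAMPLES_PER_PASS}"   # 256
echo "implied accumulation steps:  ${ACCUM_STEPS}"        # 2
```

Under the same assumption, changing trainer.devices without changing per_device_batch_size changes the accumulation factor rather than the global batch size.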

Evaluation

The checkpoint is loaded from the Hugging Face Hub (dhruveshpatel/mlm-owt, revision step-800000); set +hub.repo_id and +hub.revision to evaluate other checkpoints.

Single eval (set HUB_REVISION, TOP_P, and MAX_STEPS, the sampling budget \(T\)):

HUB_REVISION=step-800000
TOP_P=0.95
MAX_STEPS=1024
STEP="${HUB_REVISION#step-}"

xlm job_name=owt_mlm_eval_${HUB_REVISION}_${TOP_P}_${MAX_STEPS} \
  job_type=eval experiment=owt_mlm \
  ++eval.checkpoint_path=None ++eval.split=validation \
  trainer_strategy=single_device ++trainer.precision=32-true compile=false \
  loggers=wandb +loggers.wandb.resume=allow +loggers.wandb.id=null \
  '~datamodule.dataset_managers.val.lm' \
  +hub.repo_id=dhruveshpatel/mlm-owt +hub.revision=${HUB_REVISION} \
  +post_hoc_evaluator@post_hoc_evaluator.evaluators.prediction.gen_ppl=gen_ppl_gpt2_large \
  +post_hoc_evaluator@post_hoc_evaluator.evaluators.prediction.mauve=mauve_text \
  post_hoc_evaluator.evaluators.prediction.mauve.human_text_source=hf_streaming \
  ++predictor.top_p=${TOP_P} ++predictor.top_k=null predictor.max_steps=${MAX_STEPS} \
  +tags.checkpoint=${HUB_REVISION} +tags.step=${STEP} +tags.eval_type=nll \
  paths.log_dir=logs/eval \
  datamodule.dataset_managers.val.unconditional_prediction.num_examples=1000 \
  model.force_flash_attn=false \
  per_device_batch_size=16 global_batch_size=16
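
To make the variable handling in the command above concrete, the sketch below expands the same variables without calling xlm. It shows how STEP is derived from HUB_REVISION by prefix removal and what job name results. The batch count at the end assumes the final partial batch is still generated, which is not confirmed by this page; NUM_EXAMPLES, BATCH_SIZE, and NUM_BATCHES are illustrative names.

```shell
HUB_REVISION=step-800000
TOP_P=0.95
MAX_STEPS=1024

# ${VAR#pattern} strips the shortest match of pattern from the front of VAR.
STEP="${HUB_REVISION#step-}"
JOB_NAME="owt_mlm_eval_${HUB_REVISION}_${TOP_P}_${MAX_STEPS}"

# Illustrative names; values mirror num_examples=1000 and per_device_batch_size=16.
NUM_EXAMPLES=1000
BATCH_SIZE=16
# Ceiling division (assumes the last, partial batch is not dropped).
NUM_BATCHES=$(( (NUM_EXAMPLES + BATCH_SIZE - 1) / BATCH_SIZE ))

echo "STEP=${STEP}"                 # STEP=800000
echo "JOB_NAME=${JOB_NAME}"         # JOB_NAME=owt_mlm_eval_step-800000_0.95_1024
echo "NUM_BATCHES=${NUM_BATCHES}"   # NUM_BATCHES=63
```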

Reproduce the Results table below (\(p=0.95\), \(T \in \{128,256,512,1024\}\), checkpoint step-800000):

HUB_REVISION=step-800000
TOP_P=0.95
STEP="${HUB_REVISION#step-}"

for MAX_STEPS in 128 256 512 1024; do
  xlm job_name=owt_mlm_eval_${HUB_REVISION}_${TOP_P}_${MAX_STEPS} \
    job_type=eval experiment=owt_mlm \
    ++eval.checkpoint_path=None ++eval.split=validation \
    trainer_strategy=single_device ++trainer.precision=32-true compile=false \
    loggers=wandb +loggers.wandb.resume=allow +loggers.wandb.id=null \
    '~datamodule.dataset_managers.val.lm' \
    +hub.repo_id=dhruveshpatel/mlm-owt +hub.revision=${HUB_REVISION} \
    +post_hoc_evaluator@post_hoc_evaluator.evaluators.prediction.gen_ppl=gen_ppl_gpt2_large \
    +post_hoc_evaluator@post_hoc_evaluator.evaluators.prediction.mauve=mauve_text \
    post_hoc_evaluator.evaluators.prediction.mauve.human_text_source=hf_streaming \
    ++predictor.top_p=${TOP_P} ++predictor.top_k=null predictor.max_steps=${MAX_STEPS} \
    +tags.checkpoint=${HUB_REVISION} +tags.step=${STEP} +tags.eval_type=nll \
    paths.log_dir=logs/eval \
    datamodule.dataset_managers.val.unconditional_prediction.num_examples=1000 \
    model.force_flash_attn=false \
    per_device_batch_size=16 global_batch_size=16
done
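
Because the single-eval command and the loop share every override, one option is to wrap them in a small shell function. The sketch below is a hypothetical helper, not part of xlm: run_eval and DRY_RUN are names introduced here, and the overrides are copied verbatim from the commands above. By default it only prints each command so the sweep can be inspected first; set DRY_RUN=0 to execute.

```shell
# Hypothetical helper (not part of xlm). DRY_RUN=1 prints, DRY_RUN=0 executes.
DRY_RUN="${DRY_RUN:-1}"

# Usage: run_eval <hub_revision> <top_p> <max_steps>
# The body runs in a subshell so its variables do not leak into the caller.
run_eval() (
  hub_revision="$1"; top_p="$2"; max_steps="$3"
  step="${hub_revision#step-}"   # e.g. step-800000 -> 800000

  set -- xlm "job_name=owt_mlm_eval_${hub_revision}_${top_p}_${max_steps}" \
    job_type=eval experiment=owt_mlm \
    ++eval.checkpoint_path=None ++eval.split=validation \
    trainer_strategy=single_device ++trainer.precision=32-true compile=false \
    loggers=wandb +loggers.wandb.resume=allow +loggers.wandb.id=null \
    '~datamodule.dataset_managers.val.lm' \
    +hub.repo_id=dhruveshpatel/mlm-owt "+hub.revision=${hub_revision}" \
    +post_hoc_evaluator@post_hoc_evaluator.evaluators.prediction.gen_ppl=gen_ppl_gpt2_large \
    +post_hoc_evaluator@post_hoc_evaluator.evaluators.prediction.mauve=mauve_text \
    post_hoc_evaluator.evaluators.prediction.mauve.human_text_source=hf_streaming \
    "++predictor.top_p=${top_p}" ++predictor.top_k=null "predictor.max_steps=${max_steps}" \
    "+tags.checkpoint=${hub_revision}" "+tags.step=${step}" +tags.eval_type=nll \
    paths.log_dir=logs/eval \
    datamodule.dataset_managers.val.unconditional_prediction.num_examples=1000 \
    model.force_flash_attn=false \
    per_device_batch_size=16 global_batch_size=16

  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"    # print the command instead of running it
  else
    "$@"
  fi
)

for max_steps in 128 256 512 1024; do
  run_eval step-800000 0.95 "$max_steps"
done
```

The delete override is quoted so that shells which attempt tilde expansion on a leading ~ pass it through unchanged.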

Results

Unconditional generation metrics for MLM (masked diffusion) checkpoints. Gen. PPL is the perplexity of the generated samples under GPT-2 Large; entropy is the vocabulary entropy of the generated token distribution. Samples are drawn with nucleus sampling (\(p=0.95\)) at predictor budgets \(T \in \{128, 256, 512, 1024\}\); each setting uses 1,000 samples of up to 1,024 tokens.
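
To illustrate the entropy column, the sketch below computes the entropy of the empirical token distribution of a single sample, \(H = -\sum_v p_v \log p_v\), where \(p_v\) is the fraction of positions holding token \(v\). It assumes entropy is computed per sample over unigram token frequencies with the natural logarithm; the exact definition used by the evaluator is not given on this page. token_entropy is a hypothetical helper name, and the token IDs are toy values.

```shell
# Hypothetical helper: reads one token ID per line on stdin and prints the
# entropy (in nats) of the empirical token distribution.
token_entropy() {
  awk '
    { count[$1]++; total++ }          # tally occurrences of each token ID
    END {
      for (tok in count) {
        p = count[tok] / total        # empirical probability of this token
        entropy -= p * log(p)         # awk log() is the natural logarithm
      }
      printf "%.4f\n", entropy
    }'
}

# Toy sample: token 5 appears twice, tokens 7 and 9 once each.
printf '%s\n' 5 5 7 9 | token_entropy   # prints 1.0397
```

A sample that repeats a few tokens scores low, which is why entropy is reported alongside Gen. PPL: degenerate, repetitive text can achieve low perplexity.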

Checkpoint	Gen. PPL (↓)				Entropy (↑)				MAUVE (↑)
Budget \(T\)	128	256	512	1024	128	256	512	1024	128	256	512	1024
800k	78.34	63.27	55.82	54.25	4.09	4.48	4.70	4.86	0.39	0.68	0.76	0.86

Source: W&B project rl-discrete-diffusion/correctors (tags.checkpoint=step-800000, top_p=0.95).