
mdlm.datamodule_mdlm

MDLMEmptyDataset

Bases: IterableDataset

__init__(tokenizer, num_examples, max_length)

Parameters:

- tokenizer_kwargs (required): Keyword arguments for the tokenizer.

TODO: Might want the option to add BOS.

DefaultMDLMCollator

Bases: Collator

Used for MDLM pre-training with padded-truncated sequences.

Batch
  1. input_ids: Integer[TT, " batch seq_len"]: The input for the model with masks.
  2. attention_mask: Integer[TT, " batch seq_len"]: 1 for tokens that are not padding.
  3. target_ids: Integer[TT, " batch seq_len"]: The target ids for the model: the input is copied as-is, and masked positions are replaced with the correct token.
Padding
  • Padding is done on the right.
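The batch layout above can be sketched in plain Python; the `collate_mdlm` name, the `PAD`/`MASK` ids, and the `mask_prob` argument are illustrative assumptions, not the library's actual API:

```python
import random

PAD, MASK = 0, 103  # hypothetical special-token ids

def collate_mdlm(sequences, seq_len, mask_prob=0.15, seed=0):
    """Pad-truncate each sequence to seq_len on the right, then mask a
    random subset of real tokens. target_ids copy the input as-is, so
    masked positions hold the correct token."""
    rng = random.Random(seed)
    batch = {"input_ids": [], "attention_mask": [], "target_ids": []}
    for seq in sequences:
        seq = list(seq[:seq_len])                       # truncate
        n_pad = seq_len - len(seq)                      # right padding
        batch["target_ids"].append(seq + [PAD] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        masked = [MASK if rng.random() < mask_prob else t for t in seq]
        batch["input_ids"].append(masked + [PAD] * n_pad)
    return batch
```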

MDLMSeq2SeqTrainCollator

Bases: Collator

MDLM training for seq2seq tasks.

Batch
  1. input_ids: Integer[TT, " batch seq_len"]: The input for the model with masks.
  2. attention_mask: Integer[TT, " batch seq_len"]: 1 for tokens that are not padding.
  3. target_ids: Integer[TT, " batch seq_len"]: The target ids for the model: the input is copied as-is, and masked positions are replaced with the correct token.
Padding
  • Padding is done on the right.
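A minimal sketch of the joined-sequence layout for seq2seq training, under the assumption that masking may fall anywhere in the joined sequence (whether the prefix is maskable is an implementation detail of the real collator); names and ids are illustrative:

```python
import random

PAD, MASK = 0, 103  # hypothetical special-token ids

def collate_seq2seq_train(prefixes, suffixes, seq_len, mask_prob=0.15, seed=0):
    """Join each prefix with its suffix, pad-truncate on the right;
    target_ids cover the whole joined sequence."""
    rng = random.Random(seed)
    batch = {"input_ids": [], "attention_mask": [], "target_ids": []}
    for pre, suf in zip(prefixes, suffixes):
        seq = (list(pre) + list(suf))[:seq_len]         # joined sequence
        n_pad = seq_len - len(seq)                      # right padding
        batch["target_ids"].append(seq + [PAD] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # masking sketched over the whole joined sequence (assumption)
        masked = [MASK if rng.random() < mask_prob else t for t in seq]
        batch["input_ids"].append(masked + [PAD] * n_pad)
    return batch
```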

MDLMSeq2SeqPredCollator

Bases: Collator

Input contains only the prefix and target_ids contain only the suffix if present.

How is this different from MDLMSeq2SeqTrainCollator? MDLMSeq2SeqTrainCollator's input_ids contain the joined sequence, and its target_ids cover the whole sequence; MDLMSeq2SeqPredCollator's input_ids contain only the prefix, and its target_ids contain only the suffix, if present.

Batch
  1. input_ids: Integer[TT, " batch seq_len"]: Input contains only the prefix.
  2. attention_mask: Integer[TT, " batch seq_len"]: 1 for tokens that are not padding.
  3. target_ids: Integer[TT, " batch seq_len"]: Target contains only the suffix if present.
  4. noise_rate: Float[TT, " batch"]: The noise rate for the model.
  5. total_noise: Float[TT, " batch"]: The total noise for the model.
  6. t: Float[TT, " batch"]: The time step for the model.
Padding
  • There is padding on both sides because all prefixes end at the same position.
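A sketch of the two-sided padding described above: prefixes are left-padded so they all end at the same position, and suffix targets are right-padded. The `noise_rate`, `total_noise`, and `t` entries of the real batch are omitted, and all names/ids are illustrative:

```python
PAD = 0  # hypothetical pad-token id

def collate_seq2seq_pred(prefixes, suffixes=None):
    """Left-pad prefixes to a common length; right-pad optional suffixes."""
    plen = max(len(p) for p in prefixes)
    input_ids = [[PAD] * (plen - len(p)) + list(p) for p in prefixes]
    attention_mask = [[0] * (plen - len(p)) + [1] * len(p) for p in prefixes]
    batch = {"input_ids": input_ids, "attention_mask": attention_mask}
    if suffixes is not None:
        slen = max(len(s) for s in suffixes)
        batch["target_ids"] = [list(s) + [PAD] * (slen - len(s)) for s in suffixes]
    return batch
```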

prepare_prefix_ids(prefix_ids, pad_token_id, max_seq_len=None, truncate='block')

Prepare prefix ids for seq2seq tasks.

Parameters:

- prefix_ids (List[List[int]], required)
- pad_token_id (int, required)
- max_seq_len (Optional[int], default None)
- truncate (Literal['max', 'block', None], default 'block'):
  • "max": Truncate to max(max_seq_len, max_in_batch); when max_seq_len is not provided, it is the max in the batch.
  • "block": Pad-truncate to max_seq_len.
  • None: Pad to the max in the batch.

Note: Prefixes, if truncated, are truncated from the left.

Returns:

Dict[str, TT]:
  • input_ids: Integer[TT, " batch seq_len"]
  • attention_mask: Integer[TT, " batch seq_len"]
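The three truncate modes can be sketched as follows, implementing the description above literally (including left-truncation and left-padding); `prepare_prefix_ids_sketch`, `_fit`, and `PAD` are illustrative stand-ins, not the library's API:

```python
PAD = 0  # hypothetical pad-token id

def _fit(seq, length, pad_token_id):
    """Left-truncate, then left-pad, a single prefix to `length`."""
    seq = list(seq)[-length:]
    return [pad_token_id] * (length - len(seq)) + seq

def prepare_prefix_ids_sketch(prefix_ids, pad_token_id=PAD,
                              max_seq_len=None, truncate="block"):
    batch_max = max(len(p) for p in prefix_ids)
    if truncate == "block":
        length = max_seq_len                 # pad-truncate to max_seq_len
    elif truncate == "max":
        # per the description: max(max_seq_len, max_in_batch),
        # with max_seq_len defaulting to the batch max when not provided
        length = max(max_seq_len, batch_max) if max_seq_len is not None else batch_max
    else:                                    # None: pad to the batch max
        length = batch_max
    input_ids, attention_mask = [], []
    for p in prefix_ids:
        input_ids.append(_fit(p, length, pad_token_id))
        n_real = min(len(p), length)
        attention_mask.append([0] * (length - n_real) + [1] * n_real)
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```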

prepare_prefix_suffix_ids(prefix_ids, suffix_ids, noise_schedule, pad_token_id, mask_token_id, eos_token_id=None, bos_token_id=None, max_seq_len=None, truncate='block', loss_on_padding=True, bos_location='after_prefix')

Prepare concatenated prefix and suffix ids for seq2seq tasks, with padding on the right only.

Parameters:

- loss_on_padding (bool, default True):
  • If true, the pad token is treated as a normal token: it is attended to and predicted as a target token.
  • If false, it receives no attention and is not predicted as a target token (its target is set to -100).
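A sketch of what the loss_on_padding flag implies for one target sequence, assuming the conventional -100 ignore index for cross-entropy losses; the helper name and ids are hypothetical:

```python
PAD = 0       # hypothetical pad-token id
IGNORE = -100 # conventional ignore-index for cross-entropy losses

def pad_targets(seq, n_pad, loss_on_padding=True):
    """Right-pad one target sequence according to loss_on_padding."""
    if loss_on_padding:
        # pad token is a normal token: attended to and predicted
        target = list(seq) + [PAD] * n_pad
        attn = [1] * (len(seq) + n_pad)
    else:
        # pad is ignored: no attention, ignore-index target (no loss)
        target = list(seq) + [IGNORE] * n_pad
        attn = [1] * len(seq) + [0] * n_pad
    return target, attn
```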

print_batch_mdlm(batch, split, tokenizer, dataloader_name='')

Print batch information for debugging MDLM batches.

Parameters:

- batch (Dict[str, Any], required): The batch to print.
- split (Literal['train', 'val', 'test', 'predict'], required): The split name.
- tokenizer (Tokenizer, required): The tokenizer to decode tokens.
- dataloader_name (str, default ''): Name of the dataloader.
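A minimal sketch of such a debug printer, with a `decode` callable standing in for the tokenizer; the function returns the summary string instead of printing, purely for illustration:

```python
def print_batch_sketch(batch, split, decode, dataloader_name=""):
    """Build a human-readable summary of a batch: split, dataloader name,
    per-key shapes, and the decoded first input row."""
    lines = [f"[{split}] dataloader={dataloader_name!r}"]
    for key, rows in batch.items():
        lines.append(f"{key}: {len(rows)} x {len(rows[0])}")
    lines.append("first input: " + decode(batch["input_ids"][0]))
    return "\n".join(lines)
```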