mdlm.datamodule_mdlm
MDLMEmptyDataset
Bases: IterableDataset
__init__(tokenizer, num_examples, max_length)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| tokenizer_kwargs | | Keyword arguments for the tokenizer. | required |

TODO: Might want the option to add BOS.
DefaultMDLMCollator
Bases: Collator
Used for MDLM pre-training with padded-truncated sequences.
Batch
- input_ids: Integer[TT, " batch seq_len"]: The input for the model with masks.
- attention_mask: Integer[TT, " batch seq_len"]: 1 for tokens that are not padding.
- target_ids: Integer[TT, " batch seq_len"]: The target ids for the model, where the input is copied as is and masks are replaced with the correct token.
Padding
- Padding is done on the right.
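The batch layout above can be sketched as follows. This is an illustrative toy (not the actual collator implementation); the token ids and helper name are made up, but it shows how input_ids, target_ids, and attention_mask relate, with padding on the right.

```python
# Hypothetical sketch of one row of a DefaultMDLMCollator batch.
# MASK_ID and PAD_ID are placeholder ids, not the library's values.
MASK_ID = 103
PAD_ID = 0

def make_mdlm_batch_row(tokens, mask_positions, max_len):
    """Build one row: masked inputs, original targets, right padding."""
    input_ids = [MASK_ID if i in mask_positions else t
                 for i, t in enumerate(tokens)]
    # target_ids: the input copied as is, with masks replaced by the
    # correct (original) token.
    target_ids = list(tokens)
    attention_mask = [1] * len(tokens)
    # Padding is done on the right.
    pad = max_len - len(tokens)
    input_ids += [PAD_ID] * pad
    target_ids += [PAD_ID] * pad
    attention_mask += [0] * pad
    return input_ids, attention_mask, target_ids
```

For example, `make_mdlm_batch_row([5, 6, 7], {1}, 5)` masks position 1 in the input while the target row keeps the original token there.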
MDLMSeq2SeqTrainCollator
Bases: Collator
MDLM training for seq2seq tasks.
Batch
- input_ids: Integer[TT, " batch seq_len"]: The input for the model with masks.
- attention_mask: Integer[TT, " batch seq_len"]: 1 for tokens that are not padding.
- target_ids: Integer[TT, " batch seq_len"]: The target ids for the model, where the input is copied as is and masks are replaced with the correct token.
Padding
- Padding is done on the right.
MDLMSeq2SeqPredCollator
Bases: Collator
Input contains only the prefix and target_ids contain only the suffix if present.
How is this different from MDLMSeq2SeqTrainCollator? MDLMSeq2SeqTrainCollator's input_ids contain the joined sequence, and its target_ids contain the target for the whole sequence. MDLMSeq2SeqPredCollator's input_ids contain only the prefix, and its target_ids contain only the suffix, if present.
Batch
- input_ids: Integer[TT, " batch seq_len"]: Input contains only the prefix
- attention_mask: Integer[TT, " batch seq_len"]: 1 for tokens that are not padding.
- target_ids: Integer[TT, " batch seq_len"]: Target contains only the suffix if present.
- noise_rate: Float[TT, " batch"]: The noise rate for the model.
- total_noise: Float[TT, " batch"]: The total noise for the model.
- t: Float[TT, " batch"]: The time step for the model.
Padding
- There is padding on both sides because all prefixes end at the same position.
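The two-sided padding can be illustrated with a toy alignment (an assumption-laden sketch, not the collator itself): prefixes are left-padded so they all end at the same position, and the region to be generated is filled with mask tokens on the right.

```python
# Illustrative only: PAD_ID and MASK_ID are placeholder ids, and
# align_prefixes is a made-up helper, not part of the library.
PAD_ID = 0
MASK_ID = 103

def align_prefixes(prefixes, suffix_len):
    """Left-pad prefixes to a common end position, then append masks."""
    max_prefix = max(len(p) for p in prefixes)
    batch_input, batch_mask = [], []
    for p in prefixes:
        left_pad = max_prefix - len(p)
        # Left padding so every prefix ends at the same position;
        # the suffix region on the right is all masks.
        row = [PAD_ID] * left_pad + p + [MASK_ID] * suffix_len
        mask = [0] * left_pad + [1] * (len(p) + suffix_len)
        batch_input.append(row)
        batch_mask.append(mask)
    return batch_input, batch_mask
```

With `align_prefixes([[1, 2, 3], [4]], 2)`, both prefixes end at index 2 and the two mask slots line up across the batch.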
prepare_prefix_ids(prefix_ids, pad_token_id, max_seq_len=None, truncate='block')
Prepare prefix ids for seq2seq tasks.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| prefix_ids | List[List[int]] | | required |
| pad_token_id | int | | required |
| max_seq_len | Optional[int] | | None |
| truncate | Literal['max', 'block', None] | | 'block' |

Note: If truncated, prefixes are truncated from the left.

Returns:

Dict[str, TT]:
- input_ids: Integer[TT, " batch seq_len"]
- attention_mask: Integer[TT, " batch seq_len"]
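A simplified sketch of the documented behavior (not the library implementation, and omitting the 'max'/'block' truncation modes, whose exact semantics are not specified here): prefixes are truncated from the left, then left-padded to a common length.

```python
# Hypothetical simplification of prepare_prefix_ids: left truncation
# to max_seq_len, then left padding. Lists stand in for tensors.
def prepare_prefix_ids_sketch(prefix_ids, pad_token_id, max_seq_len=None):
    if max_seq_len is not None:
        # Truncation keeps the rightmost tokens (truncate from the left).
        prefix_ids = [p[-max_seq_len:] for p in prefix_ids]
    width = max(len(p) for p in prefix_ids)
    input_ids = [[pad_token_id] * (width - len(p)) + p
                 for p in prefix_ids]
    attention_mask = [[0] * (width - len(p)) + [1] * len(p)
                      for p in prefix_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```

For example, with `max_seq_len=3` the prefix `[1, 2, 3, 4]` keeps its last three tokens, while `[5]` is left-padded to the same width.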
prepare_prefix_suffix_ids(prefix_ids, suffix_ids, noise_schedule, pad_token_id, mask_token_id, eos_token_id=None, bos_token_id=None, max_seq_len=None, truncate='block', loss_on_padding=True, bos_location='after_prefix')
Prepare concatenated prefix and suffix ids for seq2seq tasks, with padding on the right only.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| loss_on_padding | bool | If true, the pad token is treated as a normal token: it receives attention and is predicted as a target token. If false, it receives no attention and is not predicted as a target token (-100). | True |
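The loss_on_padding switch can be sketched as follows (a hedged toy, not the actual code): with the flag off, pad positions lose attention and their targets become -100 so the loss ignores them.

```python
# Illustrative helper; -100 is the conventional ignore index for
# cross-entropy losses. apply_loss_on_padding is a made-up name.
IGNORE_INDEX = -100

def apply_loss_on_padding(target_ids, pad_token_id, loss_on_padding):
    """Return (target_ids, attention_mask) for one sequence."""
    if loss_on_padding:
        # Pad token is treated as a normal token: attended and predicted.
        return list(target_ids), [1] * len(target_ids)
    # Pad positions: no attention, and target set to the ignore index.
    targets = [IGNORE_INDEX if t == pad_token_id else t
               for t in target_ids]
    mask = [0 if t == pad_token_id else 1 for t in target_ids]
    return targets, mask
```

So `apply_loss_on_padding([5, 6, 0, 0], 0, False)` zeroes attention and ignores the loss on the two trailing pads, while `True` leaves them as ordinary tokens.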
print_batch_mdlm(batch, split, tokenizer, dataloader_name='')
Print batch information for debugging MDLM batches.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| batch | Dict[str, Any] | The batch to print. | required |
| split | Literal['train', 'val', 'test', 'predict'] | The split name. | required |
| tokenizer | Tokenizer | The tokenizer to decode tokens. | required |
| dataloader_name | str | Name of the dataloader. | '' |
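What such a debug printer does can be sketched like this (an assumption-heavy stand-in, not print_batch_mdlm itself): decode the id rows back to text per split. The toy tokenizer and its vocab are invented for the example.

```python
# Stand-in tokenizer with a minimal decode method; the real one would
# come from the training setup.
class ToyTokenizer:
    vocab = {0: "[PAD]", 103: "[MASK]", 5: "hello", 6: "world"}

    def decode(self, ids):
        return " ".join(self.vocab.get(i, "[UNK]") for i in ids)

def describe_batch(batch, split, tokenizer, dataloader_name=""):
    """Render input/target rows of a batch as readable lines."""
    lines = [f"split={split} dataloader={dataloader_name!r}"]
    for key in ("input_ids", "target_ids"):
        for row in batch.get(key, []):
            lines.append(f"{key}: {tokenizer.decode(row)}")
    return "\n".join(lines)
```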