mdlm.datamodule_mdlm
MDLMEmptyDataset
Bases: IterableDataset
__init__(tokenizer, num_examples, max_length)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| tokenizer_kwargs | | Keyword arguments for the tokenizer. | required |

TODO: Might want the option to add BOS.
DefaultMDLMCollator
Bases: Collator
Used for MDLM pre-training with padded-truncated sequences.
Batch
- input_ids: Integer[TT, " batch seq_len"]: The input for the model with masks.
- attention_mask: Integer[TT, " batch seq_len"]: 1 for tokens that are not padding.
- target_ids: Integer[TT, " batch seq_len"]: The target ids for the model, where the input is copied as is and masks are replaced with the correct token.
Padding
- Padding is done on the right.
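The batch layout above can be sketched as follows. This is an illustrative toy (not the actual collator implementation); the token ids and helper name are made up, but it shows how input_ids, target_ids, and attention_mask relate, with padding on the right.

```python
# Hypothetical sketch of one row of a DefaultMDLMCollator batch.
# MASK_ID and PAD_ID are placeholder ids, not the library's values.
MASK_ID = 103
PAD_ID = 0

def make_mdlm_batch_row(tokens, mask_positions, max_len):
    """Build one row: masked inputs, original targets, right padding."""
    input_ids = [MASK_ID if i in mask_positions else t
                 for i, t in enumerate(tokens)]
    # target_ids: the input copied as is, with masks replaced by the
    # correct (original) token.
    target_ids = list(tokens)
    attention_mask = [1] * len(tokens)
    # Padding is done on the right.
    pad = max_len - len(tokens)
    input_ids += [PAD_ID] * pad
    target_ids += [PAD_ID] * pad
    attention_mask += [0] * pad
    return input_ids, attention_mask, target_ids
```

For example, `make_mdlm_batch_row([5, 6, 7], {1}, 5)` masks position 1 in the input while the target row keeps the original token there.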
MDLMSeq2SeqTrainCollator
Bases: Collator
MDLM training for seq2seq tasks.
Batch
- input_ids: Integer[TT, " batch seq_len"]: The input for the model with masks.
- attention_mask: Integer[TT, " batch seq_len"]: 1 for tokens that are not padding.
- target_ids: Integer[TT, " batch seq_len"]: The target ids for the model, where the input is copied as is and masks are replaced with the correct token.
Padding
- Padding is done on the right.
MDLMSeq2SeqPredCollator
Bases: Collator
Input contains only the prefix and target_ids contain only the suffix if present.
How is this different from MDLMSeq2SeqTrainCollator? MDLMSeq2SeqTrainCollator's input_ids contain the joined sequence, and its target_ids contain the target for the whole sequence. MDLMSeq2SeqPredCollator's input_ids contain only the prefix, and its target_ids contain only the suffix, if present.
Batch
- input_ids: Integer[TT, " batch seq_len"]: Input contains only the prefix
- attention_mask: Integer[TT, " batch seq_len"]: 1 for tokens that are not padding.
- target_ids: Integer[TT, " batch seq_len"]: Target contains only the suffix if present.
- noise_rate: Float[TT, " batch"]: The noise rate for the model.
- total_noise: Float[TT, " batch"]: The total noise for the model.
- t: Float[TT, " batch"]: The time step for the model.
Padding
- There is padding on both sides because all prefixes end at the same position.
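The two-sided padding can be illustrated with a toy alignment (an assumption-laden sketch, not the collator itself): prefixes are left-padded so they all end at the same position, and the region to be generated is filled with mask tokens on the right.

```python
# Illustrative only: PAD_ID and MASK_ID are placeholder ids, and
# align_prefixes is a made-up helper, not part of the library.
PAD_ID = 0
MASK_ID = 103

def align_prefixes(prefixes, suffix_len):
    """Left-pad prefixes to a common end position, then append masks."""
    max_prefix = max(len(p) for p in prefixes)
    batch_input, batch_mask = [], []
    for p in prefixes:
        left_pad = max_prefix - len(p)
        # Left padding so every prefix ends at the same position;
        # the suffix region on the right is all masks.
        row = [PAD_ID] * left_pad + p + [MASK_ID] * suffix_len
        mask = [0] * left_pad + [1] * (len(p) + suffix_len)
        batch_input.append(row)
        batch_mask.append(mask)
    return batch_input, batch_mask
```

With `align_prefixes([[1, 2, 3], [4]], 2)`, both prefixes end at index 2 and the two mask slots line up across the batch.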
prepare_prefix_ids(prefix_ids, pad_token_id, max_seq_len=None, truncate='block')
Prepare prefix ids for seq2seq tasks.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| prefix_ids | List[List[int]] | | required |
| pad_token_id | int | | required |
| max_seq_len | Optional[int] | | None |
| truncate | Literal['max', 'block', None] | | 'block' |

Note: If truncated, prefixes are truncated from the left.

Returns:

Dict[str, TT]:
- input_ids: Integer[TT, " batch seq_len"]
- attention_mask: Integer[TT, " batch seq_len"]
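A simplified sketch of the documented behavior (not the library implementation, and omitting the 'max'/'block' truncation modes, whose exact semantics are not specified here): prefixes are truncated from the left, then left-padded to a common length.

```python
# Hypothetical simplification of prepare_prefix_ids: left truncation
# to max_seq_len, then left padding. Lists stand in for tensors.
def prepare_prefix_ids_sketch(prefix_ids, pad_token_id, max_seq_len=None):
    if max_seq_len is not None:
        # Truncation keeps the rightmost tokens (truncate from the left).
        prefix_ids = [p[-max_seq_len:] for p in prefix_ids]
    width = max(len(p) for p in prefix_ids)
    input_ids = [[pad_token_id] * (width - len(p)) + p
                 for p in prefix_ids]
    attention_mask = [[0] * (width - len(p)) + [1] * len(p)
                      for p in prefix_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```

For example, with `max_seq_len=3` the prefix `[1, 2, 3, 4]` keeps its last three tokens, while `[5]` is left-padded to the same width.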
prepare_prefix_suffix_ids(prefix_ids, suffix_ids, noise_schedule, pad_token_id, mask_token_id, eos_token_id=None, bos_token_id=None, max_seq_len=None, truncate='block', loss_on_padding=True, bos_location='after_prefix')
Prepare concatenated prefix and suffix ids for seq2seq tasks, with padding on the right only.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| loss_on_padding | bool | If true, the pad token is treated as a normal token: it receives attention and is predicted as a target token. If false, it receives no attention and is not predicted as a target token (-100). | True |
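The loss_on_padding switch can be sketched as follows (a hedged toy, not the actual code): with the flag off, pad positions lose attention and their targets become -100 so the loss ignores them.

```python
# Illustrative helper; -100 is the conventional ignore index for
# cross-entropy losses. apply_loss_on_padding is a made-up name.
IGNORE_INDEX = -100

def apply_loss_on_padding(target_ids, pad_token_id, loss_on_padding):
    """Return (target_ids, attention_mask) for one sequence."""
    if loss_on_padding:
        # Pad token is treated as a normal token: attended and predicted.
        return list(target_ids), [1] * len(target_ids)
    # Pad positions: no attention, and target set to the ignore index.
    targets = [IGNORE_INDEX if t == pad_token_id else t
               for t in target_ids]
    mask = [0 if t == pad_token_id else 1 for t in target_ids]
    return targets, mask
```

So `apply_loss_on_padding([5, 6, 0, 0], 0, False)` zeroes attention and ignores the loss on the two trailing pads, while `True` leaves them as ordinary tokens.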
print_batch_mdlm(batch, split, tokenizer, dataloader_name='')
Print batch information for debugging MDLM batches.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| batch | Dict[str, Any] | The batch to print. | required |
| split | Literal['train', 'val', 'test', 'predict'] | The split name. | required |
| tokenizer | Tokenizer | The tokenizer to decode tokens. | required |
| dataloader_name | str | Name of the dataloader. | '' |
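What such a debug printer does can be sketched like this (an assumption-heavy stand-in, not print_batch_mdlm itself): decode the id rows back to text per split. The toy tokenizer and its vocab are invented for the example.

```python
# Stand-in tokenizer with a minimal decode method; the real one would
# come from the training setup.
class ToyTokenizer:
    vocab = {0: "[PAD]", 103: "[MASK]", 5: "hello", 6: "world"}

    def decode(self, ids):
        return " ".join(self.vocab.get(i, "[UNK]") for i in ids)

def describe_batch(batch, split, tokenizer, dataloader_name=""):
    """Render input/target rows of a batch as readable lines."""
    lines = [f"split={split} dataloader={dataloader_name!r}"]
    for key in ("input_ids", "target_ids"):
        for row in batch.get(key, []):
            lines.append(f"{key}: {tokenizer.decode(row)}")
    return "\n".join(lines)
```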