ilm.datamodule_ilm
ILMEmptyDataset
Bases: IterableDataset
__init__(tokenizer, num_examples)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tokenizer_kwargs | | Keyword arguments for the tokenizer. | required |
| empty_text | | For MLM, you will want to set the | required |
DefaultILMCollator
Bases: Collator
Used for pre-training.
ILMSeq2SeqCollator
Drops tokens from the suffix only.
ILMSeq2SeqPredCollator
Bases: ILMSeq2SeqCollator
Drops all suffix (target) tokens from the input and returns them in target_ids, which has shape (batch_size, target_seq_len).
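As a rough illustration of how variable-length targets can end up in a tensor-like batch of shape (batch_size, target_seq_len), here is a minimal sketch. The helper `pad_targets`, the `pad_token_id` argument, and the choice to right-pad to the longest target are all assumptions made for this example. They are not taken from ILMSeq2SeqPredCollator, which may batch its targets differently.

```python
from typing import List


def pad_targets(targets: List[List[int]], pad_token_id: int) -> List[List[int]]:
    """Hypothetical helper: right-pad each target to the longest one in the batch."""
    # target_seq_len is taken as the longest target (an assumption).
    target_seq_len = max((len(t) for t in targets), default=0)
    # Every row ends up with exactly target_seq_len entries.
    return [t + [pad_token_id] * (target_seq_len - len(t)) for t in targets]


if __name__ == "__main__":
    batch = pad_targets([[5, 6, 7], [8]], pad_token_id=0)
    print(len(batch), len(batch[0]))  # batch_size and target_seq_len
```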
ilm_drop_fn(segment_input_ids, bos_token_id, cls_token_id, global_offset=0, sample_n_drops_fn=_n_drop_uniformly, drop_indices_fn=_drop_uniformly)
Drops tokens from a single segment of a single sequence. Adds the bos token, and adds the cls token if requested.
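The behavior described above can be illustrated with a small, self-contained sketch. It is not the library's implementation: the name `toy_drop_fn`, the `rng` argument, the return values, the placement of cls before bos, and the use of `global_offset` to shift the reported positions are all assumptions made for illustration. Only the general idea (drop some tokens from one segment, add bos, optionally add cls) comes from the description.

```python
import random
from typing import List, Optional, Tuple


def toy_drop_fn(
    segment_input_ids: List[int],
    bos_token_id: int,
    cls_token_id: Optional[int] = None,
    global_offset: int = 0,
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], List[int], List[int]]:
    """Hypothetical illustration; not the library's ilm_drop_fn."""
    rng = rng or random.Random()
    n = len(segment_input_ids)
    # Assumption: the number of drops is sampled uniformly from 0..n.
    n_drops = rng.randint(0, n)
    # Assumption: dropped positions are chosen uniformly without replacement.
    drop_positions = sorted(rng.sample(range(n), n_drops))
    dropped = set(drop_positions)
    kept = [tok for i, tok in enumerate(segment_input_ids) if i not in dropped]
    # Add bos, and put cls in front of it when requested (placement is assumed).
    prefix = [bos_token_id] if cls_token_id is None else [cls_token_id, bos_token_id]
    # Report dropped positions shifted by global_offset (an assumed use of it).
    positions = [global_offset + i for i in drop_positions]
    dropped_tokens = [segment_input_ids[i] for i in drop_positions]
    return prefix + kept, positions, dropped_tokens


if __name__ == "__main__":
    ids, positions, tokens = toy_drop_fn(
        [10, 11, 12, 13], bos_token_id=1, cls_token_id=2, rng=random.Random(0)
    )
    print(ids, positions, tokens)
```

Passing a seeded `random.Random` keeps the sketch reproducible. The real function instead takes `sample_n_drops_fn` and `drop_indices_fn` callables, whose exact contracts are not documented here.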