ilm.datamodule_ilm

ILMEmptyDataset

Bases: IterableDataset

__init__(tokenizer, num_examples)

Parameters:

tokenizer_kwargs (required): Keyword arguments for the tokenizer.

empty_text (required): For MLM, you will want to set empty_text to a sequence of all mask tokens.

DefaultILMCollator

Bases: Collator

Used for pre-training.

ILMSeq2SeqCollator

Drops tokens from the suffix only.

ILMSeq2SeqPredCollator

Bases: ILMSeq2SeqCollator

Drops all suffix (target) tokens from the input and returns them in target_ids, which has shape (batch_size, target_seq_len).
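The split described above can be illustrated with a minimal sketch in plain Python. This is not the library's implementation: the function name split_targets_sketch, the field names prefix_ids and suffix_ids, and the use of right-padding with pad_token_id are all assumptions made for illustration.

```python
from typing import Dict, List


def split_targets_sketch(
    examples: List[Dict[str, List[int]]],
    pad_token_id: int,
) -> Dict[str, List[List[int]]]:
    """Hypothetical helper: keep only the prefix as model input and
    move every suffix token into a padded target_ids batch."""
    input_len = max(len(ex["prefix_ids"]) for ex in examples)
    target_len = max(len(ex["suffix_ids"]) for ex in examples)

    input_ids = []
    target_ids = []
    for ex in examples:
        prefix, suffix = ex["prefix_ids"], ex["suffix_ids"]
        # Right-padding is an assumption; the real collator may pad differently.
        input_ids.append(prefix + [pad_token_id] * (input_len - len(prefix)))
        target_ids.append(suffix + [pad_token_id] * (target_len - len(suffix)))

    # target_ids has shape (batch_size, target_seq_len) as nested lists.
    return {"input_ids": input_ids, "target_ids": target_ids}
```

With two examples whose suffixes have lengths 2 and 3, the resulting target_ids has 2 rows of length 3, matching the (batch_size, target_seq_len) shape stated above.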

ilm_drop_fn(segment_input_ids, bos_token_id, cls_token_id, global_offset=0, sample_n_drops_fn=_n_drop_uniformly, drop_indices_fn=_drop_uniformly)

Drops tokens from a single segment of a single sequence. Adds the BOS token and, when requested, the CLS token.
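As a rough illustration of the idea, the sketch below drops a uniformly sampled subset of tokens from one segment and adds a BOS token. It is a simplified stand-in, not ilm_drop_fn itself: the name drop_tokens_sketch, the return values, the placement of BOS at the front, and the reading of global_offset as a shift applied to the reported drop positions are assumptions, and CLS handling is left out.

```python
import random
from typing import List, Optional, Tuple


def drop_tokens_sketch(
    segment_input_ids: List[int],
    bos_token_id: int,
    global_offset: int = 0,
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], List[int], List[int]]:
    """Hypothetical sketch: drop a random subset of tokens from one segment."""
    rng = rng or random.Random()
    n = len(segment_input_ids)

    # Sample how many tokens to drop, uniformly in [0, n].
    n_drops = rng.randint(0, n)
    # Sample which positions to drop, uniformly without replacement.
    drop_indices = sorted(rng.sample(range(n), n_drops))
    drop_set = set(drop_indices)

    # Kept tokens, with BOS placed at the front (placement is an assumption).
    kept = [bos_token_id] + [
        tok for i, tok in enumerate(segment_input_ids) if i not in drop_set
    ]
    dropped = [segment_input_ids[i] for i in drop_indices]
    # Assumed meaning of global_offset: shift segment-local positions
    # into positions within the full sequence.
    positions = [global_offset + i for i in drop_indices]
    return kept, dropped, positions
```

Passing a seeded random.Random makes the sampled drops reproducible, which is convenient when checking a collator built on top of a function like this.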