
xlm.tasks.uniref50

UniRef50 (Hugging Face airkingbd/uniref50) preprocessing for protein LM training.

Each row provides a single-letter amino-acid string in column seq and a precomputed length. Tokenization should use an ESM-compatible tokenizer (e.g. facebook/esm2_t30_150M_UR50D) configured in the experiment's global_components.tokenizer.
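A minimal sketch of the row schema and of ESM-style tokenization. The toy character-level vocabulary below is a stand-in for illustration only; in practice the tokenizer is loaded from facebook/esm2_t30_150M_UR50D (e.g. via transformers.AutoTokenizer) and configured in global_components.tokenizer.

```python
# Expected row shape: a single-letter amino-acid string plus its length.
row = {"seq": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "length": 33}

# Hypothetical toy vocab mimicking how ESM tokenizers wrap a sequence
# in <cls> ... <eos> specials; NOT the real ESM-2 vocabulary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
vocab = {"<cls>": 0, "<eos>": 1}
vocab.update({aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)})

def encode(seq: str) -> list[int]:
    """Character-level encoding with CLS/EOS specials, ESM-style."""
    return [vocab["<cls>"]] + [vocab[aa] for aa in seq] + [vocab["<eos>"]]

ids = encode(row["seq"])
assert len(ids) == row["length"] + 2  # two special tokens added
```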

Very long chains are common; pass max_seq_len (typically block_size) via preprocess_function_kwargs in the dataset config to truncate after encoding so cached shards stay bounded.
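The key point above is that truncation happens after encoding, not on the raw string. A hedged sketch of that ordering, with a trivial stand-in encoder in place of the configured tokenizer (the real preprocess function's signature may differ):

```python
def encode(seq: str) -> list[int]:
    # Toy character-level stand-in for the configured ESM tokenizer.
    return [ord(c) for c in seq]

def preprocess(examples: dict, max_seq_len: int) -> dict:
    """Encode each `seq` fully, then clip the token ids to max_seq_len
    so cached shards stay bounded."""
    input_ids = [encode(s)[:max_seq_len] for s in examples["seq"]]
    return {"input_ids": input_ids}

batch = {"seq": ["MKTAYIAK" * 100]}  # an 800-residue chain
out = preprocess(batch, max_seq_len=128)
assert all(len(ids) <= 128 for ids in out["input_ids"])
```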

pack_sequences_fn(examples, tokenizer, block_size, drop_last=True, **kwargs)

DPLM-style random crop + EOS packing for UniRef50 protein sequences.

For each sequence in the batch, this function randomly crops to a contiguous window of block_size tokens if the sequence is longer (matching the subsampling logic in DPLM's UniRefHFDataset.__getitem__). The cropped sequences are then concatenated with EOS separators and chunked into blocks of exactly block_size tokens via xlm.datamodule.pack_sequences.

Used as on_the_fly_group_processor in the packed UniRef50 dataset config. tokenizer and block_size are injected automatically by DatasetManager; drop_last can be overridden via on_the_fly_group_processor_kwargs.
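A hypothetical shape of the relevant dataset-config fragment, written as a Python dict; key names beyond those mentioned in the text are assumptions, not the actual config schema.

```python
# Hypothetical config fragment: tokenizer and block_size are injected
# automatically by DatasetManager, so only extra kwargs such as
# drop_last belong in on_the_fly_group_processor_kwargs.
dataset_config = {
    "on_the_fly_group_processor": "xlm.tasks.uniref50.pack_sequences_fn",
    "on_the_fly_group_processor_kwargs": {"drop_last": False},
}
```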