xlm.tasks.uniref50
UniRef50 (Hugging Face airkingbd/uniref50) preprocessing for protein LM training.
Each row provides a single-letter amino-acid string in the seq column and a
precomputed length. Tokenization should use an ESM-compatible tokenizer
(e.g. facebook/esm2_t30_150M_UR50D) configured in the experiment's
global_components.tokenizer.
Very long chains are common; pass max_seq_len (typically block_size)
via preprocess_function_kwargs in the dataset config to truncate after
encoding so cached shards stay bounded.
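As a sketch of the truncate-after-encoding step: the function below assumes a Hugging Face-style tokenizer callable that returns a dict with an input_ids list; the name preprocess_function and the toy tokenizer are illustrative, not part of xlm.

```python
def preprocess_function(examples, tokenizer, max_seq_len=None, **kwargs):
    """Encode the raw amino-acid strings in `seq`, then truncate the
    token-id lists to max_seq_len so cached shards stay bounded."""
    encoded = tokenizer(examples["seq"])
    ids = encoded["input_ids"]
    if max_seq_len is not None:
        # Truncate AFTER encoding, as the dataset config expects.
        ids = [row[:max_seq_len] for row in ids]
    return {"input_ids": ids}


# Toy stand-in for an ESM tokenizer: one integer id per residue character.
def toy_tokenizer(seqs):
    return {"input_ids": [[ord(c) for c in s] for s in seqs]}
```

With this stand-in, a batch of two chains and max_seq_len=4 yields one untouched 3-token row and one row cut to 4 tokens.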
pack_sequences_fn(examples, tokenizer, block_size, drop_last=True, **kwargs)
DPLM-style random crop + EOS packing for UniRef50 protein sequences.
Crops each sequence in the batch to a random block_size-length window if it
is longer (matching the subsampling logic in DPLM's UniRefHFDataset.__getitem__).
The cropped sequences are then concatenated with EOS separators and chunked
into blocks of exactly block_size via xlm.datamodule.pack_sequences.
Used as on_the_fly_group_processor in the packed UniRef50 dataset config.
tokenizer and block_size are injected automatically by
DatasetManager; drop_last can be overridden via
on_the_fly_group_processor_kwargs.
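To make the crop-then-pack behavior concrete, here is a self-contained sketch. It inlines the concatenate-and-chunk step rather than calling the real xlm.datamodule.pack_sequences, and only assumes the tokenizer exposes eos_token_id; treat it as an approximation of the actual processor, not its implementation.

```python
import random


def pack_sequences_fn(examples, tokenizer, block_size, drop_last=True, **kwargs):
    """DPLM-style random crop + EOS packing (illustrative sketch)."""
    eos = tokenizer.eos_token_id

    cropped = []
    for ids in examples["input_ids"]:
        if len(ids) > block_size:
            # Random crop, matching DPLM's UniRefHFDataset subsampling.
            start = random.randint(0, len(ids) - block_size)
            ids = ids[start : start + block_size]
        cropped.append(ids)

    # Concatenate with EOS separators, then chunk into block_size pieces
    # (what xlm.datamodule.pack_sequences does in the real pipeline).
    flat = []
    for ids in cropped:
        flat.extend(ids)
        flat.append(eos)
    blocks = [flat[i : i + block_size] for i in range(0, len(flat), block_size)]
    if drop_last and blocks and len(blocks[-1]) < block_size:
        blocks.pop()
    return {"input_ids": blocks}
```

Every returned block has exactly block_size tokens; with drop_last=True the trailing partial chunk is discarded, which is why packed UniRef50 shards have uniform row lengths.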