
xlm.utils.nn

tiny_value_of_dtype(dtype)

Returns a moderately tiny value for a given PyTorch data type, used to avoid numerical issues such as division by zero. This differs from info_value_of_dtype(dtype).tiny, which causes some NaN bugs. Only supports floating-point dtypes.

masked_mean(vector, mask, dim, keepdim=False)

Calculates the mean along a given dimension, ignoring masked values.

Parameters

vector : torch.Tensor
    The vector to calculate the mean over.

mask : torch.BoolTensor
    The mask of the vector. It must be broadcastable with vector, with 1 for non-masked values and 0 for masked values.

dim : int
    The dimension to calculate the mean along.

keepdim : bool
    Whether to keep the reduced dimension.

Returns

torch.Tensor
    A torch.Tensor containing the mean values.
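A minimal sketch of how masked_mean could be implemented; torch.finfo(...).eps stands in here for tiny_value_of_dtype as the division-by-zero guard, and the actual implementation may differ:

```python
import torch

def masked_mean(vector, mask, dim, keepdim=False):
    # Zero out masked positions, sum, then divide by the count of
    # non-masked positions; eps guards against division by zero.
    eps = torch.finfo(vector.dtype).eps  # stand-in for tiny_value_of_dtype
    total = (vector * mask).sum(dim=dim, keepdim=keepdim)
    count = mask.sum(dim=dim, keepdim=keepdim).to(vector.dtype)
    return total / count.clamp(min=eps)

v = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
m = torch.tensor([[True, True, False], [True, False, False]])
print(masked_mean(v, m, dim=-1))  # tensor([1.5000, 4.0000])
```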

masked_sum(vector, mask, dim, keepdim=False)

Calculates the sum along a given dimension, ignoring masked values.

Parameters

vector : torch.Tensor
    The vector to calculate the sum over.

mask : torch.BoolTensor
    The mask of the vector. It must be broadcastable with vector, with 1 for non-masked values and 0 for masked-out values.

dim : int
    The dimension to calculate the sum along.

keepdim : bool
    Whether to keep the reduced dimension.

Returns

torch.Tensor
    A torch.Tensor containing the sum values.

get_mask_from_sequence_lengths(sequence_lengths, max_length)

Given a variable of shape (batch_size,) that represents the sequence lengths of each batch element, this function returns a (batch_size, max_length) mask variable. For example, if our input was [2, 2, 3], with a max_length of 4, we'd return [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0]].

We require max_length here instead of just computing it from the input sequence_lengths because it lets us avoid finding the max, then copying that value from the GPU to the CPU so that we can use it to construct a new tensor.
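A sketch of an equivalent implementation of the comparison described above (the library's actual body may differ):

```python
import torch

def get_mask_from_sequence_lengths(sequence_lengths, max_length):
    # Position i in a row is valid (1) iff i < that row's sequence length.
    positions = torch.arange(max_length, device=sequence_lengths.device)
    return positions.unsqueeze(0) < sequence_lengths.unsqueeze(1)

mask = get_mask_from_sequence_lengths(torch.tensor([2, 2, 3]), 4)
print(mask.long())
# tensor([[1, 1, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0]])
```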

dtype(string)

Convert a string to a PyTorch data type.

Parameters

string : str
    The string to convert.

Returns

torch.dtype
    The PyTorch data type.

sample_categorical(probs)

This is needed because torch.multinomial does not accept 3D input and cannot handle unnormalized probabilities.

So we implement the "exponential race" method manually, which handles any number of leading dimensions and unnormalized probabilities (not logits).

Note: This is not differentiable; use Gumbel-Softmax if you need gradients.

Args:
    probs: (batch, seq_len, vocab_size); can have any number of leading dimensions. Must be positive; may be unnormalized.
Returns:
    (batch, seq_len)
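A sketch of the exponential race method: dividing each unnormalized probability by independent Exponential(1) noise and taking the argmax yields a sample from the normalized distribution. This is an assumed implementation, not necessarily the library's exact body:

```python
import torch

def sample_categorical(probs):
    # Exponential race: the index with the largest probs_i / E_i,
    # E_i ~ Exponential(1) i.i.d., is distributed according to
    # probs / probs.sum(-1). Works for any number of leading dims.
    exp_noise = torch.empty_like(probs).exponential_(1.0)
    return (probs / exp_noise).argmax(dim=-1)

# Unnormalized, with all mass on one token per row, so the output
# is deterministic here.
probs = torch.tensor([[0.0, 0.0, 5.0],
                      [2.0, 0.0, 0.0]])
print(sample_categorical(probs))  # tensor([2, 0])
```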

sample_from_logits(logits, temperature=1.0, noise_scale=1.0)

Sample from logits using the Gumbel-Max trick. Similar to sample_categorical, but works with logits (real-valued).

Args:
    logits: (batch, seq_len, vocab_size); can have any number of leading dimensions.
Returns:
    (batch, seq_len)

sample_from_top_k(k, logits)

Sample from the top-k logits using the Gumbel-Max trick.

Args:
    logits: (batch, seq_len, vocab_size); can have any number of leading dimensions.
    k: The number of top logits to consider for sampling.
Returns:
    (batch, seq_len)
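A hedged sketch of top-k Gumbel-Max sampling, keeping the documented argument order sample_from_top_k(k, logits); the body is an assumption:

```python
import torch

def sample_from_top_k(k, logits):
    # Mask everything below the k-th largest logit to -inf, then add
    # Gumbel(0, 1) noise and take the argmax over the survivors.
    kth_largest = logits.topk(k, dim=-1).values[..., -1:]
    masked = logits.masked_fill(logits < kth_largest, float("-inf"))
    uniform = torch.rand_like(logits).clamp_(min=1e-20)
    gumbel = -torch.log(-torch.log(uniform))
    return (masked + gumbel).argmax(dim=-1)

# With k=1 only the single largest logit survives, so the draw is deterministic.
logits = torch.tensor([[10.0, 0.0, -10.0, -10.0]])
print(sample_from_top_k(1, logits))  # tensor([0])
```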

sample_from_top_p(p, logits)

Sample from the top-p logits using the Gumbel-Max trick.

Parameters:

p : float
    The cumulative probability threshold. Must be between 0 and 1. Required.
logits : Tensor
    A tensor of shape (*batch, seq_len, vocab_size) representing the unnormalized log probabilities for each token. Required.

Returns:

Tensor
    A tensor of shape (*batch, seq_len) containing the sampled token indices.
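A sketch of nucleus (top-p) filtering followed by Gumbel-Max sampling, assuming the standard formulation; the library's actual body may differ:

```python
import torch

def sample_from_top_p(p, logits):
    # Keep the smallest set of highest-probability tokens whose cumulative
    # probability reaches p; mask the rest to -inf, then Gumbel-Max sample.
    sorted_logits, sorted_idx = logits.sort(dim=-1, descending=True)
    probs = sorted_logits.softmax(dim=-1)
    # Cumulative probability *before* each token: a token is removed only
    # if the tokens ranked above it already cover p, so the first token
    # crossing the threshold is still kept.
    cum_before = probs.cumsum(dim=-1) - probs
    masked_sorted = sorted_logits.masked_fill(cum_before > p, float("-inf"))
    masked = torch.full_like(logits, float("-inf")).scatter(
        -1, sorted_idx, masked_sorted)
    uniform = torch.rand_like(logits).clamp_(min=1e-20)
    gumbel = -torch.log(-torch.log(uniform))
    return (masked + gumbel).argmax(dim=-1)

# Token 0 holds ~95% of the mass, so with p=0.5 it is the only survivor.
logits = torch.tensor([[4.0, 0.0, 0.0, 0.0]])
print(sample_from_top_p(0.5, logits))  # tensor([0])
```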

add_gumbel_noise(logits, temperature=1.0, noise_scale=1.0)

Add Gumbel noise to logits; taking the argmax of the result yields a sample from the corresponding softmax distribution.

Args:
    logits: (batch, seq_len, vocab_size); can have any number of leading dimensions. Assumed to be the log of exponentiated scores, i.e. logits are $l_i$ in $p_i = \exp(l_i) / \sum_j \exp(l_j)$.
Returns:
    (batch, seq_len)
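A sketch of the Gumbel-Max trick it describes: Gumbel(0, 1) noise generated by inverse-CDF sampling, -log(-log(U)), added to temperature-scaled logits:

```python
import torch

def add_gumbel_noise(logits, temperature=1.0, noise_scale=1.0):
    # Gumbel(0, 1) noise via inverse CDF: g = -log(-log(U)), U ~ Uniform(0, 1).
    # argmax(logits / temperature + g) ~ Categorical(softmax(logits / temperature)).
    uniform = torch.rand_like(logits).clamp_(min=1e-20)  # guard against log(0)
    gumbel = -torch.log(-torch.log(uniform))
    return logits / temperature + noise_scale * gumbel

# A huge logit gap swamps the noise, so the argmax is effectively deterministic.
logits = torch.tensor([[100.0, -100.0, -100.0]])
print(add_gumbel_noise(logits).argmax(dim=-1))  # tensor([0])
```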

add_exp_1_noise(probs, temperature=1.0)

Sample from unnormalized probabilities using the exponential race method. Similar to the Gumbel noise trick, but requires probs to be positive (they can be unnormalized). You can generate samples without repetition from the output by taking argmax, topk, etc.

Args:
    probs: (batch, seq_len, vocab_size); can have any number of leading dimensions.
Returns:
    (batch, seq_len)

select_random_indices(inp_shape, num_unmask, select_from_mask=None, selection_score=None, temperature=1.0, selection_mode='greedy', score_mode='logits')

Select random indices from the last dimension using the selection_score.

1. If selection_score is None, it is assumed to be uniform.
2. If score_mode="logits" and selection_mode="sample", temperature can be used to control the temperature of the distribution.
3. If select_from_mask is provided, indices are sampled only from those positions.

Args:
    inp_shape: torch.Size, typically (batch, d)
    num_unmask: (batch,) int tensor
    select_from_mask: (batch, d) tensor; if provided, we only sample from the selected positions
    selection_score: logit-like score for selection (can be negative). Should match inp_shape, so typically (batch, d)
    score_mode: "logits" => $p_i = \exp(s_i) / \sum_j \exp(s_j)$; "uprobs" => $p_i = s_i / \sum_j s_j$