xlm.utils.nn
tiny_value_of_dtype(dtype)
Returns a moderately tiny value for a given PyTorch data type that is used to avoid numerical
issues such as division by zero.
This is different from info_value_of_dtype(dtype).tiny, which is small enough to cause NaN bugs of its own.
Only supports floating point dtypes.
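A minimal sketch of what such a helper might look like (the exact per-dtype constants here are assumptions, not the library's values):

```python
import torch

def tiny_value_of_dtype(dtype: torch.dtype) -> float:
    # Hypothetical sketch: a small-but-not-too-small constant per
    # floating point dtype; the specific constants are assumptions.
    if not dtype.is_floating_point:
        raise ValueError("Only supports floating point dtypes.")
    if dtype in (torch.float, torch.double):
        return 1e-13
    if dtype == torch.half:
        return 1e-4
    raise ValueError(f"Does not support dtype {dtype}")

# Typical use: keep a denominator away from exact zero.
x = torch.tensor([0.0, 0.0])
safe = x / (x.sum() + tiny_value_of_dtype(x.dtype))  # no NaN even though x.sum() == 0
```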
masked_mean(vector, mask, dim, keepdim=False)
Calculate the mean along a given dimension, considering only non-masked values.
Parameters
vector : torch.Tensor
The vector to calculate the mean of.
mask : torch.BoolTensor
The mask of the vector. It must be broadcastable with vector.
It must be 1 for non-masked values and 0 for masked values.
dim : int
The dimension along which to calculate the mean.
keepdim : bool
Whether to keep the reduced dimension.
Returns
torch.Tensor
A torch.Tensor containing the mean values.
masked_sum(vector, mask, dim, keepdim=False)
Calculate the sum along a given dimension, considering only non-masked values.
Parameters
vector : torch.Tensor
The vector to calculate the sum of.
mask : torch.BoolTensor
The mask of the vector. It must be broadcastable with vector. It must be 1 for non-masked values and 0 for masked-out values.
dim : int
The dimension along which to calculate the sum.
keepdim : bool
Whether to keep the reduced dimension.
Returns
torch.Tensor
A torch.Tensor containing the sum values.
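Both masked helpers can be sketched as follows (a minimal reimplementation; the library version may differ, e.g. in how it guards the division by zero):

```python
import torch

def masked_sum(vector, mask, dim, keepdim=False):
    # Zero out masked positions, then sum along dim.
    return (vector * mask).sum(dim=dim, keepdim=keepdim)

def masked_mean(vector, mask, dim, keepdim=False):
    # Sum of kept values divided by the count of kept values;
    # the clamp avoids division by zero on fully masked rows.
    value_sum = (vector * mask).sum(dim=dim, keepdim=keepdim)
    value_count = mask.sum(dim=dim, keepdim=keepdim).clamp(min=1)
    return value_sum / value_count

v = torch.tensor([[1.0, 2.0, 3.0, 4.0]])
m = torch.tensor([[True, True, False, False]])
masked_mean(v, m, dim=-1)  # tensor([1.5000])
masked_sum(v, m, dim=-1)   # tensor([3.])
```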
get_mask_from_sequence_lengths(sequence_lengths, max_length)
Given a variable of shape (batch_size,) that represents the sequence lengths of each batch
element, this function returns a (batch_size, max_length) mask variable. For example, if
our input was [2, 2, 3], with a max_length of 4, we'd return
[[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0]].
We require max_length here instead of just computing it from the input sequence_lengths
because it lets us avoid finding the max, then copying that value from the GPU to the CPU so
that we can use it to construct a new tensor.
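One way to implement this without any device-to-host copy is to build the range tensor on the same device as the lengths (a sketch; the library's actual implementation may differ):

```python
import torch

def get_mask_from_sequence_lengths(sequence_lengths, max_length):
    # Compare each length against [1, 2, ..., max_length] built on the
    # same device, so no GPU-to-CPU copy of a max value is needed.
    ones = sequence_lengths.new_ones(sequence_lengths.size(0), max_length)
    range_tensor = ones.cumsum(dim=1)
    return sequence_lengths.unsqueeze(1) >= range_tensor

mask = get_mask_from_sequence_lengths(torch.tensor([2, 2, 3]), 4)
# tensor([[ True,  True, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False]])
```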
dtype(string)
Convert a string to a PyTorch data type.
Parameters
string : str
The string to convert.
Returns
torch.dtype
The PyTorch data type.
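A plausible sketch of the conversion (the lookup strategy here is an assumption):

```python
import torch

def dtype(string: str) -> torch.dtype:
    # Hypothetical sketch: resolve the name as an attribute of the torch module.
    dt = getattr(torch, string, None)
    if not isinstance(dt, torch.dtype):
        raise ValueError(f"{string!r} is not a PyTorch dtype")
    return dt

dtype("float16")  # torch.float16
```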
sample_categorical(probs)
torch.multinomial does not accept 3D input and cannot handle unnormalized probabilities, so this implements the "exponential race" method manually, which handles any number of leading dimensions and unnormalized probabilities (not logits).
Note: this is not differentiable; use Gumbel-Softmax if you need gradients.
Parameters
probs : torch.Tensor
(batch, seq_len, vocab_size); can have any number of leading dimensions. Values must be positive and may be unnormalized.
Returns
torch.Tensor
(batch, seq_len)
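The exponential race method can be sketched in a few lines (a minimal reimplementation, not the library's code):

```python
import torch

def sample_categorical(probs):
    # Exponential race: i* = argmin_i e_i / p_i with e_i ~ Exp(1) is
    # distributed as Categorical(p / p.sum()); equivalently argmax p_i / e_i.
    e = torch.empty_like(probs).exponential_(1.0)
    return torch.argmax(probs / e, dim=-1)

probs = torch.tensor([[0.0, 0.0, 3.0]])  # unnormalized is fine
sample_categorical(probs)  # tensor([2]) -- all mass is on index 2
```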
sample_from_logits(logits, temperature=1.0, noise_scale=1.0)
Sample from logits using the Gumbel-Max trick. Similar to sample_categorical, but works with real-valued logits.
Parameters
logits : torch.Tensor
(batch, seq_len, vocab_size); can have any number of leading dimensions.
Returns
torch.Tensor
(batch, seq_len)
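A sketch of the Gumbel-Max trick; exactly how temperature and noise_scale enter is an assumption of this sketch:

```python
import torch

def sample_from_logits(logits, temperature=1.0, noise_scale=1.0):
    # Gumbel-Max: argmax(logits + g), g ~ Gumbel(0, 1), samples from softmax(logits).
    # Temperature dividing the logits and noise_scale scaling the noise are assumptions.
    u = torch.rand_like(logits).clamp_(min=1e-20)  # avoid log(0)
    gumbel = -torch.log(-torch.log(u))
    return torch.argmax(logits / temperature + noise_scale * gumbel, dim=-1)

logits = torch.tensor([[100.0, 0.0, 0.0]])
sample_from_logits(logits)  # tensor([0]) with overwhelming probability
```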
sample_from_top_k(k, logits)
Sample from the top-k logits using the Gumbel-Max trick.
Parameters
k : int
The number of top logits to consider for sampling.
logits : torch.Tensor
(batch, seq_len, vocab_size); can have any number of leading dimensions.
Returns
torch.Tensor
(batch, seq_len)
sample_from_top_p(p, logits)
Sample from the top-p logits using the Gumbel-Max trick.
Parameters
p : float
The cumulative probability threshold. Must be between 0 and 1.
logits : torch.Tensor
A tensor of shape (*batch, seq_len, vocab_size) representing the unnormalized log probabilities for each token.
Returns
torch.Tensor
A tensor of shape (*batch, seq_len) containing the sampled token indices.
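Nucleus (top-p) filtering combined with Gumbel-Max sampling can be sketched as follows (a standard formulation, assumed rather than taken from the library; the top token is always kept):

```python
import torch

def sample_from_top_p(p, logits):
    # Sort descending, keep the smallest prefix whose cumulative softmax
    # mass exceeds p (always keeping the top token), then Gumbel-Max sample.
    sorted_logits, sorted_idx = logits.sort(dim=-1, descending=True)
    cum_probs = sorted_logits.softmax(dim=-1).cumsum(dim=-1)
    remove = cum_probs > p
    remove[..., 1:] = remove[..., :-1].clone()  # shift: keep the token that crosses p
    remove[..., 0] = False
    masked = sorted_logits.masked_fill(remove, float("-inf"))
    u = torch.rand_like(masked).clamp_(min=1e-20)
    pick = torch.argmax(masked - torch.log(-torch.log(u)), dim=-1, keepdim=True)
    return sorted_idx.gather(-1, pick).squeeze(-1)

logits = torch.tensor([[2.0, 5.0, 0.0]])
sample_from_top_p(0.5, logits)  # tensor([1]) -- the nucleus is just token 1
```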
add_gumbel_noise(logits, temperature=1.0, noise_scale=1.0)
Add Gumbel noise to logits so that taking the argmax of the result yields samples from the distribution.
Parameters
logits : torch.Tensor
(batch, seq_len, vocab_size); can have any number of leading dimensions. Assumed to be the log of exponentiated scores, i.e. the logits are $l_i$ in $p_i = \exp(l_i) / \sum_j \exp(l_j)$.
Returns
torch.Tensor
The noisy logits, with the same shape as logits.
add_exp_1_noise(probs, temperature=1.0)
Add Exp(1) race noise to unnormalized probabilities. Similar to the Gumbel noise trick, but requires the probs to be positive (they may be unnormalized). Samples without repetition can be drawn from the output by taking argmax, topk, etc.
Parameters
probs : torch.Tensor
(batch, seq_len, vocab_size); can have any number of leading dimensions. Values must be positive and may be unnormalized.
Returns
torch.Tensor
The noisy scores, with the same shape as probs.
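A sketch of the race-noise idea, including the topk-for-distinct-samples use; applying temperature as an exponent on probs is an assumption of this sketch:

```python
import torch

def add_exp_1_noise(probs, temperature=1.0):
    # Race scores p_i / e_i with e_i ~ Exp(1): argmax samples one token,
    # topk samples several distinct tokens (Gumbel-top-k in disguise).
    # The temperature-as-exponent handling is an assumption.
    e = torch.empty_like(probs).exponential_(1.0)
    return probs.pow(1.0 / temperature) / e

probs = torch.tensor([[0.0, 2.0, 0.0, 1.0]])
scores = add_exp_1_noise(probs)
scores.topk(2).indices  # two distinct samples; here always indices 1 and 3
```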
select_random_indices(inp_shape, num_unmask, select_from_mask=None, selection_score=None, temperature=1.0, selection_mode='greedy', score_mode='logits')
Select random indices from the last dimension using selection_score.
1. If selection_score is None, it is assumed to be uniform.
2. If score_mode='logits' and selection_mode='sample', temperature can be used to control the temperature of the distribution.
3. If select_from_mask is provided, indices are sampled only from those positions.
Parameters
inp_shape : torch.Size
Typically (batch, d).
num_unmask : torch.Tensor
(batch,) int tensor; the number of indices to select per row.
select_from_mask : torch.Tensor
(batch, d); if provided, we only sample from the selected positions.
selection_score : torch.Tensor
Logit-like score for selection (can be negative). Should match inp_shape, so typically (batch, d).
score_mode : str
"logits" => $p_i = \exp(s_i) / \sum_j \exp(s_j)$; "uprobs" => $p_i = s_i / \sum_j s_j$.
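A heavily hedged sketch of how such a selector might work: returning a boolean mask, the uniform-score fallback, and several of the mode semantics are assumptions, not the library's documented behavior.

```python
import torch

def select_random_indices(inp_shape, num_unmask, select_from_mask=None,
                          selection_score=None, temperature=1.0,
                          selection_mode="greedy", score_mode="logits"):
    # Hypothetical sketch: rank positions by a (possibly noised) score and
    # keep the top num_unmask[b] positions per row as a boolean mask.
    batch, d = inp_shape
    if selection_score is None:
        score = torch.zeros(batch, d)   # uniform score
        selection_mode = "sample"       # greedy on a constant would be degenerate
    elif score_mode == "uprobs":
        score = selection_score.clamp(min=1e-13).log()  # convert to logits
    else:
        score = selection_score.clone().float()
    if selection_mode == "sample":
        u = torch.rand(batch, d).clamp_(min=1e-20)
        score = score / temperature - torch.log(-torch.log(u))  # Gumbel-top-k
    if select_from_mask is not None:
        score = score.masked_fill(~select_from_mask.bool(), float("-inf"))
    ranks = score.argsort(dim=-1, descending=True).argsort(dim=-1)
    return ranks < num_unmask.unsqueeze(-1)

mask = select_random_indices(torch.Size([1, 4]), torch.tensor([2]),
                             select_from_mask=torch.tensor([[1, 1, 0, 1]]))
# boolean (1, 4) mask with exactly two True entries among positions 0, 1, 3
```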
Select random indices from the last dimension using the selection_score. 1. If selection score is None then it is assumed to be uniform. 2. If score_mode = logits and selection_mode=sample, then temperature can be used to control the temperature of the distribution. 3. If select_from_mask is provided, indices only from these positions are sampled. Args: inp_shape: torch.Size, typeically (batch, d) num_unmask: (batch,) int tensor select_from_mask: (batch, d) tensor, if provided, we only sample from the selected selection_score: logit-like score for selection (can be negative). Should match inp_shape, so typically (batch, d) score_mode: "logits" => p_i = \exp(s_i)/\sum_j \exp(s_j) "uprobs" => p_i = s_i / \sum_j s_j