xlm.modules.gpt2_transformer
Adapted from nanoGPT (Andrej Karpathy's rewrite of minGPT): https://github.com/karpathy/nanoGPT/blob/master/model.py
LayerNorm
Bases: Module
LayerNorm with an optional bias. PyTorch's nn.LayerNorm does not support simply passing bias=False.
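A minimal sketch of such a module, following the nanoGPT original this file is adapted from; the exact signature used here may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerNorm(nn.Module):
    """LayerNorm with an optional bias."""

    def __init__(self, ndim: int, bias: bool):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        # When bias=False, pass None through to F.layer_norm below.
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.layer_norm(x, self.weight.shape, self.weight, self.bias, 1e-5)
```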
GPT
Bases: Module
get_num_params(non_embedding=True)
Return the number of parameters in the model. For the non-embedding count (the default), the position embeddings are subtracted. The token embeddings would be subtracted too, but due to parameter sharing these weights are reused in the final layer, so they are included.
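A sketch of how this count is typically computed; the attribute name self.transformer.wpe follows the nanoGPT original and is an assumption about this module's internals:

```python
def get_num_params(self, non_embedding: bool = True) -> int:
    # Count every parameter in the model.
    n_params = sum(p.numel() for p in self.parameters())
    if non_embedding:
        # Subtract the position embeddings. Token embeddings stay in the
        # count because weight tying reuses them as the lm_head weights.
        n_params -= self.transformer.wpe.weight.numel()
    return n_params
```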
estimate_mfu(fwdbwd_per_iter, dt)
Estimate model FLOPs utilization (MFU) as a fraction of A100 bfloat16 peak FLOPS, given the number of forward-backward passes per iteration and the measured iteration time dt in seconds.
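A sketch of the estimate as done in nanoGPT, using the FLOPs accounting from the PaLM paper (Appendix B); it assumes a config object exposing n_layer, n_head, n_embd, and block_size:

```python
def estimate_mfu(self, fwdbwd_per_iter: int, dt: float) -> float:
    # FLOPs per token: 6N for the dense matmuls plus the attention term.
    N = self.get_num_params()
    cfg = self.config
    L, H, Q, T = cfg.n_layer, cfg.n_head, cfg.n_embd // cfg.n_head, cfg.block_size
    flops_per_token = 6 * N + 12 * L * H * Q * T
    flops_per_fwdbwd = flops_per_token * T
    flops_per_iter = flops_per_fwdbwd * fwdbwd_per_iter
    # Express achieved throughput as a fraction of A100 bf16 peak (312 TFLOPS).
    flops_achieved = flops_per_iter / dt
    flops_promised = 312e12
    return flops_achieved / flops_promised
```

An MFU near 1.0 would mean the hardware is running at its advertised peak; typical training runs land well below that.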