nn_ft_transformer_block
Description

A transformer block consisting of a multi-head self-attention mechanism followed by a feed-forward network. This is used in LearnerTorchFTTransformer.
Usage

nn_ft_transformer_block(
  d_token,
  attention_n_heads,
  attention_dropout,
  attention_initialization,
  ffn_d_hidden = NULL,
  ffn_d_hidden_multiplier = NULL,
  ffn_dropout,
  ffn_activation,
  residual_dropout,
  prenormalization,
  is_first_layer,
  attention_normalization,
  ffn_normalization,
  query_idx = NULL,
  attention_bias,
  ffn_bias_first,
  ffn_bias_second
)
Arguments

d_token
(integer(1)) Dimension of the token embeddings.

attention_n_heads
(integer(1)) Number of attention heads.

attention_dropout
(numeric(1)) Dropout probability in the self-attention mechanism.

attention_initialization
(character(1)) Initialization scheme for the attention weights, e.g. "kaiming" or "xavier".

ffn_d_hidden
(integer(1)) Hidden dimension of the feed-forward network.

ffn_d_hidden_multiplier
(numeric(1)) Alternative to ffn_d_hidden: the hidden dimension is set to d_token * ffn_d_hidden_multiplier.

ffn_dropout
(numeric(1)) Dropout probability in the feed-forward network.

ffn_activation
Activation used in the feed-forward network.

residual_dropout
(numeric(1)) Dropout probability applied to the residual connections.

prenormalization
(logical(1)) Whether normalization is applied before (TRUE) or after (FALSE) the attention and feed-forward sublayers.

is_first_layer
(logical(1)) Whether this block is the first layer of the transformer.

attention_normalization
Normalization applied in the attention sublayer.

ffn_normalization
Normalization applied in the feed-forward sublayer.

query_idx
(integer() or NULL) Indices of the tokens used as queries in the attention mechanism; if NULL, all tokens are used.

attention_bias
(logical(1)) Whether the attention's linear layers use a bias term.

ffn_bias_first
(logical(1)) Whether the first linear layer of the feed-forward network uses a bias term.

ffn_bias_second
(logical(1)) Whether the second linear layer of the feed-forward network uses a bias term.
References

Devlin J, Chang M, Lee K, Toutanova K (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv preprint arXiv:1810.04805.

Gorishniy Y, Rubachev I, Khrulkov V, Babenko A (2021). "Revisiting Deep Learning Models for Tabular Data." arXiv preprint arXiv:2106.11959.
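The help page itself ships no example, so the following is a minimal, illustrative sketch rather than canonical usage. It assumes the torch and mlr3torch packages are installed and attached, that the activation and normalization arguments accept torch module generators such as nn_relu and nn_layer_norm, and that the block's forward pass expects a tensor of shape (batch, n_tokens, d_token); all argument values below are placeholders chosen for illustration.

library(torch)
library(mlr3torch)

# construct a single FT-Transformer block (values are illustrative only)
block <- nn_ft_transformer_block(
  d_token                  = 32,
  attention_n_heads        = 4,
  attention_dropout        = 0.1,
  attention_initialization = "kaiming",
  ffn_d_hidden             = 64,
  ffn_dropout              = 0.1,
  ffn_activation           = nn_relu,       # assumption: a module generator is accepted here
  residual_dropout         = 0.0,
  prenormalization         = TRUE,
  is_first_layer           = TRUE,
  attention_normalization  = nn_layer_norm, # assumption: a module generator is accepted here
  ffn_normalization        = nn_layer_norm,
  query_idx                = NULL,
  attention_bias           = TRUE,
  ffn_bias_first           = TRUE,
  ffn_bias_second          = TRUE
)

# assumed input shape: (batch, n_tokens, d_token)
x <- torch_randn(16, 10, 32)
out <- block(x)
out$shape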