View source: R/nnfactivation.R
Allows the model to jointly attend to information from different representation subspaces. See reference: Attention Is All You Need
nnf_multi_head_attention_forward(
query,
key,
value,
embed_dim_to_check,
num_heads,
in_proj_weight,
in_proj_bias,
bias_k,
bias_v,
add_zero_attn,
dropout_p,
out_proj_weight,
out_proj_bias,
training = TRUE,
key_padding_mask = NULL,
need_weights = TRUE,
attn_mask = NULL,
avg_weights = TRUE,
use_separate_proj_weight = FALSE,
q_proj_weight = NULL,
k_proj_weight = NULL,
v_proj_weight = NULL,
static_k = NULL,
static_v = NULL
)

query 
(L, N, E) where L is the target sequence length, N is the batch size, E is the embedding dimension. 
key 
(S, N, E), where S is the source sequence length, N is the batch size, E is the embedding dimension. 
value 
(S, N, E) where S is the source sequence length, N is the batch size, E is the embedding dimension. 
embed_dim_to_check 
total dimension of the model. 
num_heads 
number of parallel attention heads. 
in_proj_weight 
input projection weight and bias. 
in_proj_bias 
input projection bias, documented together with in_proj_weight above. 
bias_k 
bias of the key and value sequences to be added at dim=0. 
bias_v 
bias of the value sequence, documented together with bias_k above. 
add_zero_attn 
add a new batch of zeros to the key and value sequences at dim=1. 
dropout_p 
probability of an element to be zeroed. 
out_proj_weight 
the output projection weight and bias. 
out_proj_bias 
output projection bias, documented together with out_proj_weight above. 
training 
apply dropout if TRUE. 
key_padding_mask 
(N, S) where N is the batch size, S is the source sequence length.
If a ByteTensor is provided, the nonzero positions will be ignored while the
zero positions will be unchanged. If a BoolTensor is provided, the positions with the
value of TRUE will be ignored while the positions with the value of FALSE will be unchanged. 
need_weights 
Logical; whether to output attn_output_weights. 
attn_mask 
2D mask (L, S) where L is the target sequence length, S is the source sequence length.
3D mask (N*num_heads, L, S) where N is the batch size, L is the target sequence length,
S is the source sequence length. attn_mask ensures that position i is allowed to attend only to the unmasked
positions. If a ByteTensor is provided, the nonzero positions are not allowed to attend
while the zero positions will be unchanged. If a BoolTensor is provided, positions with
TRUE are not allowed to attend while FALSE values will be unchanged. If a FloatTensor is provided, it will be added to the attention weight. 
avg_weights 
Logical; whether to average attn_output_weights over the attention heads before outputting them. This doesn't change the returned value of attn_output; it only affects the returned attention weight matrix. 
use_separate_proj_weight 
Logical; whether the function accepts the projection weights for query, key, and value as separate tensors. If FALSE, in_proj_weight is used, which is a combination of q_proj_weight, k_proj_weight, and v_proj_weight. 
q_proj_weight 
input projection weight and bias. 
k_proj_weight 
input projection weight for key, used when use_separate_proj_weight is TRUE. 
v_proj_weight 
input projection weight for value, used when use_separate_proj_weight is TRUE. 
static_k 
static key and value used for attention operators. 
static_v 
static value used for attention operators, documented together with static_k above. 
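
The shapes of the two mask arguments, key_padding_mask and attn_mask, are easier to see on small concrete values. The following is a minimal base R sketch, not part of the torch package: it builds a logical key_padding_mask of shape (N, S) from per-sequence lengths and a causal attn_mask of shape (L, S), using plain matrices for illustration. The names seq_lengths, causal_mask and additive_mask are hypothetical, the padding is assumed to sit at the end of each sequence, and the additive form assumes a float mask is added to the attention scores as in PyTorch. The matrices would still need converting to tensors (for example with torch_tensor()) before being passed to the function.

```r
# Illustrative sizes (hypothetical values).
N <- 2L  # batch size
S <- 4L  # source sequence length
L <- 4L  # target sequence length

# key_padding_mask, shape (N, S): TRUE marks padded key positions to ignore.
# seq_lengths is a made-up example; padding is assumed to be at the end.
seq_lengths <- c(4L, 2L)
key_padding_mask <- outer(seq_lengths, seq_len(S), function(len, pos) pos > len)

# Causal attn_mask, shape (L, S): TRUE above the diagonal means query
# position i may not attend to key positions j > i.
causal_mask <- upper.tri(matrix(FALSE, nrow = L, ncol = S))

# Assumed additive (float) form of the same mask: blocked positions
# contribute -Inf to the attention scores, allowed positions contribute 0.
additive_mask <- ifelse(causal_mask, -Inf, 0)

print(key_padding_mask)
print(causal_mask)
```

In this sketch the second sequence has only two real tokens, so its last two key positions are marked TRUE and would be ignored by every query in that batch element.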

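
As a rough illustration of how in_proj_weight relates to q_proj_weight, k_proj_weight and v_proj_weight, the sketch below uses plain base R matrices rather than tensors. It assumes the packing convention used by PyTorch, where the three (E, E) matrices are stacked row-wise in the order query, key, value to give a (3E, E) matrix; the random values and the helper names q_rows, k_rows, v_rows, packed and separate are hypothetical.

```r
set.seed(1)
E <- 4L  # embedding dimension (illustrative)

# Separate projection weights, each of shape (E, E).
q_proj_weight <- matrix(rnorm(E * E), nrow = E)
k_proj_weight <- matrix(rnorm(E * E), nrow = E)
v_proj_weight <- matrix(rnorm(E * E), nrow = E)

# Assumed packing: stack row-wise in the order query, key, value.
in_proj_weight <- rbind(q_proj_weight, k_proj_weight, v_proj_weight)
print(dim(in_proj_weight))  # 3E rows, E columns

# Row ranges that recover each block from the packed matrix.
q_rows <- seq_len(E)
k_rows <- E + seq_len(E)
v_rows <- 2L * E + seq_len(E)

# Projecting one input vector with the packed weight gives the same numbers
# as projecting it with the three separate weights and concatenating.
x <- rnorm(E)
packed <- drop(in_proj_weight %*% x)
separate <- c(
  drop(q_proj_weight %*% x),
  drop(k_proj_weight %*% x),
  drop(v_proj_weight %*% x)
)
stopifnot(isTRUE(all.equal(packed, separate)))
```

Under this assumption, a single packed projection does the work of three separate ones.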