nn_multihead_attention        R Documentation
Allows the model to jointly attend to information from different representation subspaces. See reference: Attention Is All You Need (Vaswani et al., 2017).
Usage:

nn_multihead_attention(
embed_dim,
num_heads,
dropout = 0,
bias = TRUE,
add_bias_kv = FALSE,
add_zero_attn = FALSE,
kdim = NULL,
vdim = NULL,
batch_first = FALSE
)
Arguments:

embed_dim      total dimension of the model.

num_heads      number of parallel attention heads. Note that embed_dim will be
               split across num_heads, i.e. each head has dimension
               embed_dim %/% num_heads.

dropout        a dropout probability applied to attn_output_weights. Default: 0.0.

bias           add bias as module parameter. Default: TRUE.

add_bias_kv    add bias to the key and value sequences at dim=0. Default: FALSE.

add_zero_attn  add a new batch of zeros to the key and value sequences at dim=1.
               Default: FALSE.

kdim           total number of features in key. Default: NULL (falls back to
               embed_dim).

vdim           total number of features in value. Default: NULL (falls back to
               embed_dim).

batch_first    if TRUE, the input and output tensors are provided as
               (batch, seq, feature). Default: FALSE (seq, batch, feature).
               A construction sketch follows this table.
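A minimal construction sketch, assuming only the argument semantics above; the dimension values are arbitrary and purely illustrative:

library(torch)

embed_dim <- 16
attn <- nn_multihead_attention(
  embed_dim = embed_dim,
  num_heads = 4,          # embed_dim must be divisible by num_heads
  dropout = 0.1,          # applied to the attention weights
  kdim = 12,              # key feature size when it differs from embed_dim
  vdim = 20,              # value feature size when it differs from embed_dim
  batch_first = FALSE     # inputs are (seq, batch, feature)
)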
\mbox{MultiHead}(Q, K, V) = \mbox{Concat}(\mbox{head}_1, \dots, \mbox{head}_h) W^O
\mbox{where } \mbox{head}_i = \mbox{Attention}(Q W_i^Q, K W_i^K, V W_i^V)
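To make the formula concrete, here is a minimal sketch of a single attention head written with plain torch tensor operations; the sizes are arbitrary and the projections W_i^Q, W_i^K, W_i^V are assumed to have been applied already:

library(torch)

d_k <- 8                                       # per-head dimension
q <- torch_randn(5, d_k)                       # (L, d_k) projected queries
k <- torch_randn(7, d_k)                       # (S, d_k) projected keys
v <- torch_randn(7, d_k)                       # (S, d_k) projected values

scores  <- torch_matmul(q, k$t()) / sqrt(d_k)  # (L, S) scaled dot products
weights <- nnf_softmax(scores, dim = 2)        # normalize over the source axis
head    <- torch_matmul(weights, v)            # (L, d_k) attention output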
Inputs:

query: (L, N, E) where L is the target sequence length, N is the batch size,
and E is the embedding dimension (but see the batch_first argument).

key: (S, N, E) where S is the source sequence length, N is the batch size,
and E is the embedding dimension (but see the batch_first argument).

value: (S, N, E) where S is the source sequence length, N is the batch size,
and E is the embedding dimension (but see the batch_first argument).
key_padding_mask: (N, S) where N is the batch size and S is the source
sequence length. If a ByteTensor is provided, the non-zero positions will be
ignored while the zero positions will be unchanged. If a BoolTensor is
provided, the positions with the value of TRUE will be ignored while the
positions with the value of FALSE will be unchanged.

attn_mask: 2D mask of shape (L, S) where L is the target sequence length and
S is the source sequence length, or 3D mask of shape (N*num_heads, L, S)
where N is the batch size. attn_mask ensures that position i attends only to
the unmasked positions. If a ByteTensor is provided, the non-zero positions
are not allowed to attend while the zero positions will be unchanged. If a
BoolTensor is provided, positions with TRUE are not allowed to attend while
FALSE values will be unchanged. If a FloatTensor is provided, it will be
added to the attention weight. A masking sketch follows this list.
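A minimal masking sketch, assuming the forward method accepts key_padding_mask and attn_mask as named arguments matching the Inputs above; all sizes and mask patterns are arbitrary:

library(torch)

embed_dim <- 16; num_heads <- 4
L <- 5; S <- 7; N <- 2
attn <- nn_multihead_attention(embed_dim, num_heads)

query <- torch_randn(L, N, embed_dim)
key   <- torch_randn(S, N, embed_dim)
value <- torch_randn(S, N, embed_dim)

# boolean key_padding_mask of shape (N, S): TRUE marks key positions to ignore
pad <- matrix(FALSE, nrow = N, ncol = S)
pad[, (S - 1):S] <- TRUE                 # pretend the last two keys are padding
key_padding_mask <- torch_tensor(pad)

# boolean 2D attn_mask of shape (L, S): TRUE positions are not allowed to attend
msk <- matrix(FALSE, nrow = L, ncol = S)
msk[1, S] <- TRUE                        # query position 1 may not see key position S
attn_mask <- torch_tensor(msk)

out <- attn(query, key, value,
            key_padding_mask = key_padding_mask,
            attn_mask = attn_mask)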
Outputs:

attn_output: (L, N, E) where L is the target sequence length, N is the batch
size, and E is the embedding dimension (but see the batch_first argument).

attn_output_weights:

if avg_weights is TRUE (the default), the output attention weights are
averaged over the attention heads, giving a tensor of shape (N, L, S) where
N is the batch size, L is the target sequence length, and S is the source
sequence length.

if avg_weights is FALSE, the attention weight tensor is output as-is, with
shape (N, H, L, S), where H is the number of attention heads. A short sketch
of this option follows.
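A brief sketch of the avg_weights option, assuming it is passed as a forward argument under that name; sizes are arbitrary:

library(torch)

attn <- nn_multihead_attention(embed_dim = 16, num_heads = 4)
q <- torch_randn(5, 2, 16)   # (L, N, E)
k <- torch_randn(7, 2, 16)   # (S, N, E)
v <- torch_randn(7, 2, 16)   # (S, N, E)

out_avg <- attn(q, k, v)                        # averaged weights: (N, L, S)
out_raw <- attn(q, k, v, avg_weights = FALSE)   # per-head weights: (N, H, L, S)

dim(out_avg[[2]])   # expected: 2 5 7
dim(out_raw[[2]])   # expected: 2 4 5 7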
Examples:

if (torch_is_installed()) {
embed_dim <- 16   # illustrative sizes
num_heads <- 4
multihead_attn <- nn_multihead_attention(embed_dim, num_heads)
# inputs are (seq_len, batch_size, embed_dim) because batch_first = FALSE
query <- torch_randn(10, 2, embed_dim)
key <- torch_randn(12, 2, embed_dim)
value <- torch_randn(12, 2, embed_dim)
out <- multihead_attn(query, key, value)
attn_output <- out[[1]]
attn_output_weights <- out[[2]]
}
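A batched-first variant, assuming batch_first behaves as described in the Arguments above:

if (torch_is_installed()) {
# with batch_first = TRUE, inputs and outputs use (batch, seq, feature)
mha <- nn_multihead_attention(embed_dim = 16, num_heads = 4, batch_first = TRUE)
x <- torch_randn(2, 10, 16)   # (N, L, E), illustrative sizes
out <- mha(x, x, x)           # self-attention: query = key = value
dim(out[[1]])                 # expected: 2 10 16
}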