attention_layer | R Documentation
Performs multi-headed attention from from_tensor to to_tensor.
This is an implementation of multi-headed attention based on "Attention Is All You Need". If from_tensor and to_tensor are the same, then this is self-attention. Each timestep in from_tensor attends to the corresponding sequence in to_tensor, and returns a fixed-width vector.
This function first projects from_tensor into a "query" tensor and to_tensor into "key" and "value" tensors. These are (effectively) a list of tensors of length num_attention_heads, where each tensor is of shape [batch_size, seq_length, size_per_head]. Then, the query and key tensors are dot-producted and scaled. These are softmaxed to obtain attention probabilities. The value tensors are then interpolated by these probabilities, then concatenated back to a single tensor and returned.
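The per-head arithmetic described above can be sketched in plain R (a toy, single-head illustration only; attention_layer() itself builds TensorFlow graph operations):

# Toy illustration of scaled dot-product attention for one head, in base R.
set.seed(42)
seq_length <- 4
size_per_head <- 8
query <- matrix(rnorm(seq_length * size_per_head), seq_length, size_per_head)
key   <- matrix(rnorm(seq_length * size_per_head), seq_length, size_per_head)
value <- matrix(rnorm(seq_length * size_per_head), seq_length, size_per_head)

# Dot-product the queries and keys, then scale by 1 / sqrt(size_per_head).
scores <- (query %*% t(key)) / sqrt(size_per_head)

# Softmax each row to obtain attention probabilities.
probs <- exp(scores) / rowSums(exp(scores))

# Each output timestep is a probability-weighted sum of the value vectors.
context <- probs %*% value   # shape [seq_length, size_per_head]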
attention_layer(
  from_tensor,
  to_tensor,
  attention_mask = NULL,
  num_attention_heads = 1L,
  size_per_head = 512L,
  query_act = NULL,
  key_act = NULL,
  value_act = NULL,
  attention_probs_dropout_prob = 0,
  initializer_range = 0.02,
  do_return_2d_tensor = FALSE,
  batch_size = NULL,
  from_seq_length = NULL,
  to_seq_length = NULL
)
from_tensor: Float Tensor of shape [batch_size, from_seq_length, from_width].
to_tensor: Float Tensor of shape [batch_size, to_seq_length, to_width].
attention_mask: (Optional) Integer Tensor of shape [batch_size, from_seq_length, to_seq_length].
num_attention_heads: Integer; number of attention heads.
size_per_head: Integer; size of each attention head.
query_act: (Optional) Activation function for the query transform.
key_act: (Optional) Activation function for the key transform.
value_act: (Optional) Activation function for the value transform.
attention_probs_dropout_prob: (Optional) Numeric; dropout probability of the attention probabilities.
initializer_range: Numeric; range of the weight initializer.
do_return_2d_tensor: Logical. If TRUE, the output will be of shape [batch_size * from_seq_length, num_attention_heads * size_per_head]; if FALSE, of shape [batch_size, from_seq_length, num_attention_heads * size_per_head].
batch_size: (Optional) Integer; if the input is 2D, this might be the batch size of the 3D version of the from_tensor and to_tensor.
from_seq_length: (Optional) Integer; if the input is 2D, this might be the sequence length of the 3D version of the from_tensor.
to_seq_length: (Optional) Integer; if the input is 2D, this might be the sequence length of the 3D version of the to_tensor.
In practice, the multi-headed attention is done with transposes and reshapes rather than with actual separate tensors.
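To illustrate that trick with plain R arrays (the sizes here are hypothetical, and the real implementation performs the analogous operations on TensorFlow tensors):

batch_size <- 2L; seq_length <- 3L; num_heads <- 4L; size_per_head <- 5L
x <- array(rnorm(batch_size * seq_length * num_heads * size_per_head),
           dim = c(batch_size, seq_length, num_heads * size_per_head))
# Split the combined last dimension and move the head dimension forward,
# yielding one [seq_length, size_per_head] slice per head without ever
# materializing num_heads separate tensors.
x_heads <- aperm(
  array(x, dim = c(batch_size, seq_length, size_per_head, num_heads)),
  c(1, 4, 2, 3)
)
dim(x_heads)  # 2 4 3 5, i.e. [batch_size, num_heads, seq_length, size_per_head]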
Float Tensor of shape [batch_size, from_seq_length, num_attention_heads * size_per_head]. If do_return_2d_tensor is TRUE, it will be flattened to shape [batch_size * from_seq_length, num_attention_heads * size_per_head].
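A minimal call sketch showing the returned shape (this assumes a TensorFlow 1.x-style graph via the tensorflow R package; in normal use this function is only called from within transformer_model()):

library(tensorflow)
# Hypothetical 3D input: a batch of 2 sequences of length 8 and width 64.
from_tensor <- tf$placeholder(tf$float32, shape(2L, 8L, 64L))
attn_out <- attention_layer(
  from_tensor = from_tensor,
  to_tensor   = from_tensor,   # same tensor, so this is self-attention
  num_attention_heads = 4L,
  size_per_head = 16L
)
attn_out$shape   # [2, 8, 64], since num_attention_heads * size_per_head = 64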
## Not run:
# Maybe add examples later. For now, this is only called from
# within transformer_model(), so refer to that function.
## End(Not run)