layer_additive_attention: Additive attention layer, a.k.a. Bahdanau-style attention

View source: R/layer-attention.R

layer_additive_attention    R Documentation

Description

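An additive attention layer, also known as Bahdanau-style attention. It scores each query position against each key position with a summed tanh, then uses the softmax of those scores to form a weighted combination of the value tensor.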

Usage

layer_additive_attention(
  object,
  use_scale = TRUE,
  ...,
  causal = FALSE,
  dropout = 0
)

Arguments

object

What to compose the new Layer instance with. Typically a Sequential model or a Tensor (e.g., as returned by layer_input()). The return value depends on object. If object is:

  • missing or NULL, the Layer instance is returned.

  • a Sequential model, the model with an additional layer is returned.

  • a Tensor, the output tensor from layer_instance(object) is returned.

use_scale

If TRUE, creates a trainable variable that is used to scale the attention scores.

...

Standard layer arguments.

causal

Boolean. Set to TRUE for decoder self-attention. Adds a mask such that position i cannot attend to positions j > i. This prevents the flow of information from the future towards the past.

dropout

Float between 0 and 1. Fraction of the units to drop for the attention scores.
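
The mask described for causal can be pictured with a small base R sketch. It is illustrative only: causal_softmax_sketch() is a hypothetical helper, not part of keras, and using -Inf for masked scores is an assumption about one reasonable way to apply the mask, not a description of the layer's internals.

```r
# Hypothetical helper (not part of keras): softmax over a [Tq, Tv] score
# matrix in which position i may only attend to positions j <= i.
causal_softmax_sketch <- function(scores) {
  Tq <- nrow(scores)
  Tv <- ncol(scores)
  # allowed[i, j] is TRUE when j <= i (lower triangle, diagonal included)
  allowed <- outer(seq_len(Tq), seq_len(Tv), function(i, j) j <= i)
  # Assumption: masked entries are set to -Inf so they get zero weight
  masked <- ifelse(allowed, scores, -Inf)
  # Row-wise softmax, shifted by the row maximum for numerical stability
  w <- exp(masked - apply(masked, 1, max))
  w / rowSums(w)
}

# With equal scores, each row spreads its weight evenly over the
# positions it is allowed to see.
w <- causal_softmax_sketch(matrix(0, nrow = 3, ncol = 3))
print(w)
```

Every entry above the diagonal is zero, so no position receives information from a later one.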

Details

Inputs are a query tensor of shape [batch_size, Tq, dim], a value tensor of shape [batch_size, Tv, dim], and an optional key tensor of shape [batch_size, Tv, dim]. If key is not given, value is used for both key and value. The calculation follows these steps:

  1. Reshape query and key into shapes [batch_size, Tq, 1, dim] and [batch_size, 1, Tv, dim] respectively.

  2. Calculate scores with shape [batch_size, Tq, Tv] as a non-linear sum: scores = tf$reduce_sum(tf$tanh(query + key), axis = -1L).

  3. Use scores to calculate a distribution with shape [batch_size, Tq, Tv]: distribution = tf$nn$softmax(scores).

  4. Use distribution to create a linear combination of value with shape [batch_size, Tq, dim]: return tf$matmul(distribution, value).
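
The four steps above can be sketched in base R, without keras or tensorflow, to make the shapes concrete. additive_attention_sketch() is a hypothetical helper written for illustration. It ignores use_scale, causal, and dropout, and it assumes that key falls back to value when omitted.

```r
# Hypothetical helper (not part of keras): additive attention on plain
# R arrays of shape [batch_size, Tq, dim] and [batch_size, Tv, dim].
additive_attention_sketch <- function(query, value, key = value) {
  batch_size <- dim(query)[1]
  Tq <- dim(query)[2]
  Tv <- dim(value)[2]
  output <- array(0, dim = c(batch_size, Tq, dim(value)[3]))
  distribution <- array(0, dim = c(batch_size, Tq, Tv))

  for (b in seq_len(batch_size)) {
    q <- matrix(query[b, , ], nrow = Tq)  # [Tq, dim]
    k <- matrix(key[b, , ], nrow = Tv)    # [Tv, dim]
    v <- matrix(value[b, , ], nrow = Tv)  # [Tv, dim]

    # Steps 1-2: pair every query row with every key row, then
    # scores[i, j] = sum(tanh(q[i, ] + k[j, ]))
    scores <- outer(
      seq_len(Tq), seq_len(Tv),
      Vectorize(function(i, j) sum(tanh(q[i, ] + k[j, ])))
    )

    # Step 3: softmax over the Tv axis (shifted by the row maximum)
    w <- exp(scores - apply(scores, 1, max))
    w <- w / rowSums(w)
    distribution[b, , ] <- w

    # Step 4: linear combination of value rows, shape [Tq, dim]
    output[b, , ] <- w %*% v
  }
  list(output = output, distribution = distribution)
}

set.seed(1)
query <- array(rnorm(2 * 3 * 4), dim = c(2, 3, 4))  # batch_size = 2, Tq = 3
value <- array(rnorm(2 * 5 * 4), dim = c(2, 5, 4))  # Tv = 5, dim = 4
res <- additive_attention_sketch(query, value)
dim(res$output)        # 2 3 4
dim(res$distribution)  # 2 3 5
```

Each row of the distribution sums to 1 along the Tv axis, so every output position is a weighted average of the value rows.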

keras documentation built on May 29, 2024, 3:20 a.m.