attention_bahdanau_monotonic    R Documentation
Monotonic attention mechanism with Bahdanau-style energy function.
Usage

attention_bahdanau_monotonic(
  object,
  units,
  memory = NULL,
  memory_sequence_length = NULL,
  normalize = FALSE,
  sigmoid_noise = 0,
  sigmoid_noise_seed = NULL,
  score_bias_init = 0,
  mode = "parallel",
  kernel_initializer = "glorot_uniform",
  dtype = NULL,
  name = "BahdanauMonotonicAttention",
  ...
)
Arguments

object
    Model or layer object.

units
    The depth of the query mechanism.

memory
    The memory to query; usually the output of an RNN encoder. This tensor should be shaped [batch_size, max_time, ...].

memory_sequence_length
    (optional) Sequence lengths for the batch entries in memory. If provided, the memory tensor rows are masked with zeros for values past the respective sequence lengths.

normalize
    Boolean. Whether to normalize the energy term.

sigmoid_noise
    Standard deviation of pre-sigmoid noise. See the docstring for _monotonic_probability_fn for more information.

sigmoid_noise_seed
    (optional) Random seed for pre-sigmoid noise.

score_bias_init
    Initial value for the score bias scalar. It is recommended to initialize this to a negative value when the length of the memory is large.

mode
    How to compute the attention distribution. Must be one of 'recursive', 'parallel', or 'hard'. See the docstring for tfa.seq2seq.monotonic_attention for more information.

kernel_initializer
    (optional) The name of the initializer for the attention kernel.

dtype
    The data type for the query and memory layers of the attention mechanism.

name
    Name to use when creating ops.

...
    A list that contains other common arguments for layer creation.
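The sketch below is a minimal, untested illustration of constructing the mechanism against a randomly generated memory tensor. The shapes and argument values are assumptions chosen for illustration, and `object` is omitted on the assumption that, as with keras-style wrappers, the mechanism is returned directly when no model or layer object is composed.

library(tensorflow)
library(tfaddons)

# Illustrative dimensions only.
batch_size <- 4L
max_time   <- 50L
enc_depth  <- 64L

# Stand-in for encoder outputs used as the memory:
# shape [batch_size, max_time, enc_depth].
memory <- tf$random$normal(shape = list(batch_size, max_time, enc_depth))

attn <- attention_bahdanau_monotonic(
  units = 128,
  memory = memory,
  memory_sequence_length = rep(max_time, batch_size),
  sigmoid_noise = 1.0,        # pre-sigmoid noise used during training
  score_bias_init = -2.0,     # negative bias suggested for long memories
  mode = "parallel"
)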
Details

This type of attention enforces a monotonic constraint on the attention distributions; that is, once the model attends to a given point in the memory, it cannot attend to any prior points at subsequent output timesteps. It achieves this by using _monotonic_probability_fn instead of softmax to construct its attention distributions. Since the attention scores are passed through a sigmoid, a learnable scalar bias parameter is applied after the score function and before the sigmoid. Otherwise, it is equivalent to BahdanauAttention. This approach is proposed in
Colin Raffel, Minh-Thang Luong, Peter J. Liu, Ron J. Weiss, Douglas Eck, "Online and Linear-Time Attention by Enforcing Monotonic Alignments." ICML 2017. https://arxiv.org/abs/1704.00784
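As a rough sketch of the pre-sigmoid step described above, the per-step choosing probabilities can be pictured as follows. This is not the library's internal implementation; the names and values are hypothetical.

library(tensorflow)

# `scores` stands for the Bahdanau energies over the memory for one
# decoding step; shapes and values are illustrative.
scores        <- tf$random$normal(list(4L, 50L))   # [batch_size, max_time]
score_bias    <- tf$Variable(-2.0)                 # learnable scalar bias (score_bias_init)
sigmoid_noise <- 1.0                               # standard deviation of pre-sigmoid noise

# Bias and noise are applied after the score function and before the
# sigmoid; the noise pushes the probabilities towards 0/1 during training.
noise    <- sigmoid_noise * tf$random$normal(tf$shape(scores))
p_choose <- tf$sigmoid(scores + score_bias + noise)

# p_choose is then turned into a monotonic attention distribution
# according to `mode` ('recursive', 'parallel', or 'hard').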
Value

None