# Fit a peaks-over-threshold model to exceedances over a threshold

### Description

Fit a peaks-over-threshold model, designed specifically for climate data, to exceedance-only data, using the point process approach. Any covariates/predictors/features assumed to vary only between and not within blocks of observations. It includes options for variable weights (useful for local likelihood) and variable proportions of missing data, as well as for bootstrapping to estimate uncertainties. Results can be returned in terms of parameter values, return values, return periods, return probabilities, and differences in either return values or log return probabilities (i.e., log risk ratios).

### Usage

1 2 3 4 5 6 7 8 | ```
fit_pot(y, x = NULL, threshold, locationFun = NULL, scaleFun = NULL,
shapeFun = NULL, nBlocks = nrow(x), blockIndex = NULL, firstBlock = 1,
index = NULL, nReplicates = 1, replicateIndex = NULL, weights = NULL,
proportionMissing = NULL, returnPeriod = NULL, returnValue = NULL,
getParams = FALSE, getFit = FALSE, xNew = NULL, xContrast = NULL,
declustering = NULL, upperTail = TRUE, scaling = 1, bootSE = FALSE,
bootControl = list(), optimArgs = list(method = c("Nelder-Mead", "BFGS")),
initial = NULL, .normalizeX = TRUE, .getInputs = FALSE)
``` |

### Arguments

`y` |
a numeric vector of exceedance values (values of the outcome variable above the threshold). |

`x` |
a data frame, or object that can be converted to a data frame with columns corresponding to covariate/predictor/feature variables and each row containing the values of the variable for a block (e.g., often a year with climate data). The number of rows must equal the number of blocks. |

`threshold` |
a single numeric value for constant threshold or a numeric vector with length equal to the number of blocks, indicating the threshold for each block. |

`locationFun` |
formula, vector of character strings, or indices describing a linear model (i.e., regression function) for the location parameter using columns from |

`scaleFun` |
formula, vector of character strings, or indices describing a linear model (i.e., regression function) for the log of the scale parameter using columns from |

`shapeFun` |
formula, vector of character strings, or indices describing a linear model (i.e., regression function) for the shape parameter using columns from |

`nBlocks` |
number of blocks (e.g., a block will often be a year with climate data); note this value determines the interpretation of return values/periods/probabilities; see |

`blockIndex` |
numeric vector providing the index of the block corresponding to each element of |

`firstBlock` |
single numeric value indicating the numeric value of the first possible block of |

`index` |
(optional) numeric vector providing the integer-valued index (e.g., julian day for daily climate data) corresponding to each element of |

`nReplicates` |
numeric value indicating the number of replicates. |

`replicateIndex` |
numeric vector providing the index of the replicate corresponding to each element of |

`weights` |
a vector or matrix providing the weights by block. When there is only one replicate or the weights do not vary by replicate, a vector of length equal to the number of blocks. When weights vary by replicate, a matrix with rows corresponding to blocks and columns to replicates. Likelihood contribution of each block is multiplied by the corresponding weight. |

`proportionMissing` |
a numeric value, vector or matrix indicating the proportion of missing values in the original dataset before exceedances were selected. When the proportion missing is the same for all blocks and replicates, a single value. When there is only one replicate or the weights do not vary by replicate, a vector of length equal to the number of blocks. When weights vary by replicate, a matrix with rows corresponding to blocks and columns to replicates. |

`returnPeriod` |
numeric value giving the number of blocks for which return values should be calculated. For example a |

`returnValue` |
numeric value giving the value for which return probabilities/periods should be calculated, where the resulting period will be the average number of blocks until the value is exceeded and the probability the probability of exceeding the value in any single block. |

`getParams` |
logical indicating whether to return the fitted parameter values and their standard errors; WARNING: parameter values for models with covariates for the scale parameter must interpreted based on log transformation of the scale parameter. |

`getFit` |
logical indicating whether to return the full fitted model (potentially useful for model evaluation and for understanding optimization problems); note that estimated parameters in the fit object for nonstationary models will not generally match the MLE provided when |

`xNew` |
object of the same form as |

`xContrast` |
object of the same form and dimensions as |

`declustering` |
one of |

`upperTail` |
logical indicating whether one is working with exceedances over a high threshold (TRUE) or exceedances under a low threshold (FALSE); in the latter case, the function works with the negative of the values and the threshold, changing the sign of the resulting location parameters. |

`scaling` |
positive-valued scalar used to scale the data values for more robust optimization performance. When multiplied by the values, it should produce values with magnitude around 1. |

`bootSE` |
logical indicating whether to use the bootstrap to estimate standard errors. |

`bootControl` |
a list of control parameters for the bootstrapping. See ‘Details’. |

`optimArgs` |
a list with named components matching exactly any arguments that the user wishes to pass to |

`initial` |
a list with components named |

`.normalizeX` |
logical indicating whether to normalize 'x' values for better numerical performance; default is TRUE. |

`.getInputs` |
logical indicating whether to return intermediate objects used in fitting. Defaults to |

### Details

This function allows one to fit stationary or nonstationary peaks-over-threshold models using the point process approach. The function can return parameter estimates (which are asymptotically equivalent to GEV model parameters for block maxima data), return value/level for a given return period (number of blocks), and return probabilities/periods for a given return value/level. The function provides standard errors based on the usual MLE asymptotics, with delta-method-based standard errors for functionals of the parameters, but also standard errors based on the nonparametric bootstrap, either resampling by block or by replicate or both.

### Value

The primary outputs of this function are as follows, depending on what is requested via `returnPeriod`

, `returnValue`

, `getParams`

and `xContrast`

:

when `returnPeriod`

is given: for the period given in `returnPeriod`

the return value(s) (`returnValue`

) and its corresponding asymptotic standard error (`se_returnValue`

) and, when `bootSE=TRUE`

, also the bootstrapped standard error (`se_returnValue_boot`

). For nonstationary models, these correspond to the covariate values given in `x`

.

when `returnValue`

is given: for the value given in `returnValue`

, the log exceedance probability (`logReturnProb`

) and the corresponding asymptotic standard error (`se_logReturnProb`

) and, when `bootSE=TRUE`

, also the bootstrapped standard error (`se_logReturnProb_boot`

). This exceedance probability is the probability of exceedance for a single block. Also returned are the log return period (`logReturnPeriod`

) and its corresponding asymptotic standard error (`se_logReturnPeriod`

) and, when `bootSE=TRUE`

, also the bootstrapped standard error (`se_logReturnPeriod_boot`

). For nonstationary models, these correspond to the covariate values given in `x`

. Note that results are on the log scale as probabilities and return times are likely to be closer to normally distributed on the log scale and therefore standard errors are more naturally given on this scale. Confidence intervals for return probabilities/periods can be obtained by exponentiating the interval obtained from plus/minus twice the standard error of the log probabilities/periods.

when `getParams=TRUE`

: the MLE for the model parameters (`mle`

) and corresponding asymptotic standard error (`se_mle`

) and, when `bootSE=TRUE`

, also the bootstrapped standard error (`se_mle_boot`

).

when `xContrast`

is specified for nonstationary models: the difference in return values (`returnValueDiff`

) and its corresponding asymptotic standard error (`se_returnValueDiff`

) and, when `bootSE=TRUE`

, bootstrapped standard error (`se_returnValueDiff_boot`

). These differences correspond to the differences when contrasting each row in `x`

with the corresponding row in `xContrast`

. Also returned are the difference in log return probabilities (i.e., the log risk ratio) (`logReturnProbDiff`

) and its corresponding asymptotic standard error (`se_logReturnProbDiff`

) and, when `bootSE=TRUE`

, bootstrapped standard error (`se_logReturnProbDiff_boot`

).

### Analyzing aggregated observations, such as yearly averages

Aggregated average or summed data such as yearly or seasonal averages can be fit using this function. The best way to do this is to specify `nBlocks`

to be the number of observations (i.e., the length of the observation period, not the number of exceedances). Then any return probabilities can be interpreted as the probabilities for a single block (e.g., a year). If instead `nBlocks`

were one (i.e., a single block) then probabilities would be interpreted as the probability of occurrence in a multi-year block.

### Blocks and replicates

Note that blocks and replicates are related concepts. Blocks are the grouping of values such that return values and return periods are calculated based on the equivalent block maxima (or minima) generalized extreme value model. For example if a block is a year, the return period is the average number of years before the given value is seen, while the return value when `returnPeriod`

is, say, 20, is the value exceeded on average once in 20 years. A given dataset will generally have multiple blocks. In some cases a block may contain only a single value, such as when analyzing yearly sums or averages.

Replicates are repeated datasets, each with the same structure, including the same number of blocks. The additional blocks in multiple replicates could simply be treated as additional blocks without replication, but when the predictor variables and weights are the same across replicates, it is simpler to make use of `nReplicates`

and `replicateIndex`

(see next paragraph). A given replicate might only contain a single block, such as with an ensemble of short climate model runs that are run only for the length of a single block (e.g., a single year). In this case `nBlocks`

should be set to one.

When using multiple replicates (e.g., multiple members of a climate model initial condition ensemble), the standard input format is to append outcome values for additional replicates to the `y`

argument and indicate the replicate ID for each exceedance in `replicateIndex`

. However, if one has different covariate values or thresholds for different replicates, then one needs to treat the additional replicates as providing additional blocks, with only a single replicate. The covariate values can then be included as additional rows in `x`

, with `nBlocks`

reflecting the product of the number of blocks per replicate and the number of replicates and `nReplicates`

set to 1. In this situation, if `declustering`

is specified, make sure to set `index`

such that the values for separate replicates do not overlap with each other, to avoid treating exceedances from different replicates as being sequential or from a contiguous set of values. Similarly, if there is a varying number of replicates by block, then all block-replicate pairs should be treated as individual blocks with a corresponding row in `x`

.

`bootControl`

arguments

The `bootControl`

argument is a list that can supply any of the following components:

seed. Value of the random number seed to set before doing resampling. Defaults to

`0`

.n. Number of bootstrap samples. Defaults to

`250`

.by. Character string, one of

`'block'`

,`'replicate'`

, or`'joint'`

, indicating the basis for the resampling. If`'block'`

, resampled datasets will consist of blocks drawn at random from the original set of blocks; if there are replicates, each replicate will occur once for every resampled block. If`'replicate'`

, resampled datasets will consist of replicates drawn at random from the original set of replicates; all blocks from a replicate will occur in each resampled replicate. Note that this preserves any dependence across blocks rather than assuming independence between blocks. If`'joint'`

resampled datasets will consist of block-replicate pairs drawn at random from the original set of block-replicate pairs. Defaults to`'block'`

.getSample. Logical indicating whether the user wants the full bootstrap sample of parameter estimates and/or return value/period/probability information provided for use in subsequent calculations; if FALSE (the default), only the bootstrap-based estimated standard errors are returned.

### Author(s)

Christopher J. Paciorek

### References

Coles, S. 2001. An Introduction to Statistical Modeling of Extreme Values. Springer.

### Examples

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 | ```
# setup Fort precipitation data
require(extRemes)
data(Fort)
firstBlock <- min(Fort$year)
years <- min(Fort$year):max(Fort$year)
nYears <- length(years)
threshold <- 0.395
ord <- order(Fort$year, Fort$month, Fort$day)
Fort <- Fort[ord, ]
ind <- Fort$Prec > threshold
FortExc <- Fort[ind, ]
# stationary fit
out <- fit_pot(FortExc$Prec, threshold = threshold, nBlocks = nYears,
blockIndex = FortExc$year, firstBlock = firstBlock,
returnPeriod = 20, returnValue = 3.5,
getParams = TRUE, bootSE = FALSE)
# fit with location linear in year
out <- fit_pot(FortExc$Prec, x = data.frame(years = years), threshold = threshold,
locationFun = ~years, nBlocks = nYears,
blockIndex = FortExc$year, firstBlock = firstBlock,
returnPeriod = 20, returnValue = 3.5,
getParams = TRUE, xNew = data.frame(years = range(Fort$year)), bootSE = FALSE)
# with declustering
index <- seq_len(nrow(Fort))
out <- fit_pot(FortExc$Prec, x = data.frame(years = years), threshold = threshold,
locationFun = ~years, nBlocks = nYears,
blockIndex = FortExc$year, firstBlock = firstBlock, index = index[ind],
returnPeriod = 20, returnValue = 3.5,
getParams = TRUE, xNew = data.frame(years = range(Fort$year)),
declustering = 'noruns', bootSE = FALSE)
# with replicates; for illustration here, I'll just duplicate the Fort data
Fort2x <- rbind(FortExc, FortExc)
n <- nrow(FortExc)
out <- fit_pot(Fort2x$Prec, x = data.frame(years = years), threshold = threshold,
locationFun = ~years, nBlocks = nYears,
blockIndex = Fort2x$year, firstBlock = firstBlock,
nReplicates = 2, replicateIndex = c(rep(1, n), rep(2, n)),
returnPeriod = 20, returnValue = 3.5,
getParams = TRUE, xNew = data.frame(years = range(Fort$year)), bootSE = FALSE)
# analysis of seasonal total precipitation
FortSummer <- Fort[Fort$month %in% 6:8, ] # summer precipitation
FortSummerSum <- aggregate(Prec ~ year, data = FortSummer, sum)
threshold <- quantile(FortSummerSum$Prec, 0.8)
FortSummerSumExc <- FortSummerSum[FortSummerSum$Prec > threshold, ]
# each year (single observation) treated as a block, so return probability
# can be interpreted as probability of exceeding a value in a single year
out <- fit_pot(FortSummerSumExc$Prec, x = data.frame(years = years), threshold = threshold,
nBlocks = nYears, blockIndex = FortSummerSumExc$year, firstBlock = firstBlock,
locationFun = ~years, returnPeriod = 20,
returnValue = 10, getParams = TRUE, xNew = data.frame(years = range(Fort$year)),
bootSE = FALSE)
``` |