knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library("MultiHazard") library("dplyr") library("scales") library("statmod") library("CDVine")
The MultiHazard package provides tools for stationary multivariate statistical modeling, for example, of the joint distribution of MULTIple co-occurring HAZARDs. The package contains functions for pre-processing data, including imputing missing values, detrending and declustering time series (Section 2), as well as analyzing pairwise correlations over a range of lags (Section 3). Functionality is also built in to conditionally sample a bivariate dataset (given one of the variables is above a predetermined threshold) and to select the best fitting among an array of parametric (extreme and non-extreme, truncated and non-truncated) marginal distributions or copulas, which allow the dependence structure between the set of variables to be modeled independently of the marginal distributions (Section 3). Joint probability contours can be estimated using the method of overlaying (conditional) contours given in Bender et al. (2016); for a given return period, design events can then be extracted assuming full dependence, as well as the "most likely" event or an ensemble of possible design events once dependence is accounted for (Section 4). The package also provides the capability of fitting and simulating synthetic records from three higher dimensional approaches: standard (elliptic/Archimedean) copulas, Pair Copula Constructions (PCCs) and the conditional threshold exceedance approach of Heffernan and Tawn (2004) (Section 5).
Well G_3356 records the groundwater level at Site S20; however, it contains missing values. Let's impute the missing values in the record at Well G_3356 using recordings at nearby Well G_3355. First, read in the two time series.
#Viewing the first few rows of the groundwater level records
head(G_3356)
head(G_3355)
#Converting the Date column to a "Date" object
G_3356$Date<-seq(as.Date("1985-10-23"), as.Date("2019-05-29"), by="day")
G_3355$Date<-seq(as.Date("1985-08-20"), as.Date("2019-06-02"), by="day")
#Converting the column containing the readings to a "numeric" object
G_3356$Value<-as.numeric(as.character(G_3356$Value))
G_3355$Value<-as.numeric(as.character(G_3355$Value))
A warning message confirms there are NAs in the record at Well G_3356. Before carrying out the imputation, the two data frames need to be merged.
#Merging the two data frames by date
GW_S20<-left_join(G_3356,G_3355,by="Date")
colnames(GW_S20)<-c("Date","G3356","G3355")
#Carrying out the imputation
Imp<-Imputation(Data=GW_S20,Variable="G3356",
                x_lab="G-3355 (ft NGVD 29)",y_lab="G-3356 (ft NGVD 29)")
The completed record is given in the ValuesFilled column of the data frame outputted as the Data object, while the linear regression model, including its coefficient of determination, is given by the Model output argument.
head(Imp$Data)
Imp$Model
Are any values still NA?
G_3356_ValueFilled_NA<-which(is.na(Imp$Data$ValuesFilled)==TRUE)
length(G_3356_ValueFilled_NA)
Linearly interpolating the three remaining NAs.
G3356_approx<-approx(seq(1,length(Imp$Data$ValuesFilled),1),Imp$Data$ValuesFilled,
                     xout=seq(1,length(Imp$Data$ValuesFilled),1))
Imp$Data$ValuesFilled[which(is.na(Imp$Data$ValuesFilled)==TRUE)]<-
  G3356_approx$y[which(is.na(Imp$Data$ValuesFilled)==TRUE)]
In the analysis, the completed O-sWL (Ocean-side Water Level) and groundwater level series are subsequently detrended. The Detrend() function uses either a linear fit covering the entire record (Method="linear") or a moving average window (Method="window") of a specified length (Window_Width) to remove trends from a time series. The residuals are added to the average of the final End_Length observations.
The default Detrend() parameters specify a moving average (Method="window") with a three-month window (Window_Width=89) to remove any seasonality from the time series. The default is then to add the residuals to the average of the final five years of observations (End_Length=1826) to bring the record to the present day level, accounting for the perigean tide in the case of O-sWL. The mean of the observations over the first three months is subtracted from the values during this period before the present day (five-year) average is added. The following R code detrends the record at Well G_3356. Note the function requires a Date object and the completed series.
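Conceptually, window detrending amounts to subtracting a centered rolling mean and re-centering the residuals. The minimal sketch below illustrates the idea using base R's stats::filter(); it is not the package's internal implementation.
#Illustrative sketch only - not the package code
x<-Imp$Data$ValuesFilled
#89-day centered moving average as the trend estimate
trend<-stats::filter(x,rep(1/89,89),sides=2)
#Residuals re-centered on the mean of the final 1826 observations (~5 years)
detrended_sketch<-x-trend+mean(tail(x,1826),na.rm=TRUE)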
#Creating a data frame with the imputed series alongside the corresponding dates
G_3356_Imp<-data.frame(Imp$Data$Date,Imp$Data$ValuesFilled)
colnames(G_3356_Imp)<-c("Date","ValuesFilled")
#Detrending
G_3356_Detrend<-Detrend(Data=G_3356_Imp,PLOT=TRUE,x_lab="Date",
                        y_lab="Groundwater level (ft NGVD 29)")
Output of the Detrend() function is simply the detrended time series.
head(G_3356_Detrend)
Creating a data frame containing the detrended groundwater series at Site S20, i.e., G_3356_Detrend, and the corresponding dates.
S20.Groundwater.Detrend.df<-data.frame(as.Date(GW_S20$Date),G_3356_Detrend)
colnames(S20.Groundwater.Detrend.df)<-c("Date","Groundwater")
The Decluster() function declusters a time series using a threshold u, specified as a quantile of the completed series, and a separation criterion SepCrit to ensure independent events. If mu=365.25 then SepCrit denotes the minimum number of days readings must remain below the threshold before a new event is defined.
G_3356.Declustered<-Decluster(Data=G_3356_Detrend,u=0.95,SepCrit=3,mu=365.25)
Plot showing the completed, detrended record at Well G-3356 (grey circles) along with cluster maxima (red circles) identified using a 95% threshold (green line) and three day separation criterion.
G_3356_Imp$Detrend<-G_3356_Detrend
plot(as.Date(G_3356_Imp$Date),G_3356_Imp$Detrend,col="Grey",pch=16,
     cex=0.25,xlab="Date",ylab="Groundwater level (ft NGVD 29)")
abline(h=G_3356.Declustered$Threshold,col="Dark Green")
points(as.Date(G_3356_Imp$Date[G_3356.Declustered$EventsMax]),
       G_3356.Declustered$Declustered[G_3356.Declustered$EventsMax],
       col="Red",pch=16,cex=0.5)
Other outputs from the Decluster() function include the threshold on the original scale
G_3356.Declustered$Threshold
and the number of events per year
G_3356.Declustered$EventsPerYear
In preparation for later work, let's assign the detrended and declustered groundwater series at Site S20 a name.
S20.Groundwater.Detrend.Declustered<-G_3356.Declustered$Declustered
Reading in the other series at Site S20: rainfall and O-sWL.
S20.Rainfall.df<-Perrine_df
S20.Rainfall.df$Date<-as.Date(S20.Rainfall.df$Date)
S20.Rainfall.df$Na.Removed<-S20.Rainfall.df$Value
S20.Rainfall.df$Na.Removed[which(is.na(S20.Rainfall.df$Value)==TRUE)]<-0
S20.OsWL.df<-S20_T_MAX_Daily_Completed_Detrend_Declustered[,c(2,4)]
S20.OsWL.df$Date<-as.Date(S20.OsWL.df$Date)
Detrending the O-sWL series at Site S20.
S20.OsWL.Detrend<-Detrend(Data=S20.OsWL.df,Method = "window",PLOT=FALSE, x_lab="Date",y_lab="O-sWL (ft NGVD 29)")
Creating a data frame with the date alongside the detrended O-sWL series.
S20.OsWL.Detrend.df<-data.frame(as.Date(S20.OsWL.df$Date),S20.OsWL.Detrend)
colnames(S20.OsWL.Detrend.df)<-c("Date","OsWL")
Declustering the rainfall and O-sWL series at Site S20.
#Declustering the rainfall series (missing values were set to zero above)
S20.Rainfall.Declustered<-Decluster(Data=S20.Rainfall.df$Na.Removed,
                                    u=0.95,SepCrit=3)$Declustered
#Restoring NAs at the dates of missing rainfall observations
S20.Rainfall.Declustered[which(is.na(S20.Rainfall.df$Value)==TRUE)]<-NA
#The O-sWL series does not contain any missing values and so can be
#declustered without this step
S20.OsWL.Detrend.Declustered<-Decluster(Data=S20.OsWL.Detrend,
                                        u=0.95,SepCrit=3,mu=365.25)$Declustered
Creating data frames with the date alongside the declustered series.
S20.OsWL.Detrend.Declustered.df<-data.frame(S20.OsWL.df$Date,S20.OsWL.Detrend.Declustered)
colnames(S20.OsWL.Detrend.Declustered.df)<-c("Date","OsWL")
S20.Rainfall.Declustered.df<-data.frame(S20.Rainfall.df$Date,S20.Rainfall.Declustered)
colnames(S20.Rainfall.Declustered.df)<-c("Date","Rainfall")
S20.Groundwater.Detrend.Declustered.df<-data.frame(G_3356$Date,S20.Groundwater.Detrend.Declustered)
colnames(S20.Groundwater.Detrend.Declustered.df)<-c("Date","Groundwater")
Use the Dataframe_Combine function to create data frames containing all observations of the original, detrended if necessary, and declustered time series. On dates where not all variables are observed, missing values are assigned NA.
S20.Detrend.df<-Dataframe_Combine(data.1=S20.Rainfall.df,
                                  data.2=S20.OsWL.Detrend.df,
                                  data.3=S20.Groundwater.Detrend.df,
                                  names=c("Rainfall","OsWL","Groundwater"))
S20.Detrend.Declustered.df<-Dataframe_Combine(data.1=S20.Rainfall.Declustered.df,
                                              data.2=S20.OsWL.Detrend.Declustered.df,
                                              data.3=S20.Groundwater.Detrend.Declustered.df,
                                              names=c("Rainfall","OsWL","Groundwater"))
The GPD_Fit() function fits a generalized Pareto distribution (GPD) to observations above a threshold u, specified as a quantile of the completed time series. To fit the distribution, the GPD_Fit() function requires the declustered series as its Data argument and the entire completed series, detrended if necessary, as its Data_Full argument. The completed series is required to calculate the value on the original scale corresponding to u. If PLOT="TRUE" then diagnostic plots are produced to allow an assessment of the fit.
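For reference, the GPD fitted to the exceedances $y=x-u$ has distribution function $$H(y)=1-\left(1+\frac{\xi y}{\sigma}\right)^{-1/\xi},$$ defined on $\{y>0:1+\xi y/\sigma>0\}$, where $\sigma>0$ and $\xi$ are the scale and shape parameters, respectively (the $\xi=0$ case is interpreted as the exponential limit $1-e^{-y/\sigma}$).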
GPD_Fit(Data=S20.Detrend.Declustered.df$Rainfall,Data_Full=na.omit(S20.Detrend.df$Rainfall),
        u=0.997,PLOT="TRUE",xlab_hist="Rainfall (Inches)",y_lab="Rainfall (Inches)")
We can use the Kendall_Lag function to view the Kendall rank correlation coefficient $\tau$ between the time series over a range of lags.
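For reference, for a bivariate sample $(x_{1},y_{1}),\dots,(x_{n},y_{n})$ without ties, Kendall's rank correlation is $$\tau=\frac{2}{n(n-1)}\sum_{i<j}\mathrm{sgn}(x_{i}-x_{j})\,\mathrm{sgn}(y_{i}-y_{j}),$$ i.e., the difference between the proportions of concordant and discordant pairs.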
S20.Kendall.Results<-Kendall_Lag(Data=S20.Detrend.df,GAP=0.2)
Let's extract the Kendall correlation coefficient values between rainfall and O-sWL for lags of $-5,\dots,0,\dots,5$ applied to the latter quantity
S20.Kendall.Results$Value$Rainfall_OsWL
and the corresponding p-values testing the null hypothesis $\tau =0$
S20.Kendall.Results$Test$Rainfall_OsWL_Test
In the report, the 2D analysis considers the two forcings currently accounted for in structural design assessments undertaken by SFWMD: rainfall and O-sWL. The 2D analysis commences with the well-established two-sided conditional sampling approach, where excesses of a conditioning variable are paired with co-occurring values of the other variable to create two samples. For each sample, the marginals (one extreme, one non-extreme) and the joint distribution are then modeled.
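As a minimal illustration of the conditioning step (ignoring the declustering that the package's Con_Sampling_2D() function also performs), one of the two samples can be formed as follows; this is a sketch, not the package's implementation.
#Illustrative sketch only: pair rainfall exceedances of the 97th percentile
#with the co-occurring O-sWL values
u.rain<-quantile(S20.Detrend.df$Rainfall,0.97,na.rm=TRUE)
con.sample<-S20.Detrend.df[which(S20.Detrend.df$Rainfall>u.rain),c("Rainfall","OsWL")]
head(na.omit(con.sample))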
The two (conditional) joint distributions are modeled independently of the marginals by using a copula. The Copula_Threshold_2D() function explores the sensitivity of the best fitting copula, in terms of the Akaike Information Criterion (AIC), to allow the practitioner to make an informed choice with regards to threshold selection. It undertakes the conditional sampling described above and reports the best fitting bivariate copula. The procedure is carried out for a single threshold or a range of thresholds specified by the Thres argument, and the procedure is automatically repeated with the variables switched.
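For reference, for a copula with $k$ parameters and maximized likelihood $\hat{L}$, $$AIC=2k-2\log\hat{L},$$ with lower values indicating a better fit after penalizing model complexity.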
Copula_Threshold_2D(Data_Detrend=S20.Detrend.df[,-c(1,4)],
                    Data_Declust=S20.Detrend.Declustered.df[,-c(1,4)],
                    y_lim_min=-0.075,y_lim_max=0.25,
                    Upper=c(2,9),Lower=c(2,10),GAP=0.15)
The Diag_Non_Con() function is designed to aid in the selection of the appropriate (non-extreme) unbounded marginal distribution for the non-conditioned variable.
S20.Rainfall<-Con_Sampling_2D(Data_Detrend=S20.Detrend.df[,-c(1,4)],
                              Data_Declust=S20.Detrend.Declustered.df[,-c(1,4)],
                              Con_Variable="Rainfall",Thres=0.97)
Diag_Non_Con(Data=S20.Rainfall$Data$OsWL,x_lab="O-sWL (ft NGVD)",
             y_lim_min=0,y_lim_max=1.5)
The Diag_Non_Con_Sel() function is similar to the Diag_Non_Con() command, but only plots the probability density function and cumulative distribution function of a (single) selected univariate distribution, in order to more clearly demonstrate the goodness of fit of a particular distribution. The options are the Gaussian (Gaus) and logistic (Logis) distributions.
Diag_Non_Con_Sel(Data=S20.Rainfall$Data$OsWL,x_lab="O-sWL (ft NGVD)", y_lim_min=0,y_lim_max=1.5,Selected="Logis")
A generalized Pareto distribution is fitted to the marginal distribution of the conditioning variable, i.e., the declustered excesses identified using Con_Sampling_2D().
Use GPD_Fit to model the rainfall excesses in the S20.Rainfall$Data sample.
GPD_Fit(Data=S20.Rainfall$Data$Rainfall,Data_Full=S20.Rainfall$Data$Rainfall,u=0, PLOT="TRUE",xlab_hist="Rainfall (Inches)",y_lab="Rainfall (Inches)")
The process of selecting a conditional sample and fitting marginal distributions is repeated, but instead conditioning on O-sWL. The non-conditioned variable in this case is (total daily) rainfall, which has a lower bound at zero and thus requires a suitably truncated distribution. The Diag_Non_Con_Trunc function fits a selection of truncated distributions to a vector of data. The Diag_Non_Con_Trunc_Sel function is analogous to the Diag_Non_Con_Sel function; the available distributions are the Birnbaum-Saunders (BS), exponential (Exp), gamma (Gam), inverse Gaussian (InvG), lognormal (LogN), Tweedie (Twe) and Weibull (Weib).
S20.OsWL<-Con_Sampling_2D(Data_Detrend=S20.Detrend.df[,-c(1,4)],
                          Data_Declust=S20.Detrend.Declustered.df[,-c(1,4)],
                          Con_Variable="OsWL",Thres=0.98)
Diag_Non_Con_Trunc(Data=S20.OsWL$Data$Rainfall+0.001,x_lab="Rainfall (Inches)",
                   y_lim_min=0,y_lim_max=2)
Diag_Non_Con_Trunc_Sel(Data=S20.OsWL$Data$Rainfall+0.001,x_lab="Rainfall (Inches)",
                       y_lim_min=0,y_lim_max=2,Selected="BS")
Using GPD_Fit to assess the fit of the GPD to the O-sWL excesses.
GPD_Fit(Data=S20.OsWL$Data$OsWL,Data_Full=S20.OsWL$Data$OsWL,u=0, PLOT="TRUE",xlab_hist="OsWL (ft NGVD)",y_lab="OsWL (ft NGVD)")
The Design_Event_2D() function finds the isoline associated with a particular return period by overlaying the two corresponding isolines from the joint distributions fitted to the conditional samples, using the method in Bender et al. (2016). Design_Event_2D() requires the copula families chosen to model the dependence structure in the two conditional samples as input.
S20.Copula.Rainfall<-Copula_Threshold_2D(Data_Detrend=S20.Detrend.df[,-c(1,4)],
                                         Data_Declust=S20.Detrend.Declustered.df[,-c(1,4)],
                                         Thres=0.97,y_lim_min=-0.075,y_lim_max=0.25,
                                         Upper=c(2,9),Lower=c(2,10),GAP=0.15)$Copula_Family_Var1
S20.Copula.OsWL<-Copula_Threshold_2D(Data_Detrend=S20.Detrend.df[,-c(1,4)],
                                     Data_Declust=S20.Detrend.Declustered.df[,-c(1,4)],
                                     Thres=0.97,y_lim_min=-0.075,y_lim_max=0.25,
                                     Upper=c(2,9),Lower=c(2,10),GAP=0.15)$Copula_Family_Var2
As input the function requires:

* Data = the original (detrended) rainfall and O-sWL series
* Data_Con1/Data_Con2 = the two conditionally sampled datasets
* Thres1/Thres2 = the two thresholds associated with the conditionally sampled datasets
* Copula_Family1/Copula_Family2 = the families of the two fitted copulas
* Marginal_Dist1/Marginal_Dist2 = the selected non-extreme marginal distributions
* HazScen = the hazard scenario (AND/OR)
* RP = the return period of interest
* N = the size of the sample from the fitted joint distributions used to estimate the density along the isoline of interest
* N_Ensemble = the size of the ensemble of events sampled along the isoline of interest
S20.Bivariate<-Design_Event_2D(Data=S20.Detrend.df[,-c(1,4)],
                               Data_Con1=S20.Rainfall$Data,
                               Data_Con2=S20.OsWL$Data,
                               Thres1=0.97,Thres2=0.97,
                               Copula_Family1=S20.Copula.Rainfall,
                               Copula_Family2=S20.Copula.OsWL,
                               Marginal_Dist1="Logis",Marginal_Dist2="Twe",
                               x_lab="Rainfall (Inches)",y_lab="O-sWL (ft NGVD 29)",
                               RP=100,N=10,N_Ensemble=10)
Design event according to the "Most likely" event approach (diamond in the plot)
S20.Bivariate$MostLikelyEvent
Design event under the assumption of full dependence (triangle in the plot)
S20.Bivariate$FullDependence
In the report, three higher-dimensional (>2 dimensions) approaches are implemented to model the joint distribution of rainfall, O-sWL and groundwater level; they are:

* standard (elliptic/Archimedean) copulas
* Pair Copula Constructions (PCCs)
* the conditional threshold exceedance approach of Heffernan and Tawn (2004)
In the package, each approach has a _Fit and a _Sim function. The latter requires a MIGPD object as its Marginals input argument, in order for the simulations on $[0,1]^3$ to be transformed back to the original scale. The Migpd_Fit command fits independent GPDs to the data in each column of a data frame (excluding the first column if it is a "Date" object), creating a MIGPD object.
S20.Migpd<-Migpd_Fit(Data=S20.Detrend.Declustered.df[,-1],mqu=c(0.975,0.975,0.9676))
summary(S20.Migpd)
Standard (trivariate) copulas are the most conceptually simple of the copula-based models, using a single parametric multivariate probability distribution as the copula. The Standard_Copula_Fit() function fits an elliptic (specified by Gaussian or tcop) or Archimedean (specified by Gumbel, Clayton or Frank) copula to a trivariate dataset.
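The underlying idea is Sklar's theorem: the joint distribution can be written as $$F(x_{1},x_{2},x_{3})=C\left(F_{1}(x_{1}),F_{2}(x_{2}),F_{3}(x_{3})\right),$$ where $F_{1},F_{2},F_{3}$ are the marginal distribution functions and $C$ is the copula capturing the dependence.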
Let's first fit a Gaussian copula
S20.Gaussian<-Standard_Copula_Fit(Data=S20.Detrend.df,Copula_Type="Gaussian")
From which the Standard_Copula_Sim() function can be used to simulate a synthetic record of N years
S20.Gaussian.Sim<-Standard_Copula_Sim(Data=S20.Detrend.df,Marginals=S20.Migpd, Copula=S20.Gaussian,N=100)
Plotting the observed and simulated values
S20.Pairs.Plot.Data<-data.frame(rbind(na.omit(S20.Detrend.df[,-1]),S20.Gaussian.Sim$x.Sim),
                                c(rep("Observation",nrow(na.omit(S20.Detrend.df))),
                                  rep("Simulation",nrow(S20.Gaussian.Sim$x.Sim))))
colnames(S20.Pairs.Plot.Data)<-c(names(S20.Detrend.df)[-1],"Type")
pairs(S20.Pairs.Plot.Data[,1:3],
      col=ifelse(S20.Pairs.Plot.Data$Type=="Observation","Black",alpha("Red",0.3)),
      upper.panel=NULL,pch=16)
The Standard_Copula_Sel() function can be used to deduce the best fitting standard copula in terms of AIC
Standard_Copula_Sel(Data=S20.Detrend.df)
Standard trivariate copulas lack the flexibility to model joint distributions where heterogeneous dependencies exist between the variable pairs. Pair copula constructions build a multivariate distribution using a cascade of bivariate copulas (some of which are conditional). As the dimensionality of the problem increases, the number of mathematically equally valid decompositions quickly becomes large. Bedford and Cooke (2001, 2002) introduced the regular vine, a graphical model which helps to organize the possible decompositions. The canonical (C-) and D-vine are two commonly utilized sub-categories of regular vines; in the trivariate case, a vine copula is simultaneously a C- and D-vine.
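For intuition, one possible trivariate decomposition of the joint density is $$f(x_{1},x_{2},x_{3})=f_{1}(x_{1})\,f_{2}(x_{2})\,f_{3}(x_{3})\,c_{12}\,c_{23}\,c_{13|2},$$ where the $c$ terms are bivariate copula densities and $c_{13|2}$ is conditional on the middle variable. Let's fit a regular vine copula model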
S20.Vine<-Vine_Copula_Fit(Data=S20.Detrend.df)
From which the Vine_Copula_Sim() function can be used to simulate a synthetic record of N years
S20.Vine.Sim<-Vine_Copula_Sim(Data=S20.Detrend.df,Marginals=S20.Migpd, Vine_family=S20.Vine$Family,Vine_par=S20.Vine$Par,Vine_par2=S20.Vine$Par2,N=100)
Plotting the observed and simulated values
S20.Pairs.Plot.Data<-data.frame(rbind(na.omit(S20.Detrend.df[,-1]),S20.Vine.Sim$x.Sim),
                                c(rep("Observation",nrow(na.omit(S20.Detrend.df))),
                                  rep("Simulation",nrow(S20.Vine.Sim$x.Sim))))
colnames(S20.Pairs.Plot.Data)<-c(names(S20.Detrend.df)[-1],"Type")
pairs(S20.Pairs.Plot.Data[,1:3],
      col=ifelse(S20.Pairs.Plot.Data$Type=="Observation","Black",alpha("Red",0.3)),
      upper.panel=NULL,pch=16)
Finally, let's implement the Heffernan and Tawn (2004) approach, where a non-linear regression model is fitted to the (joint) observations where a (conditioning) variable is above a specified threshold. The regression model typically adopted is $$\textbf{Y}_{-i}=\textbf{a}Y_{i}+Y_{i}^{\textbf{b}}\textbf{Z} \hspace{1cm} \text{for} \hspace{1cm} Y_{i}>v,$$ where $\textbf{Y}$ is the set of variables transformed to a common scale, $\textbf{Y}_{-i}$ is the set of variables excluding $Y_{i}$, $\textbf{a}$ and $\textbf{b}$ are vectors of regression parameters, and $\textbf{Z}$ is a vector of residuals. The dependence structure when a specified variable is extreme is thus captured by the regression parameters and the joint residuals. The procedure is repeated, conditioning on each variable in turn, to build up the joint distribution when at least one variable is in an extreme state. The HT04 command fits the model and simulates N years' worth of events from it.
S20.HT04<-HT04(data_Detrend_Dependence_df=S20.Detrend.df, data_Detrend_Declustered_df=S20.Detrend.Declustered.df, u_Dependence=0.995,Migpd=S20.Migpd,mu=365.25,N=1000)
Output of the function includes the three conditional models (Model), the proportion of occasions on which each variable is most extreme, given that at least one variable is extreme (Prop), as well as the simulations on the transformed scale (u.sim; Gumbel by default) and on the original scale (x.sim). Let's view the fitted model when conditioning on rainfall
S20.HT04$Model$Rainfall
and the proportion of occasions in the original sample on which rainfall is the most extreme of the drivers, given that at least one driver is extreme.
S20.HT04$Prop
The HT04 approach uses rejection sampling to generate synthetic records. The first step involves sampling a variable, conditioned to exceed the u_Dependence threshold. A joint residual associated with the corresponding regression is independently sampled, and the other variables are estimated using the fitted regression parameters. If the variable conditioned to be extreme in step one is not the most extreme, the sample is rejected. The process is repeated until the relative proportion of simulated events where each variable is a maximum, conditional on being above the threshold, is consistent with the empirical distribution.
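The following toy sketch illustrates a single accept/reject step on the Gumbel scale; the parameter values a and b, the threshold v, and the residual distribution are hypothetical stand-ins, not quantities taken from the fitted model.
#Toy illustration of one HT04 rejection step (hypothetical parameter values)
a<-0.5; b<-0.2; v<-4.6
y.i<-v+rexp(1)            #Conditioning variable sampled above the threshold
z<-rnorm(1)               #Stand-in for a sampled joint residual
y.other<-a*y.i+(y.i^b)*z  #Other variable from the regression equation
accept<-(y.i>=y.other)    #Reject if the conditioning variable is not the largest
Labelling the simulations S20.HT04.Sim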
S20.HT04.Sim<-S20.HT04$x.sim
and now plotting the simulations from the HT04 model
S20.Pairs.Plot.Data<-data.frame(rbind(na.omit(S20.Detrend.df[,-1]),S20.HT04.Sim),
                                c(rep("Observation",nrow(na.omit(S20.Detrend.df))),
                                  rep("Simulation",nrow(S20.HT04.Sim))))
colnames(S20.Pairs.Plot.Data)<-c(names(S20.Detrend.df)[-1],"Type")
pairs(S20.Pairs.Plot.Data[,1:3],
      col=ifelse(S20.Pairs.Plot.Data$Type=="Observation","Black",alpha("Red",0.2)),
      upper.panel=NULL,pch=16)