The `mms` package provides tools to do model selection among matrix models via a
leave-n-out cross validation and resampling approach. Section \ref{sect:intromm}
provides a statement of what matrix modelling is, with references to
background materials. Section \ref{sect:whymodselect} states why model
selection for matrix models is needed. Section \ref{sect:exdata} describes an example
dataset that is used in section \ref{sect:demoondata} to illustrate
how `mms` package functions are used. Section \ref{sect:methoddetails}
provides a connection to a detailed description, provided elsewhere,
of the model selection methods implemented in the `mms` package.
Section \ref{sect:simstudy} provides a brief simulation study
demonstrating the effectiveness of the methods.
If $i=1,\ldots,P$ are sampling locations or other identifiers and the $P \times P$
matrices $R$ and $M_1,\ldots,M_n$ quantify different measures of pairwise similarities
or differences between these identifiers, then matrix regression extends the
widely used Mantel test by assessing the relationship between the response
variable $R$ and the multiple predictor matrices $M_i$, while properly
accounting for the non-independence that is intrinsic to pairwise
comparison matrices [@legendre_1994; @lichstein_2007; @walter_2017].
For instance, if time series of populations were measured over the same
times in locations $1,\ldots,P$ then the $ij^{\text{th}}$ entry of $R$
could be a measure (e.g., correlation) of the synchrony through time
between the population time series
of locations $i$ and $j$. The $ij^{\text{th}}$ entry of $M_1$ could contain
geographic distances
between the locations $i$ and $j$. And the $ij^{\text{th}}$ entries of
$M_l$ for $l=2,\ldots,n$ could contain correlations of time series of
different environmental variables (one variable for each $l=2,\ldots,n$)
measured for the same locations and times as the populations. With matrix regression,
one can perform tests of the
statistical importance of any of the $M_i$ ($i=1,\ldots,n$) as a predictor of $R$
while controlling for the other predictors. The `matregtest` function in the
`mms` package implements such tests. The references cited above
give background on matrix modelling, and the `ecodist` package provides
additional code supporting matrix regressions. Matrix regression
is increasingly used in ecology and other
fields [@legendre_1998; @bellamy_2003; @powney_2012; @haynes_2013; @anderson_2017].
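
To make the setup concrete, here is a minimal, self-contained sketch of how one might assemble $R$, $M_1$, and $M_2$ and then test a predictor with `matregtest`. All data objects below (`pops`, `env`, the coordinates) are randomly generated purely for illustration and do not come from any real dataset:

```{r}
library(mms)
set.seed(1)
P<-10; Tlen<-30
pops<-matrix(rnorm(Tlen*P),Tlen,P)  # fake population time series, one column per location
env<-matrix(rnorm(Tlen*P),Tlen,P)   # a fake environmental variable at the same locations
R<-cor(pops)                        # response: pairwise synchrony of the populations
M1<-as.matrix(dist(cbind(runif(P),runif(P)))) # predictor: pairwise geographic distances
M2<-cor(env)                        # predictor: pairwise synchrony of the environment
# does distance (M1) matter as a predictor of R, controlling for the environmental predictor (M2)?
res<-matregtest(mats=list(R,M1,M2),pred=2:3,drop=2,numperm=100)
res$p
```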
Model selection via the Akaike or Bayesian Information Criteria (AIC, BIC)
is now very commonly used to identify the relative support of data for
alternative models, which commonly represent alternative biological mechanisms
that may explain an observed pattern [@burnham_2002; @clark_2006].
The practical advantages of model selection for matrix models over standard
tests comparing a matrix regression model (e.g., with response $R$ and predictors
$M_1,\ldots,M_n$, using notation from section \ref{sect:intromm}) to a
simpler/nested model (e.g., with response $R$ and predictors some subset
of $M_1,\ldots,M_n$; such tests are implemented by `matregtest`) are the
same as the advantages of, for instance, AIC-based model selection among
linear models over stepwise regression methods. Multiple
models can be straightforwardly compared, and models need not be nested
to make comparisons between them.
Information criteria such as the AIC and BIC cannot be used to compare matrix
models because these techniques are based on likelihood, and matrix
regression uses a resampling framework instead of likelihood. The
`mms` package implements alternatives based on leave-n-out cross
validation and resampling. A likelihood-based approach to matrix models
was developed for applications in genetics [@yang_2004; @peterman_2014],
but that approach cannot be used for applications for which the
response matrix $R$ is a positive semi-definite matrix, since the approach
does not preserve the positive semi-definiteness property. For instance,
applications in which $R$ is a covariance or correlation matrix
[@bellamy_2003; @powney_2012; @haynes_2013; @anderson_2017] are in this
category. We recommend the use of our methods when positive semi-definiteness
must be preserved, and the likelihood-based methods otherwise.
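
As a brief aside illustrating the positive semi-definiteness property mentioned above: a correlation matrix always has non-negative eigenvalues (up to numerical rounding), which can be checked directly. The data here are random numbers used only for illustration:

```{r}
set.seed(2)
R<-cor(matrix(rnorm(30*10),30,10))  # a 10 x 10 correlation matrix
min(eigen(R,symmetric=TRUE,only.values=TRUE)$values)  # non-negative, up to rounding error
```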
<!--
kyldat<-read.csv(file="../data/kylakedata.csv",header=T,stringsAsFactors = F)
save(kyldat,file="../data/kyldat.RData")
kyldisp<-read.csv(file="../data/kylakedispersal.csv",header=T,stringsAsFactors = F,row.names=1)
kyldisp2<-as.matrix(kyldisp)
rownames(kyldisp2)<-rownames(kyldisp)
colnames(kyldisp2)<-rownames(kyldisp)
kyldisp<-kyldisp2
save(kyldisp,file="../data/kyldisp.RData")
-->
Data are stored in Dryad [@anderson_2017_dat] and were analyzed recently by @anderson_2017 to compare how dispersal and environmental drivers impact synchrony in freshwater plankton.
Data and sampling methods were described in detail elsewhere [@bukaveckas_2002; @yurista_2004; @white_2007; @levine_2014; @anderson_2017].
Data are annual averages for 1990-2015 of measurements
of various plankton and environmental quantities at 16 locations in Kentucky Lake.
Numerous variables were measured and are elaborated on in the package help file for the dataset
(`?kyldat`). We analyze chlorophyll-a densities, water temperatures, total N concentrations,
and total P concentrations. For each of these variables, data comprise a 26 (number of years, 1990 to 2015)
by 16 (number of sampling locations) matrix. No data were missing.
First make sure the time series data are loaded into memory:
```{r}
library(mms)
data(kyldat)
set.seed(101)
```
Then these data are processed to get $16 \times 16$ matrices of Spearman correlation coefficients between time series in different locations. Here we compute these matrices for chlorophyll density and total P concentration; analogous lines for water temperature and total N concentration appear in the code but are commented out:
```{r}
makedatmat<-function(dat,var)
{
  dat<-dat[,c("Station","Year",var)]
  stations<-sort(unique(dat$Station))
  dat<-cbind(dat,StationNID=sapply(X=dat$Station,FUN=function(x){which(x == stations)}))
  mY<-min(dat$Year)
  MY<-max(dat$Year)
  res<-matrix(NA,MY-mY+1,length(stations))
  for (counter in 1:dim(dat)[1])
  {
    res[dat$Year[counter]-mY+1,dat$StationNID[counter]]<-dat[counter,var]
  }
  return(res)
}
datmats<-list(chlo=makedatmat(kyldat,"Chlorophyll"),
              # temp=makedatmat(kyldat,"Temperature"),
              # totN=makedatmat(kyldat,"TotalN"),
              totP=makedatmat(kyldat,"TotalP"))
cormats<-lapply(X=datmats,FUN=function(x){cor(x,method="spearman")})
```
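
As a quick sanity check (not part of the original analysis), one can verify that each correlation matrix is $16 \times 16$ and symmetric:

```{r}
sapply(cormats,dim)         # each matrix should be 16 x 16
sapply(cormats,isSymmetric) # each should be TRUE
```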
Then a matrix containing geographic distances between the sampling locations is calculated:
```{r}
sll<-unique(kyldat[,c("Station","LAT_DD","LONG_DD")])
ostat<-order(sll$Station)
sll<-sll[ostat,]
numlocs<-dim(sll)[1]
dist<-ncf::gcdist(sll$LONG_DD,sll$LAT_DD)
```
Then load the estimated dispersal difficulty matrix and reorder its rows and columns appropriately:
```{r}
data(kyldisp)
disp<-kyldisp[ostat,ostat]
```
Now do some basic matrix regression tests (no model selection yet), using `matregtest`:
```{r}
mats<-c(cormats,list(dist=dist,disp=disp))

#test significance of totP
res_d1<-matregtest(mats=mats,pred=2:4,drop=2,numperm=1000)
res_d1$p

#test the significance of dist
res_d2<-matregtest(mats=mats,pred=2:4,drop=3,numperm=1000)
res_d2$p

#test the significance of disp
res_d3<-matregtest(mats=mats,pred=2:4,drop=4,numperm=1000)
res_d3$p
```
The results suggest both total P and the dispersal difficulty matrices are necessary to explain the synchrony matrix for chlorophyll.
Now use model selection to work out relative importances. A typical
use of the model selection tools in the package starts with `mmsmodwts`:
```{r, cache=T}
nrand<-25
n<-2
maxruns<-50
res<-mmsmodwts(mats=mats,model.names=NA,nrand=nrand,n=n,maxruns=maxruns,progress=F)
res
```
The second column shows the number of resamplings (of the `r nrand` performed) on which each model
was ranked the highest. The fraction of resamplings for which a model is ranked highest is
a number between $0$ and $1$ and can be interpreted
as a model importance weight, similar to an AIC weight [@walter_2017]. The fraction
is interpretable as the degree of support of the data for each model. The third column
shows the number of possible leave-n-outs that could have been performed across all resamplings
(for `r dim(mats[[1]])[1]` locations and using $n=$ `r n`, there are `r dim(mats[[1]])[1]` choose `r n`
$=$ `r choose(dim(mats[[1]])[1],n)` possible leave-n-outs per resampling, times `nrand` $=$ `r nrand` resamplings,
which gives `r choose(dim(mats[[1]])[1],n)*nrand`). The fourth column gives the number of leave-n-outs
that were attempted; this is less than the previous column because `maxruns` was `r maxruns`, which is less than
`r dim(mats[[1]])[1]` choose `r n`. Regressions with rank deficiency problems do not yield well-determined
regression coefficients and cannot be used for out-of-sample predictions. The column
`num.rnk` is the number of leave-n-outs attempted for which rank deficiency problems did not occur.
The final column is the number of leave-n-outs that succeeded, providing an estimate of out-of-sample forecast
accuracy. This can be less than `num.rnk` due to technicalities on how resampling and leave-n-out
interact in the algorithm when the same site is selected more than once in the resampling.
Values should decrease or stay the same as one progresses from column 3 to 4 to 5 to 6. The final column
should still have large numbers for results to be reliable. Note that in this example we used
small values of `nrand` and `maxruns` to keep overall vignette run times down, but we recommend using
substantially larger values for final analyses.
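
For concreteness, the counts referred to above can be re-derived directly from the quantities already defined in the chunk that called `mmsmodwts`; the helper variable `P` below is ours, introduced only for this illustration:

```{r}
P<-dim(mats[[1]])[1]  # number of sampling locations, 16 here
choose(P,n)           # possible leave-n-outs per resampling: choose(16,2) = 120
choose(P,n)*nrand     # possible leave-n-outs across all resamplings: 120*25 = 3000
maxruns*nrand         # at most maxruns leave-n-outs attempted per resampling: 50*25 = 1250
```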
Model weights are summed across all models that contain a given predictor to get
predictor importance weights, using `mmsvarwts`:
```{r}
mmsvarwts(pred=2:4,weights=res,prednames=names(mats)[2:4])
```
Results show the most important variable is `disp`, with `totP` and `dist` less important.
This is consistent with the results of @anderson_2017.
The functions `mmsrank` and `mmsscore` are called internally by any call to `mmsmodwts`, but they
can also be called directly to yield rankings of models on their out-of-sample forecasting scores and
the forecasting scores themselves, respectively.
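
For instance, continuing the example above, direct calls might look like the following sketch. The arguments mirror calls already shown in this vignette; the object names `rnk` and `sc` are introduced here only for illustration:

```{r}
# rank all candidate models for predicting the chlorophyll synchrony matrix
rnk<-mmsrank(mats=mats,model.names=NA,n=n,maxruns=maxruns,rank.mod=T)
rnk
# score a single model, here with disp as the sole predictor of chlorophyll synchrony
sc<-mmsscore(mats=list(mats$chlo,mats$disp),pred=2,n=n,maxruns=maxruns)
sc
```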
The methods of the `mms` package are described in detail elsewhere
[@walter_2017, their Supporting Information Appendix S4].
@walter_2017 also released, as their Supporting Information Appendix S10,
a precursor version of some of the code in the `mms` package. The main
matrix model selection functions of @walter_2017, called `lno.score`,
`lno.ranking`, and `lno.weights`, are replaced by the functions
`mmsscore`, `mmsrank`, and `mmsmodwts` in the `mms` package.

As far as we are aware, there is nothing wrong with the code in
`lno.score`, `lno.ranking`, and `lno.weights`, but the `mms` functions
are better documented, are published together with their unit tests,
and are illustrated below in applications.
@walter_2017 also released the ancillary, bookkeeping function
`sum.var.weights`, which we now know has a minor bug that will affect some
applications. It is replaced by `mmsvarwts` in the `mms` package.
# Simulation study demonstrating the effectiveness of the `mms` methods\label{sect:simstudy}

The effectiveness of matrix model selection can be demonstrated by applying `mms`
to data generated by a theoretical model designed to simulate geographies of synchrony arising from one or more mechanisms [@walter_2017]. If the mechanisms built into the model can be successfully inferred from simulation
output using matrix model selection, then the matrix model selection methods have been successful.
Here, we describe a theoretical model and we use output from the model to test matrix model selection methods
in this way.
The model was originally described by @walter_2017. We present a single one of the several tests described there.
The model is an extension of a vector autoregressive moving-average (VARMA) model that generates time series of abundance that may be correlated across multiple locations. The model simulates populations at locations $i = 1,\ldots,P$, with populations linked by dispersal and subject to spatially correlated environmental fluctuations (Moran effects). Fluctuations about the within-patch carrying capacity are denoted $w_{i}(t)$ for time $t$. The $\epsilon_{i}^{(j)}(t)$ are environmental conditions for potential Moran drivers. Dispersal is implemented via a $P$ by $P$ connectivity matrix, $D$. Population density dependence is controlled by the autoregressive coefficients $m_{il}$, while the influence of environmental variables is given by the moving average coefficients $q_{il}^{(j)}$. The model is:
\begin{equation}
\left( \begin{array}{c} w_1(t) \\ \vdots \\ w_P(t) \end{array} \right) \approx D \left( \begin{array}{c} \sum_{l=1}^{a}m_{1l}w_{1}(t-l)+\sum_{j=1}^{b}\sum_{l=0}^{c}q_{1l}^{(j)}\epsilon_{1}^{(j)}(t-l) \\ \vdots \\ \sum_{l=1}^{a}m_{Pl}w_{P}(t-l)+\sum_{j=1}^{b}\sum_{l=0}^{c}q_{Pl}^{(j)}\epsilon_{P}^{(j)}(t-l) \end{array} \right)
\end{equation}
Here we consider a scenario in which spatial structure in population dynamics (i.e., geography of synchrony) arises from spatial structure in an environmental driver. The data produced are identical to those used to present "mechanism A" in @walter_2017. We simulated population dynamics in $P=16$ locations divided between two disjoint eight-location sets, $S_1$ and $S_2$. $S_1$ was $\{1,\ldots,8\}$ and $S_2$ was $\{9,\ldots,16\}$. All locations had identical 2nd-order density dependence with $m_{i1} \approx 0.375$ and $m_{i2} \approx -0.368$ ($a=2$). These parameters produce periodic oscillations with a dominant period length of $\approx 5$ time steps. We simulated three environmental drivers: an "operating" driver that strongly influences population dynamics (i.e., $q_{i1}^{(1)}=1$, $q_{il}^{(1)}=0$ for $l>1$) and is spatially structured; a "latent" driver that is also spatially structured but does not influence population dynamics (i.e., $q_{il}^{(2)}=0$ for all $l$); and a third driver representing local variability that is spatially unstructured (uncorrelated in different locations) and has a small influence on population dynamics (i.e., $q_{i1}^{(3)}=0.1$, $q_{il}^{(3)}=0$ for $l>1$). Spatial structure in the environmental drivers, which were independent and identically distributed through time, was controlled by the covariance matrices $\Omega_{j}$ for $j=1,2$, which were block matrices having off-diagonal entries $0.6$ within $S_1$ or $S_2$ and $0.3$ between groups; $\Omega_{3}$ was diagonal. Organisms dispersed among populations such that 5% of each population dispersed evenly among all other populations (this specifies $D$). Further details, including an exact mathematical specification of the model and its parameters, were given by @walter_2017. The model was run for 150 time steps with a 100 time step burn-in period.
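
The exact simulation code is not reproduced here, but the following minimal sketch, written solely for illustration using the parameter values quoted above, shows the general structure of such a simulation. The object names and some simplifications (e.g., unit marginal variances for the drivers) are our own assumptions and are not taken from @walter_2017; see that paper for the exact specification used to produce `simdat`:

```{r}
set.seed(1)
P<-16; burn<-100; Tlen<-150; Ttot<-burn+Tlen
S1<-1:8; S2<-9:16

# dispersal: 5% of each population spread evenly among the other P-1 locations
D<-matrix(0.05/(P-1),P,P); diag(D)<-0.95

# block covariance for the two spatially structured drivers; the third driver
# (local variability) is spatially uncorrelated
Omega<-matrix(0.3,P,P)
Omega[S1,S1]<-0.6; Omega[S2,S2]<-0.6; diag(Omega)<-1
U<-chol(Omega)
eps1<-matrix(rnorm(Ttot*P),Ttot,P)%*%U  # operating driver
eps2<-matrix(rnorm(Ttot*P),Ttot,P)%*%U  # latent driver (no effect on populations)
eps3<-matrix(rnorm(Ttot*P),Ttot,P)      # local variability

# second-order autoregression with Moran effects, followed by dispersal
m1<-0.375; m2<- -0.368; q1<-1; q3<-0.1
w<-matrix(0,Ttot,P)
for (t in 3:Ttot)
{
  w[t,]<-as.vector(D%*%(m1*w[t-1,]+m2*w[t-2,]+q1*eps1[t,]+q3*eps3[t,]))
}
w<-w[(burn+1):Ttot,]  # discard the burn-in period
```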
We use simulated data to define a response matrix and multiple predictors, including a true predictor representing the operating mechanism in the simulation described above, and two spurious predictors representing latent (non-operating) possible mechanisms. The data are available in the package using:
```{r}
data(simdat)
attach(simdat)
```
The object `simdat` is a list containing four matrices. The first three are $P$ by $t$ matrices giving, respectively, abundances $w_{i}(t)$ and environmental fluctuations $\epsilon_{i}^{(j)}(t)$ for the operating ($j=1$) and latent ($j=2$) drivers. The fourth is the $P$ by $P$ dispersal matrix, $D$. We then compute synchrony matrices using Pearson correlations:
```{r}
popsynch<-cor(t(simdat$pop), method="pearson")
driversynch<-cor(t(simdat$driver), method="pearson")
latentsynch<-cor(t(simdat$latent), method="pearson")
```
The lnocv (leave-n-out cross validation) score for an individual model is obtained using `mmsscore`, e.g., for a model in which dispersal is the sole predictor of population synchrony:
```{r}
modmats<-list(popsynch=popsynch, driversynch=driversynch,
              latentsynch=latentsynch, dispersal=simdat$dispersal)
modscore<-mmsscore(mats=list(modmats$popsynch, modmats$dispersal), pred=2, n=2, maxruns=500)
print(modscore)
```
Note that in this case, and in those following, `maxruns` (and later, `nrand`) is set relatively low so that these examples can be run quickly. We recommend increasing these for more accurate results.
A selection of models can be ranked using:
```{r}
modranks<-mmsrank(mats=modmats, model.names=NA, n=2, maxruns=200, rank.mod=T)
print(modranks)
```
Here, we use `model.names=NA` (the default) to test all possible combinations of predictors.
The first matrix in `mats` is assumed to be the response variable.
Recall that in our example there is one operating environmental driver producing geography of synchrony.
The model reflecting this receives the highest rank.
Model weights, corresponding to the frequency across resampled datasets with which a model is the best performing in the candidate set, are obtained using:
```{r, cache=T, cache.extra=list(modmats)}
modweights<-mmsmodwts(mats=modmats, n=2, maxruns=200, nrand=50, progress=F)
print(modweights)
```
As for rankings, the correct model receives the highest weight. The model weights can be used further to generate variable importance weights:
```{r}
varimp<-mmsvarwts(pred=2:4, weights=modweights, prednames=names(modmats)[-1])
print(varimp)
```
The algorithm correctly identifies the operating environmental driver as the most important predictor.
This material is based upon work supported by the National Science Foundation under Grant Numbers 17114195 and 1442595. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.