The `mms` package provides tools to do model selection among matrix models via a
leave-n-out cross validation and resampling approach. Section \ref{sect:intromm}
provides a statement of what matrix modelling is, with references to
background materials. Section \ref{sect:whymodselect} states why model
selection for matrix models is needed. Section \ref{sect:exdata} describes an example
dataset that is used in section \ref{sect:demoondata} to illustrate
how `mms` package functions are used. Section \ref{sect:methoddetails}
provides a connection to a detailed description, provided elsewhere,
of the model selection methods implemented in the `mms` package.
Section \ref{sect:simstudy} provides a brief simulation study
demonstrating the effectiveness of the methods.
If $i=1,\ldots,P$ are sampling locations or other identifiers and the $P \times P$
matrices $R$ and $M_1,\ldots,M_n$ quantify different measures of pairwise similarities
or differences between these identifiers, then matrix regression extends the
widely used Mantel test by assessing the relationship between the response
variable $R$ and the multiple predictor matrices $M_i$, while properly
accounting for the non-independence that is intrinsic to pairwise
comparison matrices [@legendre_1994; @lichstein_2007; @walter_2017].
For instance, if time series of populations were measured over the same
times in locations $1,\ldots,P$ then the $ij^{\text{th}}$ entry of $R$
could be a measure (e.g., correlation) of the synchrony through time
between the population time series
of locations $i$ and $j$. The $ij^{\text{th}}$ entry of $M_1$ could contain
geographic distances
between the locations $i$ and $j$. And the $ij^{\text{th}}$ entries of
$M_l$ for $l=2,\ldots,n$ could contain correlations of time series of
different environmental variables (one variable for each $l=2,\ldots,n$)
measured for the same locations and times as the populations. With matrix regression,
one can perform tests of the
statistical importance of any of the $M_i$ ($i=1,\ldots,n$) as a predictor of $R$
while controlling for the other predictors. The `matregtest` function in the
`mms` package implements such tests. The references cited above
give background on matrix modelling, and the `ecodist` package provides
additional code supporting matrix regressions. Matrix regression
is increasingly used in ecology and other
fields [@legendre_1998; @bellamy_2003; @powney_2012; @haynes_2013; @anderson_2017].
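
To make the setup concrete, here is a minimal, self-contained sketch of how one might assemble $R$, $M_1$, and $M_2$ and then test a predictor with `matregtest`. All data objects below (`pops`, `env`, the coordinates) are randomly generated purely for illustration and do not come from any real dataset:

```{r}
library(mms)
set.seed(1)
P<-10; Tlen<-30
pops<-matrix(rnorm(Tlen*P),Tlen,P)  # fake population time series, one column per location
env<-matrix(rnorm(Tlen*P),Tlen,P)   # a fake environmental variable at the same locations
R<-cor(pops)                        # response: pairwise synchrony of the populations
M1<-as.matrix(dist(cbind(runif(P),runif(P)))) # predictor: pairwise geographic distances
M2<-cor(env)                        # predictor: pairwise synchrony of the environment
# does distance (M1) matter as a predictor of R, controlling for the environmental predictor (M2)?
res<-matregtest(mats=list(R,M1,M2),pred=2:3,drop=2,numperm=100)
res$p
```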
Model selection via the Akaike or Bayesian Information Criteria (AIC, BIC)
is now very commonly used to identify the relative support of data for
alternative models, which commonly represent alternative biological mechanisms
that may explain an observed pattern [@burnham_2002; @clark_2006].
The practical advantages of model selection for matrix models over standard
tests comparing a matrix regression model (e.g., with response $R$ and predictors
$M_1,\ldots,M_n$, using notation from section \ref{sect:intromm}) to a
simpler/nested model (e.g., with response $R$ and predictors some subset
of $M_1,\ldots,M_n$; such tests are implemented by `matregtest`) are the
same as the advantages of, for instance, AIC-based model selection among
linear models over stepwise regression methods. Multiple
models can be straightforwardly compared, and models need not be nested
to make comparisons between them.
Information criteria such as the AIC and BIC cannot be used to compare matrix
models because these techniques are based on likelihood, and matrix
regression uses a resampling framework instead of likelihood. The
`mms` package implements alternatives based on leave-n-out cross
validation and resampling. A likelihood-based approach to matrix models
was developed for applications in genetics [@yang_2004; @peterman_2014],
but that approach cannot be used for applications for which the
response matrix $R$ is a positive semi-definite matrix, since the approach
does not preserve the positive semi-definiteness property. For instance,
applications in which $R$ is a covariance or correlation matrix
[@bellamy_2003; @powney_2012; @haynes_2013; @anderson_2017] are in this
category. We recommend the use of our methods when positive semi-definiteness
must be preserved, and the likelihood-based methods otherwise.
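
As a brief aside illustrating the positive semi-definiteness property mentioned above: a correlation matrix always has non-negative eigenvalues (up to numerical rounding), which can be checked directly. The data here are random numbers used only for illustration:

```{r}
set.seed(2)
R<-cor(matrix(rnorm(30*10),30,10))  # a 10 x 10 correlation matrix
min(eigen(R,symmetric=TRUE,only.values=TRUE)$values)  # non-negative, up to rounding error
```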
<!--
kyldat<-read.csv(file="../data/kylakedata.csv",header=T,stringsAsFactors = F)
save(kyldat,file="../data/kyldat.RData")
kyldisp<-read.csv(file="../data/kylakedispersal.csv",header=T,stringsAsFactors = F,row.names=1)
kyldisp2<-as.matrix(kyldisp)
rownames(kyldisp2)<-rownames(kyldisp)
colnames(kyldisp2)<-rownames(kyldisp)
kyldisp<-kyldisp2
save(kyldisp,file="../data/kyldisp.RData")
-->
Data are stored in Dryad [@anderson_2017_dat] and were analyzed recently by @anderson_2017 to compare how dispersal and environmental drivers impact synchrony in freshwater plankton.
Data and sampling methods were described in detail elsewhere [@bukaveckas_2002; @yurista_2004; @white_2007; @levine_2014; @anderson_2017].
Data are annual averages for 1990-2015 of measurements
of various plankton and environmental quantities at 16 locations in Kentucky Lake.
Numerous variables were measured and are elaborated on in the package help file for the dataset
(`?kyldat`). We analyze chlorophyll-a densities, water temperatures, total N concentrations,
and total P concentrations. For each of these variables, data comprise a 26 (number of years, 1990 to 2015)
by 16 (number of sampling locations) matrix. No data were missing.
First make sure the time series data are loaded into memory:
```{r}
library(mms)
data(kyldat)
set.seed(101)
```
Then these data are processed to get $16 \times 16$ matrices of Spearman correlation coefficients between time series in different locations. Here we compute these matrices for chlorophyll density and total P concentration; analogous lines for water temperature and total N concentration appear in the code but are commented out:
```{r}
makedatmat<-function(dat,var)
{
  dat<-dat[,c("Station","Year",var)]
  stations<-sort(unique(dat$Station))
  dat<-cbind(dat,StationNID=sapply(X=dat$Station,FUN=function(x){which(x == stations)}))
  mY<-min(dat$Year)
  MY<-max(dat$Year)
  res<-matrix(NA,MY-mY+1,length(stations))
  for (counter in 1:dim(dat)[1])
  {
    res[dat$Year[counter]-mY+1,dat$StationNID[counter]]<-dat[counter,var]
  }
  return(res)
}
datmats<-list(chlo=makedatmat(kyldat,"Chlorophyll"),
              # temp=makedatmat(kyldat,"Temperature"),
              # totN=makedatmat(kyldat,"TotalN"),
              totP=makedatmat(kyldat,"TotalP"))
cormats<-lapply(X=datmats,FUN=function(x){cor(x,method="spearman")})
```
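
As a quick sanity check (not part of the original analysis), one can verify that each correlation matrix is $16 \times 16$ and symmetric:

```{r}
sapply(cormats,dim)         # each matrix should be 16 x 16
sapply(cormats,isSymmetric) # each should be TRUE
```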
Then a matrix containing geographic distances between the sampling locations is calculated:
```{r}
sll<-unique(kyldat[,c("Station","LAT_DD","LONG_DD")])
ostat<-order(sll$Station)
sll<-sll[ostat,]
numlocs<-dim(sll)[1]
dist<-ncf::gcdist(sll$LONG_DD,sll$LAT_DD)
```
Then load the estimated dispersal difficulty matrix and reorder its rows and columns appropriately:
```{r}
data(kyldisp)
disp<-kyldisp[ostat,ostat]
```
Now do some basic matrix regression tests (no model selection yet), using `matregtest`:
```{r}
mats<-c(cormats,list(dist=dist,disp=disp))

#test significance of totP
res_d1<-matregtest(mats=mats,pred=2:4,drop=2,numperm=1000)
res_d1$p

#test the significance of dist
res_d2<-matregtest(mats=mats,pred=2:4,drop=3,numperm=1000)
res_d2$p

#test the significance of disp
res_d3<-matregtest(mats=mats,pred=2:4,drop=4,numperm=1000)
res_d3$p
```
The results suggest both total P and the dispersal difficulty matrices are necessary to explain the synchrony matrix for chlorophyll.
Now use model selection to work out relative importances. A typical
use of the model selection tools in the package starts with `mmsmodwts`:
```{r, cache=T}
nrand<-25
n<-2
maxruns<-50
res<-mmsmodwts(mats=mats,model.names=NA,nrand=nrand,n=n,maxruns=maxruns,progress=F)
res
```
The second column shows the number of resamplings (of the `r nrand` performed) on which each model
was ranked the highest. The fraction of resamplings for which a model is ranked highest is
a number between $0$ and $1$ and can be interpreted
as a model importance weight, similar to an AIC weight [@walter_2017]. The fraction
is interpretable as the degree of support of the data for each model. The third column
shows the number of possible leave-n-outs that could have been performed across all resamplings
(for `r dim(mats[[1]])[1]` locations and using $n=$ `r n`, there are `r dim(mats[[1]])[1]` choose `r n`
$=$ `r choose(dim(mats[[1]])[1],n)` possible leave-n-outs per resampling, times `nrand` $=$ `r nrand` resamplings,
which gives `r choose(dim(mats[[1]])[1],n)*nrand`). The fourth column gives the number of leave-n-outs
that were attempted; this is less than the previous column because `maxruns` was `r maxruns`, which is less than
`r dim(mats[[1]])[1]` choose `r n`. Regressions with rank deficiency problems do not yield well-determined
regression coefficients and cannot be used for out-of-sample predictions. The column
`num.rnk` is the number of leave-n-outs attempted for which rank deficiency problems did not occur.
The final column is the number of leave-n-outs that succeeded, providing an estimate of out-of-sample forecast
accuracy. This can be less than `num.rnk` due to technicalities on how resampling and leave-n-out
interact in the algorithm when the same site is selected more than once in the resampling.
Values should decrease or stay the same as one progresses from column 3 to 4 to 5 to 6. The final column
should still have large numbers for results to be reliable. Note that in this example we used
small values of `nrand` and `maxruns` to keep overall vignette run times down, but we recommend using
substantially larger values for final analyses.
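
For concreteness, the counts referred to above can be re-derived directly from the quantities already defined in the chunk that called `mmsmodwts`; the helper variable `P` below is ours, introduced only for this illustration:

```{r}
P<-dim(mats[[1]])[1]  # number of sampling locations, 16 here
choose(P,n)           # possible leave-n-outs per resampling: choose(16,2) = 120
choose(P,n)*nrand     # possible leave-n-outs across all resamplings: 120*25 = 3000
maxruns*nrand         # at most maxruns leave-n-outs attempted per resampling: 50*25 = 1250
```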
Model weights are summed across all models that contain a given predictor to get
predictor importance weights, using `mmsvarwts`:
```{r}
mmsvarwts(pred=2:4,weights=res,prednames=names(mats)[2:4])
```
Results show the most important variable is `disp`, with `totP` and `dist` less important.
This is consistent with the results of @anderson_2017.
The functions `mmsrank` and `mmsscore` are called internally by any call to `mmsmodwts`, but they
can also be called directly to yield rankings of models on their out-of-sample forecasting scores and
the forecasting scores themselves, respectively.
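
For instance, continuing the example above, direct calls might look like the following sketch. The arguments mirror calls already shown in this vignette; the object names `rnk` and `sc` are introduced here only for illustration:

```{r}
# rank all candidate models for predicting the chlorophyll synchrony matrix
rnk<-mmsrank(mats=mats,model.names=NA,n=n,maxruns=maxruns,rank.mod=T)
rnk
# score a single model, here with disp as the sole predictor of chlorophyll synchrony
sc<-mmsscore(mats=list(mats$chlo,mats$disp),pred=2,n=n,maxruns=maxruns)
sc
```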
The methods of the `mms` package are described in detail elsewhere
[@walter_2017, their Supporting Information Appendix S4].
@walter_2017 also released, as their Supporting Information Appendix S10,
a precursor version of some of the code in the `mms` package. The main
matrix model selection functions of @walter_2017, called `lno.score`,
`lno.ranking`, and `lno.weights`, are replaced by the functions
`mmsscore`, `mmsrank`, and `mmsmodwts` in the `mms` package.

As far as we are aware, there is nothing wrong with the code in
`lno.score`, `lno.ranking`, and `lno.weights`, but the `mms` functions
are better documented, are published together with their unit tests,
and are illustrated below in applications.
@walter_2017 also released the ancillary, bookkeeping function
`sum.var.weights`, which we now know has a minor bug that will affect some
applications. It is replaced by `mmsvarwts` in the `mms` package.
# Simulation study demonstrating the effectiveness of the `mms` methods\label{sect:simstudy}

The effectiveness of matrix model selection can be demonstrated by applying `mms`
to data generated by a theoretical model designed to simulate geographies of synchrony arising from one or more mechanisms [@walter_2017]. If the mechanisms built into the model can be successfully inferred from simulation
output using matrix model selection, then the matrix model selection methods have been successful.
Here, we describe a theoretical model and we use output from the model to test matrix model selection methods
in this way.
The model was originally described by @walter_2017. We present a single one of the several tests described there.
The model is an extension of a vector autoregressive moving-average (VARMA) model that generates time series of abundance that may be correlated across multiple locations. The model simulates populations at locations $i = 1,\ldots,P$, with populations linked by dispersal and subject to spatially correlated environmental fluctuations (Moran effects). Fluctuations about the within-patch carrying capacity are denoted $w_{i}(t)$ for time $t$. The $\epsilon_{i}^{(j)}(t)$ are environmental conditions for potential Moran drivers. Dispersal is implemented via a $P$ by $P$ connectivity matrix, $D$. Population density dependence is controlled by the autoregressive coefficients $m_{il}$, while the influence of environmental variables is given by the moving average coefficients $q_{il}^{(j)}$. The model is:
\begin{equation}
\left( \begin{array}{c} w_1(t) \\ \vdots \\ w_P(t) \end{array} \right) \approx D \left( \begin{array}{c} \sum_{l=1}^{a}m_{1l}w_{1}(t-l)+\sum_{j=1}^{b}\sum_{l=0}^{c}q_{1l}^{(j)}\epsilon_{1}^{(j)}(t-l) \\ \vdots \\ \sum_{l=1}^{a}m_{Pl}w_{P}(t-l)+\sum_{j=1}^{b}\sum_{l=0}^{c}q_{Pl}^{(j)}\epsilon_{P}^{(j)}(t-l) \end{array} \right)
\end{equation}
Here we consider a scenario in which spatial structure in population dynamics (i.e., geography of synchrony) arises from spatial structure in an environmental driver. The data produced are identical to those used to present "mechanism A" in @walter_2017. We simulated population dynamics in $P=16$ locations divided between two disjoint eight-location sets, $S_1$ and $S_2$. $S_1$ was $\{1,\ldots,8\}$ and $S_2$ was $\{9,\ldots,16\}$. All locations had identical 2nd-order density dependence with $m_{i1} \approx 0.375$ and $m_{i2} \approx -0.368$ ($a=2$). These parameters produce periodic oscillations with a dominant period length of $\approx 5$ time steps. We simulated three environmental drivers: an "operating" driver that strongly influences population dynamics (i.e., $q_{i1}^{(1)}=1$, $q_{il}^{(1)}=0$ for $l>1$) and is spatially structured; a "latent" driver that is also spatially structured but does not influence population dynamics (i.e., $q_{il}^{(2)}=0$ for all $l$); and a third driver representing local variability that is spatially unstructured (uncorrelated in different locations) and has a small influence on population dynamics (i.e., $q_{i1}^{(3)}=0.1$, $q_{il}^{(3)}=0$ for $l>1$). Spatial structure in the environmental drivers, which were independent and identically distributed through time, was controlled by the covariance matrices $\Omega_{j}$ for $j=1,2$, which were block matrices having off-diagonal entries $0.6$ within $S_1$ or $S_2$ and $0.3$ between groups; $\Omega_{3}$ was diagonal. Organisms dispersed among populations such that 5% of each population dispersed evenly among all other populations (this specifies $D$). Further details, including an exact mathematical specification of the model and its parameters, were given by @walter_2017. The model was run for 150 time steps with a 100 time step burn-in period.
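
The exact simulation code is not reproduced here, but the following minimal sketch, written solely for illustration using the parameter values quoted above, shows the general structure of such a simulation. The object names and some simplifications (e.g., unit marginal variances for the drivers) are our own assumptions and are not taken from @walter_2017; see that paper for the exact specification used to produce `simdat`:

```{r}
set.seed(1)
P<-16; burn<-100; Tlen<-150; Ttot<-burn+Tlen
S1<-1:8; S2<-9:16

# dispersal: 5% of each population spread evenly among the other P-1 locations
D<-matrix(0.05/(P-1),P,P); diag(D)<-0.95

# block covariance for the two spatially structured drivers; the third driver
# (local variability) is spatially uncorrelated
Omega<-matrix(0.3,P,P)
Omega[S1,S1]<-0.6; Omega[S2,S2]<-0.6; diag(Omega)<-1
U<-chol(Omega)
eps1<-matrix(rnorm(Ttot*P),Ttot,P)%*%U  # operating driver
eps2<-matrix(rnorm(Ttot*P),Ttot,P)%*%U  # latent driver (no effect on populations)
eps3<-matrix(rnorm(Ttot*P),Ttot,P)      # local variability

# second-order autoregression with Moran effects, followed by dispersal
m1<-0.375; m2<- -0.368; q1<-1; q3<-0.1
w<-matrix(0,Ttot,P)
for (t in 3:Ttot)
{
  w[t,]<-as.vector(D%*%(m1*w[t-1,]+m2*w[t-2,]+q1*eps1[t,]+q3*eps3[t,]))
}
w<-w[(burn+1):Ttot,]  # discard the burn-in period
```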
We use simulated data to define a response matrix and multiple predictors, including a true predictor representing the operating mechanism in the simulation described above, and two spurious predictors representing latent (non-operating) possible mechanisms. The data are available in the package using:
```{r}
data(simdat)
attach(simdat)
```
The object `simdat` is a list containing four matrices. The first three are $P$ by $t$ matrices giving, respectively, abundances $w_{i}(t)$ and environmental fluctuations $\epsilon_{i}^{(j)}(t)$ for the operating ($j=1$) and latent ($j=2$) drivers. The fourth is the $P$ by $P$ dispersal matrix, $D$. We then compute synchrony matrices using Pearson correlations:
```{r}
popsynch<-cor(t(simdat$pop), method="pearson")
driversynch<-cor(t(simdat$driver), method="pearson")
latentsynch<-cor(t(simdat$latent), method="pearson")
```
The lnocv (leave-n-out cross validation) score for an individual model is obtained using `mmsscore`, e.g., for a model in which dispersal is the sole predictor of population synchrony:
```{r}
modmats<-list(popsynch=popsynch, driversynch=driversynch,
              latentsynch=latentsynch, dispersal=simdat$dispersal)
modscore<-mmsscore(mats=list(modmats$popsynch, modmats$dispersal), pred=2, n=2, maxruns=500)
print(modscore)
```
Note that in this case, and in those following, `maxruns` (and later, `nrand`) is set relatively low so that these examples can be run quickly. We recommend increasing these for more accurate results.
A selection of models can be ranked using:
```{r}
modranks<-mmsrank(mats=modmats, model.names=NA, n=2, maxruns=200, rank.mod=T)
print(modranks)
```
Here, we use `model.names=NA` (the default) to test all possible combinations of predictors.
The first matrix in `mats` is assumed to be the response variable.
Recall that in our example there is one operating environmental driver producing geography of synchrony.
The model reflecting this receives the highest rank.
Model weights, corresponding to the frequency across resampled datasets with which a model is the best performing in the candidate set, are obtained using:
```{r, cache=T, cache.extra=list(modmats)}
modweights<-mmsmodwts(mats=modmats, n=2, maxruns=200, nrand=50, progress=F)
print(modweights)
```
As for rankings, the correct model receives the highest weight. The model weights can be used further to generate variable importance weights:
```{r}
varimp<-mmsvarwts(pred=2:4, weights=modweights, prednames=names(modmats)[-1])
print(varimp)
```
The algorithm correctly identifies the operating environmental driver as the most important predictor.
This material is based upon work supported by the National Science Foundation under Grant Numbers 17114195 and 1442595. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.