mat: Modern Analogue Technique transfer function models
In gavinsimpson/analogue: Analogue and Weighted Averaging Methods for Palaeoecology

mat	R Documentation

Modern Analogue Technique transfer function models

Description

Modern Analogue Technique (MAT) transfer function models for palaeoecology. The fitted values are the, possibly weighted, averages of the environmental variable over the k-closest modern analogues. MAT is a k-nearest neighbours (k-NN) method.

Usage

mat(x, ...)

## Default S3 method:
mat(x, y,
    method = c("euclidean", "SQeuclidean", "chord", "SQchord",
               "bray", "chi.square", "SQchi.square",
               "information", "chi.distance", "manhattan",
               "kendall", "gower", "alt.gower", "mixed"),
    kmax, ...)

## S3 method for class 'formula'
mat(formula, data, subset, na.action,
    method = c("euclidean", "SQeuclidean", "chord", "SQchord",
               "bray", "chi.square", "SQchi.square",
               "information", "chi.distance", "manhattan",
               "kendall", "gower", "alt.gower", "mixed"),
    model = FALSE, ...)

## S3 method for class 'mat'
fitted(object, k, weighted = FALSE, ...)

## S3 method for class 'mat'
residuals(object, k, weighted = FALSE, ...)

Arguments

`x`	a data frame containing the training set data, usually species data.
`y`	a vector containing the response variable, usually environmental data to be predicted from `x`.
`formula`	a symbolic description of the model to be fit. The details of model specification are given below.
`data`	an optional data frame, list or environment (or object coercible by `as.data.frame` to a data frame) containing the variables in the model. If not found in `data`, the variables are taken from `environment(formula)`, typically the environment from which `mat` is called.
`subset`	an optional vector specifying a subset of observations to be used in the fitting process.
`na.action`	a function which indicates what should happen when the data contain `NA`s. The default is set by the `na.action` setting of `options`, and is `na.fail` if that is unset. The "factory-fresh" default is `na.omit`. Another possible value is `NULL`, no action. Value `na.exclude` can be useful.
`method`	a character string indicating the dissimilarity (distance) coefficient to be used to define modern analogues. See Details, below.
`model`	logical; If `TRUE` the model frame of the fit is returned.
`kmax`	numeric; the maximum number of analogues considered during fitting. By default, `kmax` is equal to `n - 1`, where `n` is the number of sites. For large data sets this default is wasteful, as one would not expect to average over the entire training set; set `kmax` to restrict the upper limit on the number of analogues considered.
`object`	an object of class `mat`.
`k`	numeric; the number of analogues, i.e. the k-closest analogue model, for which fitted values and residuals are returned. Overrides the default stored in the object.
`weighted`	logical; should weighted averages be used instead of simple averages?
`...`	arguments can be passed to `distance` to provide additional options required for some dissimilarities.

Details

The Modern Analogue Technique (MAT) is perhaps the simplest of the transfer function models used in palaeoecology. The estimate of an environmental variable for a fossil sample is the, possibly weighted, mean of that variable across the k-closest modern analogues selected from a modern training set of samples. If used, the weights are the reciprocals of the dissimilarities between the fossil sample and each modern analogue.
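
The calculation described above can be sketched in a few lines of base R. This is a minimal illustration only: the toy data, the helper name mat_estimate and the choice of Euclidean distance are assumptions made for the example, and this is not the code used internally by mat.

```r
## Toy training set: 5 modern samples (rows) by 3 taxa (columns), as proportions
train_spp <- rbind(s1 = c(0.6, 0.3, 0.1),
                   s2 = c(0.5, 0.4, 0.1),
                   s3 = c(0.2, 0.3, 0.5),
                   s4 = c(0.1, 0.2, 0.7),
                   s5 = c(0.4, 0.4, 0.2))
## Environmental variable observed at each modern sample
train_env <- c(s1 = 10, s2 = 12, s3 = 20, s4 = 24, s5 = 14)
## A single fossil sample described by the same 3 taxa
fossil <- c(0.58, 0.32, 0.10)

## Hypothetical helper (not part of analogue): MAT estimate for one sample
mat_estimate <- function(fossil, train_spp, train_env, k = 3, weighted = FALSE) {
  ## Euclidean distance from the fossil sample to every modern sample
  d <- sqrt(rowSums(sweep(train_spp, 2, fossil)^2))
  ## indices of the k closest modern analogues
  nearest <- order(d)[seq_len(k)]
  if (weighted) {
    ## weights are the reciprocals of the dissimilarities
    weighted.mean(train_env[nearest], w = 1 / d[nearest])
  } else {
    mean(train_env[nearest])
  }
}

est_simple   <- mat_estimate(fossil, train_spp, train_env, k = 3)
est_weighted <- mat_estimate(fossil, train_spp, train_env, k = 3, weighted = TRUE)
```

In this sketch an exact match (zero dissimilarity) would receive an infinite weight; no attempt is made to handle that case.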

A typical model has the form response ~ terms where response is the (numeric) response data frame and terms is a series of terms which specifies a linear predictor for response. A typical form for terms is ., which is shorthand for "all variables" in data. If . is used, data must also be provided. If specific species (variables) are required then terms should take the form spp1 + spp2 + spp3.

Pairwise sample dissimilarity is defined by dissimilarity or distance coefficients. A variety of coefficients is supported; see distance for details of each.
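
To give a sense of what such a coefficient computes, the sketch below writes out the chord and squared chord distances for a pair of samples in base R. It assumes the usual definitions (Euclidean distance between square-root transformed proportions, and its square); the helper names are hypothetical, and distance remains the authoritative reference for how each method value is defined.

```r
## Hypothetical helpers, assuming the usual definitions of these coefficients
chord_dist    <- function(a, b) sqrt(sum((sqrt(a) - sqrt(b))^2))
sq_chord_dist <- function(a, b) sum((sqrt(a) - sqrt(b))^2)

## Two toy samples of species proportions
a <- c(0.25, 0.25, 0.50)
b <- c(0.50, 0.25, 0.25)

d_chord    <- chord_dist(a, b)
d_sq_chord <- sq_chord_dist(a, b)
```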

k is chosen by the user. The simplest choice for k is to evaluate the RMSE of the difference between the predicted and observed values of the environmental variable of interest for the training set samples for a sequence of models with increasing k. The number of analogues chosen is the value of k that has lowest RMSE. However, it should be noted that this value is biased as the data used to build the model are also used to test the predictive power.
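
The selection of k by lowest RMSE can be sketched as follows. The simulated training set, the Euclidean distance and all object names are assumptions made for this illustration; in practice the same information is available from summary.mat.

```r
## Hypothetical training set: 30 samples, 3 taxa with Gaussian responses
set.seed(1)
n <- 30
gradient <- seq(0, 30, length.out = n)
spp <- sapply(c(5, 15, 25), function(opt) exp(-(gradient - opt)^2 / (2 * 6^2)))
spp <- spp / rowSums(spp)                # convert abundances to proportions
env <- gradient + rnorm(n, sd = 2)       # observed environment, with noise

D <- as.matrix(dist(spp))                # pairwise Euclidean dissimilarities
diag(D) <- Inf                           # a sample is never its own analogue

## RMSE of the fitted values for models using k = 1, ..., kmax analogues
kmax <- 10
rmse <- sapply(seq_len(kmax), function(k) {
  fit <- sapply(seq_len(n), function(i) mean(env[order(D[i, ])[seq_len(k)]]))
  sqrt(mean((env - fit)^2))
})

## the chosen number of analogues is the k with the lowest RMSE
best_k <- which.min(rmse)
```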

An alternative approach is to use a separate optimisation data set on which to evaluate the value of k that gives the lowest root mean square error of prediction (RMSEP). This may be impractical with smaller sample sizes.

A third option is to bootstrap re-sample the training set many times. At each bootstrap sample, predictions for samples in the bootstrap test set can be made for k = 1, ..., n, where n is the number of samples in the training set. k can be chosen from the model with the lowest RMSEP. See function bootstrap.mat for further details on choosing k.
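
A simplified base R sketch of the bootstrap idea is given below. It is not the algorithm implemented in bootstrap.mat: the simulated data, the Euclidean distance, the restriction to 10 analogues, and the pooled RMSEP over all out-of-bag predictions are assumptions chosen to keep the example short.

```r
## Hypothetical training set: 40 samples, 3 taxa with Gaussian responses
set.seed(42)
n <- 40
gradient <- seq(0, 30, length.out = n)
spp <- sapply(c(5, 15, 25), function(opt) exp(-(gradient - opt)^2 / (2 * 6^2)))
spp <- spp / rowSums(spp)
env <- gradient + rnorm(n, sd = 2)
D <- as.matrix(dist(spp))                # pairwise Euclidean dissimilarities

kmax   <- 10                             # largest number of analogues tried
n_boot <- 100                            # number of bootstrap samples
sq_err <- numeric(kmax)                  # summed squared errors for each k
n_pred <- 0                              # number of out-of-bag predictions

for (b in seq_len(n_boot)) {
  in_bag <- sample.int(n, n, replace = TRUE)   # bootstrap training set
  oob <- setdiff(seq_len(n), in_bag)           # bootstrap test set
  for (i in oob) {
    ## kmax closest in-bag analogues of test sample i
    nearest <- in_bag[order(D[i, in_bag])][seq_len(kmax)]
    ## running mean gives the prediction for k = 1, ..., kmax in one pass
    preds <- cumsum(env[nearest]) / seq_len(kmax)
    sq_err <- sq_err + (preds - env[i])^2
    n_pred <- n_pred + 1
  }
}

rmsep  <- sqrt(sq_err / n_pred)          # pooled RMSEP for each k
best_k <- which.min(rmsep)
```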

The output from summary.mat can be used to choose k in the first case above. For predictions on an optimisation or test set see predict.mat. For bootstrap resampling of mat models, see bootstrap.mat.

The fitted values are for the training set and are taken as the, possibly weighted, mean of the environmental variable in question across the k-closest analogues. The fitted value for each sample does not include a contribution from the sample itself, even though, having zero dissimilarity, it is nominally its own closest analogue. This spurious zero distance is ignored and analogues are ordered in terms of the non-zero distances to other samples in the training set, with the k-closest contributing to the fitted value.
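
The exclusion of each sample from its own fitted value can be illustrated with a small base R sketch. The toy data, the Euclidean distance and the object names are assumptions for the example and do not reproduce the internals of mat.

```r
## Toy training set: 5 modern samples by 3 taxa, as proportions
train_spp <- rbind(s1 = c(0.6, 0.3, 0.1),
                   s2 = c(0.5, 0.4, 0.1),
                   s3 = c(0.2, 0.3, 0.5),
                   s4 = c(0.1, 0.2, 0.7),
                   s5 = c(0.4, 0.4, 0.2))
train_env <- c(s1 = 10, s2 = 12, s3 = 20, s4 = 24, s5 = 14)

## Pairwise Euclidean distances; the diagonal (sample to itself) is zero
D <- as.matrix(dist(train_spp))

## Fitted values for the model using k = 2 analogues
k <- 2
fitted_k <- sapply(seq_len(nrow(D)), function(i) {
  others  <- setdiff(seq_len(nrow(D)), i)            # drop the sample itself
  nearest <- others[order(D[i, others])][seq_len(k)] # k closest other samples
  mean(train_env[nearest])
})
names(fitted_k) <- rownames(D)
```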

Typical usages for residuals.mat are:

    resid(object, k, weighted = FALSE, ...)

Value

mat returns an object of class mat with the following components:

`standard`	list; the model statistics based on simple averages of k-closest analogues. See below.
`weighted`	list; the model statistics based on weighted averages of k-closest analogues. See below.
`Dij`	matrix of pairwise sample dissimilarities for the training set `x`.
`orig.x`	the original training set data.
`orig.y`	the original environmental data or response, `y`.
`call`	the matched function call.
`method`	the dissimilarity coefficient used.

If model = TRUE then additional components "terms" and "model" are returned containing the terms object and model frame used.

fitted.mat returns a list with the following components:

`estimated`	numeric; a vector of fitted values.
`k`	numeric; this is the k-closest analogue model with lowest apparent RMSE.
`weighted`	logical; are the fitted values the weighted averages of the environment for the k-closest analogues? If `FALSE`, the fitted values are the simple averages of the environment for the k-closest analogues.

Note

The object returned by mat contains lists "standard" and "weighted" both with the following elements:

est: a matrix of estimated values for the training set samples for models using k analogues, where k = 1, ..., n. n is the number of samples in the training set. Rows contain the values for each model of size k, with columns containing the estimates for each training set sample.
resid: matrix; as for "est", but containing the model residuals.
rmsep: vector; containing the leave-one-out root mean square error of prediction.
avg.bias: vector; contains the average bias (mean of residuals) for models using k analogues, where k = 1, ..., n. n is the number of samples in the training set.
max.bias: vector; as for "avg.bias", but containing the maximum bias statistics.
r.squared: vector; as for "avg.bias", but containing the R^2 statistics.

Author(s)

Gavin L. Simpson

References

Gavin, D.G., Oswald, W.W., Wahl, E.R. and Williams, J.W. (2003) A statistical approach to evaluating distance metrics and analog assignments for pollen records. Quaternary Research 60, 356–367.

Overpeck, J.T., Webb III, T. and Prentice I.C. (1985) Quantitative interpretation of fossil pollen spectra: dissimilarity coefficients and the method of modern analogues. Quaternary Research 23, 87–108.

Prell, W.L. (1985) The stability of low-latitude sea-surface temperatures: an evaluation of the CLIMAP reconstruction with emphasis on the positive SST anomalies, Report TR 025. U.S. Department of Energy, Washington, D.C.

Sawada, M., Viau, A.E., Vettoretti, G., Peltier, W.R. and Gajewski, K. (2004) Comparison of North-American pollen-based temperature and global lake-status with CCCma AGCM2 output at 6 ka. Quaternary Science Reviews 23, 225–244.

Examples

## Imbrie and Kipp Sea Surface Temperature
data(ImbrieKipp)
data(SumSST)
data(V12.122)

## merge training set and core samples
dat <- join(ImbrieKipp, V12.122, verbose = TRUE)

## extract the merged data sets and convert to proportions
ImbrieKipp <- dat[[1]] / 100
ImbrieKippCore <- dat[[2]] / 100

## fit the MAT model using the chord distance measure
ik.mat <- mat(ImbrieKipp, SumSST, method = "chord")
ik.mat

## model summary
summary(ik.mat)

## fitted values
fitted(ik.mat)

## model residuals
resid(ik.mat)

## draw summary plots of the model
par(mfrow = c(2,2))
plot(ik.mat)
par(mfrow = c(1,1))

## reconstruct for the V12.122 core data, using the merged core
## samples (as proportions) created above
coreV12.mat <- predict(ik.mat, ImbrieKippCore, k = 3)
coreV12.mat
summary(coreV12.mat)

## draw the reconstruction
reconPlot(coreV12.mat, use.labels = TRUE, display.error = "bars",
          xlab = "Depth", ylab = "SumSST")

## fit the MAT model using the chord distance measure
## and restrict the number of analogues we fit models for to 1:20
ik.mat2 <- mat(ImbrieKipp, SumSST, method = "chord", kmax = 20)
ik.mat2

gavinsimpson/analogue documentation built on June 12, 2025, 7:35 p.m.