robFitConGraph: Graph-constrained robust scatter estimation.
In robFitConGraph: Graph-Constrained Robust Covariance Estimation

View source: R/MLEfunc.R

robFitConGraph

R Documentation

Graph-constrained robust scatter estimation.

Description

The function computes a robust estimate of a scatter matrix subject to zero-constraints in its inverse. The methodology is described in Vogel & Tyler (2014).

Usage

robFitConGraph(
  X,
  amat,
  df = 3,
  tol = 1e-05,
  sigma1,
  plug_in = TRUE,
  direct = FALSE
)

Arguments

`X`	A data matrix with n rows and p columns, representing n observations and p variables. Elements of `X` must be numeric and n must be at least p+1. Rows containing NA are omitted.
`amat`	A p times p matrix representing the adjacency matrix of a graphical model. `amat` must be symmetric with numerical entries `0` or `1`. Alternatively, a Boolean matrix. The entries on the diagonal are irrelevant. Must not contain any NAs.
`df`	A positive real number or `Inf`. The degrees of freedom of the t-distribution used (see Details below). The value `df = Inf` corresponds to the sample covariance matrix. The value `df = 0` is not allowed as Tyler's M-estimator is currently not implemented. Default value is `3`.
`tol`	tolerance for numerical convergence. Iteration stops if the maximal element-wise difference between two successive matrices is less than `tol`. Must be at least `10e-14`. Default is `1e-5`.
`sigma1`	numerical. A positive real value. This value is needed for the p-value of the model fit. If missing, it is computed by `find_sigma1(df_data = Inf, df_est = df, p = ncol(X))`. See Details below and the documentation of the function `find_sigma1`.
`plug_in`	logical. The function offers two types of estimates: the plug-in M-estimator and the direct M-estimator. If `plug_in` is `TRUE`, the plug-in estimate is computed. If `FALSE`, the direct M-estimator is computed. The plug-in estimator is faster, but has higher variance. Default is `TRUE`. Ignored if `df == Inf`.
`direct`	logical. If `TRUE`, the direct estimate is computed, otherwise the plug-in estimate. Default is `FALSE`. In case of conflicting specifications of `plug-in` and `direct`, `plug_in` overrides `direct`. Ignored if `df == Inf`.

Details

Two types of graph-constrained M-estimates of scatter based on the elliptical t-distribution are implemented: the direct estimate and the plug-in estimate. The direct estimate is referred to as graphical M-estimator in Vogel & Tyler (2014).

The plug-in estimate is two algorithms performed sequentially: First an unconstrained t-maximum likelihood estimate of scatter is computed (the same as cov.trob from MASS). This is then plugged into the Gaussian graphical model fitting routine (the same as fitConGraph from ggm). Specifically Algorithm 17.1 from Hastie, Tibshirani, Friedman (2009) is used.

The direct estimate is the actual maximum-likelihood estimator within the elliptical graphical model based on the elliptical t-distribution. The algorithm is an iteratively-reweighted least-squares algorithm, where the Gaussian graphical model fitting procedure is nested into the t-estimation iteration. The direct estimate therefore takes longer to compute, but the estimator has a better statistical efficiency for small sample sizes. Both estimators are asymptotically equivalent. The estimates tend to be very close to each other for large sample sizes.

The deviance test statistic

The robustified deviance test statistic (short: deviance) for testing a graphical model G is

D_n = n ( log det(S_n(G)) - log det(S_n) ),

where S_n(G) is a graph-constrained scatter estimator and S_n the corresponding unconstrained scatter estimator, see Vogel & Tyler (2014, p. 866 bottom). Under the graphical model G, the deviance D_n converges to σ_1 χ_r^2, where r is the number of missing edges in G. The constant σ_1 depends on the scatter estimate S_n, the dimension p, and the population distribution. It can either be provided directly (as sigma1) or, if missing, is computed by find_sigma1, assuming a Gaussian population distribution.

Although robFitConGraph combines the functionality of fitConGraph and cov.trob and contains both as special cases, it uses neither. The algorithms are implemented in C++.

Input and output of robFitConGraph are intended to resemble fitConGraph from the package ggm. Some notable differences:

fitConGraph takes as input the unconstrained covariance matrix; robFitConGraph takes the actual data.
fitConGraph alternatively offers to specify the graphical model as list of cliques; robFitConGraph only takes the adjacency matrix amat.
fitConGraph returns a value called the degrees of freedom as df. These degrees of freedom mean the number of missing edges, which appear as the degrees of freedom of the chi-square distribution of the deviance test. However, these degrees of freedom are completely unrelated to the the argument df_est of robFitConGraph, which refers to the degrees of freedom of the t-distribution defining the scatter estimate. robFitConGraph returns the number of missing edges as missing_edges.

Value

List with 8 elements:

`Shat`	p times p symmetric scatter matrix estimate.
`mu`	numerical p-vector, the location estimate In case of `df == Inf`, this is the sample mean.
`pval`	numerical. The p-value of the model fitted. D_n/σ_1 is compared to the χ_r^2 distribution, see Details.
`dev`	numerical. The deviance test statistic D_n, see Details.
`missing_edges`	integer. The number of missing edges r as indicated by the argument `amat`.
`sigma1`	numerical. Either the argument `sigma1` (if provided), or the return value of `find_sigma1(df_data = Inf, df_est = df, p = ncol(X))`
`em_it`	integer. Number of iterations of the t-MLE computation.
`ips_it`	integer. In the case of the plug-in estimate, this is the number of iterations of the Gaussian graphical model fitting procedure (Algorithm 17.1) in Hastie et al 2004). In the case of the direct estimate, the Gaussian graphical model fitting is executed `em_it` times, and `ips_it` returns the average number of iterations.

The last five return values are mainly for traceability purposes. Particularly dev, missing_edges, and sigma1 are easily obtained by the inputs or outputs of robFitConGraph. They are the ingredients needed to compute the p-value:

pval <- pchisq(q = dev/sigma1, df = missing_edges, lower.tail = FALSE)

Author(s)

Stuart Watt, Daniel Vogel

References

Vogel, D., Tyler, D. E. (2014): Robust estimators for nondecomposable elliptical graphical models, Biometrika, 101, 865-882

Hastie, T., Tibshirani, R. and Friedman, J. (2004). The elements of statistical learning. New York: Springer.

Examples

# --- Example 1: anxieties data ---

# The data set 'anxieties' contains 8 anxiety variables of 82 observations. We test the
# null hypothesis that, given Generalized Anxiety (GAD), Music performance anxiety (MPA)
# is conditionally independent of all remaining 6 variables:

# - build the corresponding graphical model -
data(anxieties)
amat = matrix(1,ncol=ncol(anxieties),nrow=ncol(anxieties))
colnames(amat) <- colnames(anxieties)
rownames(amat) <- colnames(amat)
amat["MPA",c("SAD","PD","AG","SP","SEP","ILL")] <- 0
amat[c("SAD","PD","AG","SP","SEP","ILL"),"MPA"] <- 0
amat
#     MPA GAD SAD PD AG SP SEP ILL
# MPA   1   1   0  0  0  0   0   0
# GAD   1   1   1  1  1  1   1   1
# SAD   0   1   1  1  1  1   1   1
# PD    0   1   1  1  1  1   1   1
# AG    0   1   1  1  1  1   1   1
# SP    0   1   1  1  1  1   1   1
# SEP   0   1   1  1  1  1   1   1
# ILL   0   1   1  1  1  1   1   1

robFitConGraph(X = anxieties, amat = amat)$pval
# [1] 0.881159
# This provides no evidence against the null hypothesis.

# the non-robust, sample-covariance-based model fit:
robFitConGraph(X = anxieties, amat = amat, df = Inf)$pval
# [1] 0.6310895
# ...provides no evidence against the null hypothesis either.

# The latter is also obtained as:
# pchisq(q = ggm::fitConGraph(S = cov(anxieties), amat = amat, 
#                             n = nrow(anxieties))$dev, 
#        df = 6, lower.tail = FALSE)



# --- Example 2: simulated data - the chordless 7-cycle ---

# - build the corresponding graphical model -
chordless_p_cycle <- function(rho, p){
  M <- diag(1,p)
  for (i in 1:(p-1)) M[i,i+1] <- M[i+1,i] <- -rho
  M[1,p] <- M[p,1] <- -rho
  return(M)
}
p <- 7                             # number of variables
rho <- 0.4                         # partial correlation
PCM <- chordless_p_cycle(rho, p)    # partial correlation matrix
SM <- cov2cor(solve(PCM))          # shape matrix (i.e covariance matrix up to scale)
model <- abs(sign(PCM))            # adjacency matrix of the chordless-7-cycle
# > model
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,]    1    1    0    0    0    0    1
# [2,]    1    1    1    0    0    0    0
# [3,]    0    1    1    1    0    0    0
# [4,]    0    0    1    1    1    0    0
# [5,]    0    0    0    1    1    1    0
# [6,]    0    0    0    0    1    1    1
# [7,]    1    0    0    0    0    1    1

# This is the cordless-7-cycle (p.872 Figure 1 (a) in Vogel & Tyler, 2014).
# All non-zero partial correlations are 0.4.
# The true covariance is (up to scale) 'SM'. This matrix is constructed such
# that it has zero entries in its inverse as specified by 'model'.


# - generate data from the graphical model (elliptical t3 distribution) -
n <- 50            # number of observations
df_data <- 3       # degrees of freedom
library(mvtnorm)   # for rmvt function
set.seed(918273)   # for reproducability
X <- rmvt(n = n, sigma = SM, df = df_data)

# X contains a data set of size n = 50 and dimension p = 7, sampled from the
# elliptical t-distribution with 3 degrees of freedom and shape matrix 'SM'


# - comparing estimates -

# the true correlation matrix:
round(cov2cor(SM), d = 2)

# Pearson correlations:
round(cov2cor(cov(X)), d = 2)

# robust correlations based on the direct graph-constrained t-MLE:
S1 <- robFitConGraph(X, amat = model, df = df_data, direct = TRUE)$Shat
round(cov2cor(S1), d = 2)

# robust correlations based on the plug_in graph-constrained t-MLE:
S2 <- robFitConGraph(X, amat = model, df = df_data, plug_in = TRUE)$Shat
round(cov2cor(S2), d = 2)

# The correlation estimates based on S1 and S2 are close to the true
# correlations, whereas the sample correlations differ strongly.
# Note: sample correlations are not asymptotically normal at the t3 distribution.

robFitConGraph documentation built on Dec. 1, 2022, 1:21 a.m.