lrpsadmm.cv: Perform K-fold cross-validation for the Low-Rank plus Sparse...

Description Usage Arguments Details Value See Also Examples

View source: R/cross_validate.R

Description

Performs K-fold cross-validation in order to select the tuning parameters lambda and gamma.

Recall that the penalty for the LRpS estimator is written as λ_1 ||S||_1 + λ_2 Trace(L) in the objective function of lrpsadmm. This can be equivalently rewritten in terms of the regularisation parameters λ and γ as follows

λ γ ||S||_1 + λ (1 - γ) Trace(L),

for γ \in (0, 1).

For a given value of γ, one can perform cross-validation along the regularisation path in order to choose λ. This function computes the regularisation paths for each value of γ supplied as arguments and performs cross-validation. The pair (λ, γ) that produces the smallest cross-validated log-likelihood is returned.

Usage

1
2
3
4
5
6
lrpsadmm.cv(X, gammas = c(0.05, 0.1, 0.15), covariance.estimator = cor,
  n.folds = 5, lambdas = NULL, lambda.max = NULL,
  lambda.ratio = 1e-04, n.lambdas = 20, max.sparsity = 0.5,
  max.rank = NA, abs_tol = 1e-05, rel_tol = 0.001, max.iter = 2000,
  mu = 1, verbose = FALSE, seed = NA, zeros = NULL,
  backend = "RcppEigen")

Arguments

X

n x p data matrix

gammas

A real or a vector of reals between 0 and 1 (non inclusive). For each value of gamma the regularisation path is computed and cross-validation is performed. See examples for guidance on how to choose reasonable values for gamma. Too high a value might result in the problem being under-identified and therefore numerical instabilities.

covariance.estimator

A function that takes a data matrix and outputs an estimate of its correlation matrix. Default: the sample correlation matrix output by the cor function.

n.folds

Number of folds for cross-validation. Default 5.

lambdas

A decreasing sequence of values of lambda. See Details for the default values.

lambda.max

A positive real. Maximum value of lambda. See Details.

lambda.ratio

A real between 0 and 1. The smallest value of lambda is given by lambda.max * lambda.ratio. See Details.

n.lambdas

A positive integer. The number of values of lambda to generate according a geometric sequence between lambda.max and lambda.max * lambda.ratio. See Details.

max.sparsity

A real between 0 and 1. Abort the computation of the path if S becomes denser than this value.

max.rank

A real between 0 and 1. Abort the computuation of the path if the rank of L becomes higher than this value.

abs_tol

abs_tol parameter of the lrpsadmm function.

rel_tol

rel_tol parameter of the lrpsadmm function.

max.iter

max.iter parameter of the lrpsadmm function.

mu

mu parameter of the lrpsadmm function.

verbose

A boolean. Whether to print the value of lambda, gamma, sparsity of S, etc... after each fit

seed

Set the seed of the random number generator used for the K folds.

zeros

A p x p matrix with entries set to 0 or 1. Whereever its entries are 0, the entries of the estimated S will be forced to 0.

backend

The backend parameter of lrpsadmm. It is one of 'R' or 'RcppEigen'.

Details

Recall that the penalty for the LRpS estimator is written as λ_1 ||S||_1 + λ_2 Trace(L) in the objective function of lrpsadmm. This can be equivalently rewritten in terms of the regularisation parameters λ and γ as follows

λ γ ||S||_1 + λ (1 - γ) Trace(L),

for γ \in (0, 1).

For a given value of γ, one can perform cross-validation along the regularisation path in order to choose λ. This function computes the regularisation paths for each value of γ supplied as arguments and performs cross-validation. One can then select the pair (λ, γ) that produces the smallest cross-validated log-likelihood.

The function lrpsadmm is fitted for successive values of λ using warm starts. The sequence of values of λ can be provided directly by the user. It is automatically sorted in decreasing order. By default, a decreasing sequence of 20 values within a reasonable range is selected as follows. We set λ_{max} = \max_{ij, i \neq j} |Σ_{ij}|/γ and λ_{min} = λ_{max} * lambda.ratio; then 20 values between λ_{max} and λ_{min} are taken following a geometric progression.

Because it does not make much sense to fit this estimator when the sparse estimate S becomes too dense or if the rank of the low-rank estimate L becomes too high, the computation of the path is aborted when the sparsity of S reaches max.sparsity or when the rank of L reaches max.rank.

Recall that cross-validation overselects. It might be good for prediction purposes but methods such statibility selection are probably better if the support of S is what is of interest.

Value

An object of class lrpsadmmcv. It contains the values of the mean cross-validated log-likelihood, its standard deviation for each pair (lambda, gamma) and an object of class lrpsadmm.path for each value of gamma. See the examples for how to access the selected tuning paramters, best fit etc...

See Also

lrpsadmm lrpadmm.path

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
set.seed(0)
## 1 - A simple example: Well-powered dataset
sim.data <- generate.latent.ggm.data(n=2000, p=100, h=5, outlier.fraction = 0.0,
                                    sparsity = 0.02, sparsity.latent = 0.7)
ground.truth <- sim.data$precision.matrix[1:100, 1:100]
ground.truth <- 1 * (( ground.truth - diag(diag(ground.truth)) ) !=0)
X <- sim.data$obs.data;
gammas <- c(0.1, 0.15, 0.2)
# Let the function decide the range of value of Lambda
cvpath <- lrpsadmm.cv(X, gammas = gammas,
                         lambda.ratio = 1e-02, n.lambdas = 30, verbose = TRUE)
best.gamma <- cvpath$best.gamma
best.lambda <- cvpath$best.lambda
best.fit <- cvpath$best.fit$fit # Object of class lrpsadmm
plot(best.fit)
# The value Gamma = 0.15 is selected by X-validation
plot(cvpath) # Plot the outcome. Object of class lrpsadmmcv
# We can look at the path corresponding to this value
best.path <- cvpath$cross.validated.paths[[which(gammas == best.gamma)]]$cross.validated.path
# We know the ground truth, so let's use it to see how well we did:
plot(best.path, ground.truth = ground.truth)

## 2 - Data with outliers: use a robust estimator
set.seed(0)
sim.data <- generate.latent.ggm.data(n=2000, p=50, h=5, outlier.fraction = 0.05,
                                    sparsity = 0.02, sparsity.latent = 0.7)
ground.truth <- sim.data$precision.matrix[1:50, 1:50]
# Remove the elements along the diagonal. Keep a matrix of 0s and 1s
ground.truth <- 1 * (( ground.truth - diag(diag(ground.truth)) ) !=0)
X <- sim.data$obs.data;
# We can afford high values of gamma because n >> p
gammas <- c(0.2, 0.3, 0.4)
# Use the Kendall based estimator:
cvpath <- lrpsadmm.cv(X, gammas = gammas, covariance.estimator = Kendall.correlation.estimator,
                         lambda.ratio = 1e-03, n.lambdas = 30, verbose = TRUE)
plot(cvpath)
best.gamma <- cvpath$best.gamma
best.lambda <- cvpath$best.lambda
best.path <- cvpath$cross.validated.paths[[which(gammas == best.gamma)]]$cross.validated.path
plot(best.path, ground.truth = ground.truth)

## 3 - A tougher problem n is close to p
set.seed(0)
sim.data <- generate.latent.ggm.data(n=150, p=100, h=5, outlier.fraction = 0.0,
                                     sparsity = 0.02, sparsity.latent = 0.7)
ground.truth <- sim.data$precision.matrix[1:100, 1:100]
# Remove the elements along the diagonal. Keep a matrix of 0s and 1s
ground.truth <- 1 * (( ground.truth - diag(diag(ground.truth)) ) !=0)
X <- sim.data$obs.data;
# Since n < p, do not try too high values of gamma. Stay close to 0.
gammas <- c(0.07, 0.1, 0.12)
cvpath <- lrpsadmm.cv(X, gammas = gammas,
                         lambda.ratio = 0.1, n.lambdas = 20, verbose = TRUE)
plot(cvpath) # Plot the outcome

# Clearly the range selected by the function is not good enough
# We need better values for lambda. In that case it is better
# to see which value of lambda yields a very sparse graph
# for a given gamma:
gammas <- c(0.1)
# We set the seed so we can compre the x-validated log-likelihoods between
# two runs of the function
cvpath <- lrpsadmm.cv(X, gammas = gammas, lambda.max = 2.2,
                        lambda.ratio = 0.1, n.lambdas = 20, verbose = TRUE, seed = 0)
plot(cvpath)
gammas <- c(0.12)
cvpath <- lrpsadmm.cv(X, gammas = gammas, lambda.max = 1.7,
                         lambda.ratio = 0.1, n.lambdas = 20, verbose = TRUE, seed = 0)
plot(cvpath)

benjaminfrot/lrpsadmm documentation built on Oct. 19, 2019, 8:13 a.m.