clever: Identify outliers with 'clever'

View source: R/clever.R

Description

Calculates PCA leverage or robust distance and identifies outliers.

Usage

clever(
  X,
  projection = "PCA_kurt",
  out_meas = "leverage",
  DVARS = TRUE,
  detrend_PCs = TRUE,
  PCATF_kwargs = NULL,
  kurt_quantile = 0.95,
  kurt_detrend = TRUE,
  id_outliers = TRUE,
  lev_cutoff = 4,
  rbd_cutoff = 0.9999,
  lev_images = FALSE,
  verbose = FALSE
)

Arguments

X

Numerical data matrix. Should be wide (N observations x P variables, N << P).

projection

Character vector indicating the projection methods to use. Choose at least one of the following: "PCA_var" for PCA + variance, "PCA_kurt" for PCA + kurtosis, and "PCATF" for PCA Trend Filtering + variance. Or, use "all" to use all projection methods. Default: c("PCA_kurt").

out_meas

Character vector indicating the outlyingness measures to compute. Choose at least one of the following: "leverage" for leverage, or "robdist" for robust distance. Or, use "all" to use both methods. Default: c("leverage").

DVARS

Should DVARS (Afyouni and Nichols, 2017) be computed too? Default is TRUE.

detrend_PCs

Detrend all PCs before computing leverage or robust distance? Default: TRUE.

Detrending is recommended for time-series data, especially if there are many time points or changing circumstances, such as in task-based fMRI. Detrending should not be used with non-time-series data because the observations are not temporally related.
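To illustrate what detrending a PC time series means, here is a minimal base R sketch. It assumes a straight-line trend removed by ordinary least squares; this page does not say which detrending method clever uses internally, so the package may do something different. The simulated series and all variable names are hypothetical.

```r
# Hypothetical PC score series: a slow linear drift plus noise.
set.seed(1)
n_timepoints <- 100
time_index <- seq_len(n_timepoints)
pc_scores <- 0.05 * time_index + rnorm(n_timepoints)

# Assumption: the trend is a straight line, estimated by least squares.
# This is an illustration only, not necessarily clever's own method.
fit <- lm(pc_scores ~ time_index)
pc_detrended <- residuals(fit)

# Refit on the residuals to confirm that no linear trend remains.
slope_after <- coef(lm(pc_detrended ~ time_index))[["time_index"]]
```

With non-time-series data the row order carries no meaning, so a fit against time_index would remove an arbitrary pattern rather than a trend.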

PCATF_kwargs

Named list of arguments for PCATF projection method. Only applies if ("PCATF" %in% projection).

Valid entries are:

K

maximum number of PCs to compute (Default: 1000)

lambda

trend filtering parameter (Default: 0.05)

niter_max

maximum number of iterations (Default: 1000)

verbose

Print updates? (Default: FALSE)
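As a sketch of how the entries above fit together, the list below writes out the documented defaults explicitly and passes them through PCATF_kwargs. The object name pcatf_args and the simulated matrix are hypothetical; the call is only attempted if the clever package is installed.

```r
# Named list using the entries documented above; the values are the
# documented defaults, written out explicitly.
pcatf_args <- list(
  K = 1000,          # maximum number of PCs to compute
  lambda = 0.05,     # trend filtering parameter
  niter_max = 1000,  # maximum number of iterations
  verbose = FALSE    # print updates?
)

# Only run the call if the clever package is available.
if (requireNamespace("clever", quietly = TRUE)) {
  X <- matrix(rnorm(100 * 1e4), ncol = 1e4)  # simulated data, 100 x 10,000
  clev <- clever::clever(X, projection = "PCATF", PCATF_kwargs = pcatf_args)
}
```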

kurt_quantile

What cutoff quantile for kurtosis should be used? Only applies if ("PCA_kurt" %in% projection). Default: 0.95.

kurt_detrend

Should the PCs be detrended before measuring kurtosis? Only applies if ("PCA_kurt" %in% projection). Default: TRUE.

Detrending is highly recommended for time-series data, because trends can induce high kurtosis even in the absence of outliers. Detrending should not be done with non-time-series data because the observations are not temporally related.

id_outliers

Should the outliers be identified? Default: TRUE.

lev_cutoff

The outlier cutoff value for leverage, as a multiple of the median leverage. Only used if "leverage" %in% out_meas and id_outliers. Default: 4, i.e. 4 times the median leverage.
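The median-based cutoff can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: leverage is taken as the diagonal of the hat matrix formed from the left singular vectors, the number of retained PCs (n_pcs) is chosen arbitrarily rather than by variance or kurtosis, and the data, including the planted outlier at row 42, are simulated.

```r
set.seed(1)
n_timepoints <- 100
n_voxels <- 500
X <- matrix(rnorm(n_timepoints * n_voxels), ncol = n_voxels)
X[42, ] <- X[42, ] + 3   # plant one artificial outlier (hypothetical)

# Center the columns, then keep the left singular vectors of a few PCs.
X_centered <- scale(X, center = TRUE, scale = FALSE)
n_pcs <- 5   # assumption: arbitrary number of retained PCs
U <- svd(X_centered, nu = n_pcs, nv = 0)$u

# Leverage of each observation: the diagonal of U %*% t(U).
leverage <- rowSums(U^2)

# Cutoff as a multiple of the median leverage, mirroring lev_cutoff = 4.
lev_cutoff <- 4
cutoff <- lev_cutoff * median(leverage)
outlier_flag <- leverage > cutoff

which(outlier_flag)   # indices of flagged observations
```

Because the cutoff scales with the median, it adapts to the number of retained PCs: the leverages always sum to n_pcs, so keeping more PCs raises both the typical leverage and the threshold.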

rbd_cutoff

The outlier cutoff quantile for MCD distance. Only used if "robdist" %in% out_meas and id_outliers. Default: 0.9999, for the 0.9999 quantile.

The quantile is computed from the estimated F distribution.

lev_images

Should leverage images be computed? If FALSE, memory is conserved. Default: FALSE.

verbose

Should occasional updates be printed? Default: FALSE.

Details

clever will use all combinations of the requested projection and out_meas methods that make sense. For example, if projection=c("PCATF", "PCA_var", "PCA_kurt") and out_meas=c("leverage", "robdist") then these five combinations will be used: PCATF with leverage, PCA + variance with leverage, PCA + variance with robust distance, PCA + kurtosis with leverage, and PCA + kurtosis with robust distance. Each method combination will yield its own out_meas time series.
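The five combinations in the example above can be enumerated with a short base R sketch. It hard-codes the one exclusion implied by that list (PCATF with robust distance); how clever decides which combinations "make sense" internally is not shown on this page, and the object name combos is hypothetical.

```r
projection <- c("PCATF", "PCA_var", "PCA_kurt")
out_meas <- c("leverage", "robdist")

# All pairings of projection method and outlyingness measure.
combos <- expand.grid(
  projection = projection,
  out_meas = out_meas,
  stringsAsFactors = FALSE
)

# Drop the pairing absent from the documented list:
# PCATF with robust distance.
is_excluded <- combos$projection == "PCATF" & combos$out_meas == "robdist"
combos <- combos[!is_excluded, ]

print(combos)
```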

Value

A clever object, i.e. a list with components

params

A list of all the arguments used.

projections
PC_var
indices

The indices retained from the original SVD projection to make the variance-based PC projection.

PCs

The PC projection.

PC_kurt
indices

The indices retained from the original SVD projection to make the kurtosis-based PC projection. They are ordered from highest kurtosis to lowest kurtosis.

PCs

The PC projection. PCs are ordered in the standard way, from highest variance to lowest variance, instead of by kurtosis.

PCATF
indices

The indices of the trend-filtered PCs used to make the projection.

PCs

The PCATF result.

outlier_measures
PC_var__lev

The leverage values for the PC_var projection.

PC_kurt__lev

The leverage values for the PC_kurt projection.

PCATF__lev

The leverage values for the PCATF projection.

PC_var__rbd

The robust MCD distance values for the PC_var projection.

PC_kurt__rbd

The robust MCD distance values for the PC_kurt projection.

DVARS_DPD

The Delta percent DVARS values.

DVARS_ZD

The DVARS z-scores.

outlier_cutoffs
lev

The leverage cutoff for outlier detection: lev_cutoff times the median leverage.

MCD

The robust distance cutoff for outlier detection: the rbd_cutoff quantile of the estimated F distribution.

DVARS_DPD

The Delta percent DVARS cutoff: +/- 5 percent.

DVARS_ZD

The DVARS z-score cutoff: the one-sided 5 percent significance level with Bonferroni FWER correction.

outlier_flags
PC_var__leverage

Logical vector indicating whether each observation surpasses the outlier cutoff.

PC_kurt__leverage

Logical vector indicating whether each observation surpasses the outlier cutoff.

PCATF__leverage

Logical vector indicating whether each observation surpasses the outlier cutoff.

PC_var__robdist

Logical vector indicating whether each observation surpasses the outlier cutoff.

PC_kurt__robdist

Logical vector indicating whether each observation surpasses the outlier cutoff.

DVARS_DPD

Logical vector indicating whether each observation surpasses the outlier cutoff.

DVARS_ZD

Logical vector indicating whether each observation surpasses the outlier cutoff.

robdist_info
PC_var__robdist
inMCD

Logical vector indicating whether each observation was in the MCD estimate.

outMCD_scale

The scale for out-of-MCD observations.

Fparam

Named numeric vector: c, m, df1, and df2.

PC_kurt__robdist
inMCD

Logical vector indicating whether each observation was in the MCD estimate.

outMCD_scale

The scale for out-of-MCD observations.

Fparam

Named numeric vector: c, m, df1, and df2.

MCD_scale

The scale value for out-of-MCD observations, and NA for in-MCD observations. NULL if method is not robust distance.

lev_images
mean

The average of the PC directions, weighted by the unscaled PC scores at each outlying time point (U[i,] * V^T). Row names are the corresponding time points.

top

The PC direction with the highest PC score at each outlying time point. Row names are the corresponding time points.

top_dir

The index of the PC direction with the highest PC score at each outlying time point. Named by timepoint.
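The DVARS cutoffs listed under outlier_cutoffs can be illustrated with a short base R sketch. It assumes the Bonferroni correction divides the 5 percent level by the number of time points and that the z-scores are compared against a standard normal quantile; the exact number of tests and the reference distribution used by the package are not stated on this page. The variable names are hypothetical.

```r
n_timepoints <- 100
alpha <- 0.05   # one-sided significance level

# Bonferroni: split the level across all tests.
# Assumption: one test per time point, standard normal reference.
zd_cutoff <- qnorm(1 - alpha / n_timepoints)

# Delta percent DVARS cutoff as documented: +/- 5 percent.
dpd_cutoff <- 5

c(DVARS_DPD = dpd_cutoff, DVARS_ZD = zd_cutoff)
```

The corrected z-score cutoff grows with the number of time points, so longer series need larger z-scores before an observation is flagged.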

Examples

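library(clever)

# Simulate a wide matrix: 100 time points (rows) by 10,000 voxels (columns).
n_voxels <- 1e4
n_timepoints <- 100
X <- matrix(rnorm(n_timepoints * n_voxels), ncol = n_voxels)

# Run clever with its default settings.
clev <- clever(X)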

muschellij2/clever documentation built on Sept. 26, 2020, 3:54 p.m.