In muschellij2/clever: Calculate Leverage of Component Models

library(knitr)
opts_chunk$set(autodep = TRUE, cache = TRUE)

Overview

The Clever package implements the PCA leverage outlier detection method for high-dimensional (HD) data, as detailed in this manuscript:

Citation: Mejia, Amanda F., Mary Beth Nebel, Ani Eloyan, Brian Caffo, and Martin A. Lindquist. "PCA leverage: outlier detection for high-dimensional functional magnetic resonance imaging data." Biostatistics 18, no. 3 (2017): 521-536. paper link

In summary, the manuscript proposes a method to detect outlier observations in HD data by drawing on the traditional statistical ideas of PCA, leverage, and outlier detection. While the primary application is for detecting outlying time points in an fMRI scan, the method can also be applied to other forms of HD data, such as gene expression data.

Method Outline

As input, the algorithm takes a T x V matrix, Y. In our case, Y represents an fMRI run, where each row of Y is a vectorized volume, and each column represents one timepoint. Next, the algorithm performs the following steps:

Normalize the Y matrix.
Perform PCA on the normalized Y matrix using singular value decomposition (SVD), in order to obtain the PC score matrix, U (of dimension T x T).
To reduce the dimensions, retain the first Q rows of the U matrix corresponding to the first Q < T principal components. We will refer to this submatrix as A (of dimension T x Q). Note: To choose the model order Q, we retain only components with a greater-than-average eigenvalue for the "mean" method, or components with kurtosis greater than 2 for the "kurtosis" method. The number kept will be further restricted if robust distance will be used, since it requires T to be appropriately large relative to Q for estimation of the covariance matrix.
Now we can apply outlier detection on A. The primary method is PCA leverage, though we also propose an alternative called robust distance (see paper for further details). The output of either of these outlier detection methods is a T x 1 vector representing the "outlyingness" of each time point.
Thresholds are used to identify the outliers. We choose 3 thresholds, with increasing level of stringency. Our function outputs the outliers associated with each threshold.

Installation

Install the package from GitHub and load it:

devtools::install_github('mandymejia/clever')

library(clever)

Tutorial Data

ABIDE is a publicly available resource of neuroimaging and phenotypic information from 1112 subjects consisting of 20 datasets collected at 16 sites (Di Martino and others, 2014). Our simulated dataset is based on resting-state fMRI scans from two subjects collected as part of the ABIDE dataset. The first dataset contains artifacts toward the beginning time point; the second is relatively artifact-free. Axial slices are used instead of the entire volumes to minimize the clever package's download time.

A Simple Example

Here, we will run through a simple example. First let's pull the data, as follows:

data(Dat1)
data(Dat2)

The fMRI data for both subjects consist of a single slice from a volume. A brain mask has been applied to vectorize the data, forming a $T\times V$ (time by voxels or vertices) data matrix.

dim(Dat1)
dim(Dat2)

We next run clever on both datasets. We could just measure leverage using the variance-based PC projection:

clever.Dat1.var.lev = clever(Dat1, verbose=TRUE, DVARS=FALSE, lev_images=FALSE)
clever.Dat2.var.lev = clever(Dat2, verbose=TRUE, DVARS=FALSE, lev_images=FALSE)

plot(clever.Dat1.var.lev, plot_title="Dat1", show.legend=FALSE)

plot(clever.Dat2.var.lev, plot_title="Dat2", show.legend=FALSE)

We could also use all combinations of projection and outlyingness methods and compare them:

clever.Dat1 = clever(Dat1, projection="all", out_meas="all", 
                     lev_images=TRUE, verbose=TRUE)
clever.Dat2 = clever(Dat2, projection="all", out_meas="all", 
                     lev_images=TRUE, verbose=TRUE)

Here are the outliers for the first dataset:

plot(clever.Dat1, "all", plot_title="Dat1")

And for the second:

plot(clever.Dat2, "all", plot_title="Dat2")

For the first dataset, clever identifies outliers at timepoints 59-61 and 150-151 consistently across most methods. A few other time points are flagged as well. There seems to be strong agreement between leverage scrubbing and DVARS. Weaker outliers around the 40th time point are identified in the second dataset, and DVARS flags a couple frames in this region as well. Overall, these results are consistent with our prior knowledge of both datasets.

Image reconstruction

We can reconstruct the original fMRI images with the mask used for vectorizing the data. See Matrix_to_VolumeTimeSeries in clever/R/visualize.R for a helper function to do this.

library(oro.nifti)
library(neurobase)

#'  Selects a timepoint from a volume time series, and returns it after adding
#'  the NIfTI header from the mask onto it.
#' @param VolumeTimeSeries A 4D matrix. Time is on the 4th dimension.
#' @param time The timepoint to select.
#' @param mask The corresponding mask.
#'
#' @return The 3D volume with the NIfTI header from the mask.
Volume_to_NIfTI <- function(VolumeTimeSeries, time, mask){
  vol <- VolumeTimeSeries[,,,time]
  vol <- copyNIfTIHeader(img=mask, arr=vol)
  return(vol)
}

fname = system.file("extdata", "Dat1_mask.nii.gz", package = "clever")
Mask1 = readNIfTI(fname) #Pitt_0050048 (full of artifacts)
Img1 = Matrix_to_VolumeTimeSeries(Dat1, Mask1)

fname = system.file("extdata", "Dat2_mask.nii.gz", package = "clever")
Mask2 = readNIfTI(fname)
Img2 = Matrix_to_VolumeTimeSeries(Dat2, Mask2)

Below, we compare the timepoint of median leverage (first) to the timepoint of maximum leverage (second). We choose to use the kurtosis and leverage parameter settings.

par(mfrow=c(1,2))
levs = clever.Dat1$outlier_measures$PCA_var__leverage
t_med = order(levs)[ceiling(length(levs)/2)]
t_max = which.max(levs)

image(Img1[,,,t_med], main=paste0('Median leverage (T = ', t_med, ')'))
image(Img1[,,,t_max], main=paste0('Maximum leverage (T = ', t_max, ')'))

The median time point appears normal, whereas the most outlying time point clearly has banding artifacts.

Leverage images

clever can also display the "leverage images" for each outlying observation. There are two types: the composite of the selected PC directions, weighed by the scores for that observation (without scaling by variance), and the single PC direction with the highest score at that observation. To solve for these images, be sure to use lev_images=TRUE (already the default option).

Lev_Img1 = clever.Dat1$outlier_lev_imgs$PCA_var__leverage
print('The timepoints meeting the first outlier level threshold:')
paste(row.names(Lev_Img1$mean), collapse=", ")

Here are the leverage images at the 59th timepoint for the first dataset:

par(mfrow=c(1,2))

# Constant voxels are deleted during the clever algorithm, so the leverage images will have
# missing values where the constant voxels were. The NA_fill option is used here to make
# these voxels have the same color as the background (out-of-mask voxels).
Lev_Img1.mean = Matrix_to_VolumeTimeSeries(Lev_Img1$mean, Mask1, NA_fill=0)
Lev_Img1.top = Matrix_to_VolumeTimeSeries(Lev_Img1$top, Mask1, NA_fill=0)

image(Lev_Img1.mean[,,,1], main=paste0('Lev. img., mean dir. (T=59)'))
image(Lev_Img1.top[,,,1], main=paste0('Lev. img., top dir. (T=59)'))