opts_chunk$set(cache=TRUE, autodep=TRUE)


The Clever package implements the PCA leverage outlier detection method for high-dimensional (HD) data, as detailed in this manuscript:

Citation: Mejia, Amanda F., Mary Beth Nebel, Ani Eloyan, Brian Caffo, and Martin A. Lindquist. "PCA leverage: outlier detection for high-dimensional functional magnetic resonance imaging data." Biostatistics 18, no. 3 (2017): 521-536. paper link

In summary, the manuscript proposes a method to detect outlier observations in HD data by drawing on the traditional statistical ideas of PCA, leverage, and outlier detection. While the primary application is for detecting outlying time points in an fMRI scan, the method can also be applied to other forms of HD data, such as gene expression data.

Method Outline

As input, the algorithm takes a T x V matrix, Y. In our case, Y represents an fMRI run, where each row of Y is a vectorized volume, and each column represents one timepoint. Next, the algorithm performs the following steps:

  1. Normalize the Y matrix.

  2. Perform PCA on the normalized Y matrix using singular value decomposition (SVD), in order to obtain the PC score matrix, U (of dimension T x T).

  3. To reduce the dimensions, retain the first Q rows of the U matrix corresponding to the first Q < T principal components. We will refer to this submatrix as A (of dimension T x Q). Note: To choose the model order Q, we retain only components with a greater-than-average eigenvalue for the "mean" method, or components with kurtosis greater than 2 for the "kurtosis" method. The number kept will be further restricted if robust distance will be used, since it requires T to be appropriately large relative to Q for estimation of the covariance matrix.

  4. Now we can apply outlier detection on A. The primary method is PCA leverage, though we also propose an alternative called robust distance (see paper for further details). The output of either of these outlier detection methods is a T x 1 vector representing the "outlyingness" of each time point.

  5. Thresholds are used to identify the outliers. We choose 3 thresholds, with increasing level of stringency. Our function outputs the outliers associated with each threshold.


Install the package from GitHub and load it:


Tutorial Data

ABIDE is a publicly available resource of neuroimaging and phenotypic information from 1112 subjects consisting of 20 datasets collected at 16 sites (Di Martino and others, 2014). Our simulated dataset is based on resting-state fMRI scans from two subjects collected as part of the ABIDE dataset. The first dataset contains a number of artifacts; the second is relatively artifact-free.

A Simple Example

Here, we will run through a simple example. First let's pull the data, as follows:


The fMRI data for both subjects has already had a brain mask applied has been vectorized to form a $T\times V$ (time by voxels or vertices) data matrix.


We next run clever on both datasets using all possible combinations of parameters.

clever.Dat1.mean.lev = clever(Dat1)
clever.Dat2.mean.lev = clever(Dat2)
clever.Dat1.kurt.lev = clever(Dat1, choosePCs = 'kurtosis')
clever.Dat1.mean.rds = clever(Dat1, method = 'robdist_subset')
clever.Dat1.kurt.rds = clever(Dat1, choosePCs = 'kurtosis', method = 'robdist_subset')
clever.Dat1.mean.rbd = clever(Dat1, method = 'robdist')
clever.Dat1.kurt.rbd = clever(Dat1, choosePCs = 'kurtosis', method = 'robdist')

clever.Dat2.kurt.lev = clever(Dat2, choosePCs = 'kurtosis')
clever.Dat2.mean.rds = clever(Dat2, method = 'robdist_subset')
clever.Dat2.kurt.rds = clever(Dat2, choosePCs = 'kurtosis', method = 'robdist_subset')
clever.Dat2.mean.rbd = clever(Dat2, method = 'robdist')
clever.Dat2.kurt.rbd = clever(Dat2, choosePCs = 'kurtosis', method = 'robdist')

clevers.Dat1 = list(clever.Dat1.mean.lev, clever.Dat1.kurt.lev, clever.Dat1.mean.rds, clever.Dat1.kurt.rds, clever.Dat1.mean.rbd, clever.Dat1.kurt.rbd)

clevers.Dat2 = list(clever.Dat2.mean.lev, clever.Dat2.kurt.lev, clever.Dat2.mean.rds, clever.Dat2.kurt.rds, clever.Dat2.mean.rbd, clever.Dat2.kurt.rbd)

There were zero-variance voxels in both files, but neither had so much as to warrant concern.

Here are the outlier distributions for the first dataset:


plt_cell = vector('list', length=6)
plt_row = vector('list', length=3)
for(j in 0:2){
  for(i in 1:2){
    plt_cell[[j*2 + i]] = plot(clevers.Dat1[[j*2 + i]], type='n')
  r = c(plt_cell[(j*2+1):(j*2+2)], common.legend=T, legend='bottom', align='hv', ncol=2)
  plt_row[[j + 1]] =, r)

grid.arrange(grobs=plt_row, ncol=1)

And for the second:

plt_cell = vector('list', length=6)
plt_row = vector('list', length=3)
for(j in 0:2){
  for(i in 1:2){
    plt_cell[[j*2 + i]] = plot(clevers.Dat2[[j*2 + i]], type='n')
  r = c(plt_cell[(j*2+1):(j*2+2)], common.legend=T, legend='bottom', align='hv', ncol=2)
  plt_row[[j + 1]] =, r)

grid.arrange(grobs=plt_row, ncol=1)

For the first dataset, clever identifies a few outlying time ranges consistently across all parameter settings. The exception is kurtosis and leverage, where the same time ranges do have high leverage, but are not labeled as outliers since they do not surpass the cutoffs. For the second dataset, some parameter settings identify a handful of outliers, but most do not. Of those outliers identified, their time ranges are not consistent, and their outlyingness measures are fairly low. Overall, these results are consistent with our prior knowledge of both datasets.

In fact, We can reconstruct the original fMRI images with the mask used to obtain their vectorized matrix representations.

Mask1 = readNIfTI('../data/Dat1_mask.nii.gz') #Pitt_0050048 (full of artifacts)

Matrix_to_VolumeTimeSeries = function(mat, mask){
  t = nrow(mat)
  dims = c(dim(mask), t)
  in.mask = mask > 0
  nif = array(0, dim=dims)
  for(i in 1:t){
    nif[,,,i][in.mask] = mat[i,]

Img1 = Matrix_to_VolumeTimeSeries(Dat1, Mask1)

Below, we look at sagittal midsections of the first data set. We will compare the timepoint of median leverage (first) to the timepoint of maximum leverage (second). We choose to use the mean and leverage parameter settings.

Volume_to_NIfTI = function(VolumeTimeSeries, time, mask){
  vol = VolumeTimeSeries[,,,time]
  vol = copyNIfTIHeader(img=mask, arr=vol)

levs = clever.Dat1.mean.lev$leverage
t_med = order(levs)[ceiling(length(levs)/2)]
t_high = which.max(levs)
times = c(t_med, t_high)
z = floor(dim(Mask1)[2]/2)
plots = vector(mode='list', length(times))
for(i in 1:length(times)){
  image(Volume_to_NIfTI(Img1, times[i], Mask1), z=z, plane='sagittal', plot.type='single')

The median time point appears normal, whereas the most outlying time point clearly has banding artifacts.

muschellij2/clever documentation built on May 15, 2019, 3:16 p.m.