library(pander) library(abind) write.out <- T output.dir <- params$output.dir preview <- params$is.preview me <- params$meta.data$name this.date <- params$date this.data <- params$data this.title <- params$meta.data$title if(this.title == '') { this.title <- paste0("Outliers Summary Report") } this.subtitle <- params$meta.data$subtitle show_gmcd <- show_mcd <- show_mcd.error <- F if(params$meta.data$mcd.path){ if(params$meta.data$continuous){ show_mcd <- T }else { show_gmcd <- T } }else{ show_mcd.error <- T } show_pca <- show_ca <- show_ord.error <- F if(params$meta.data$continuous){ show_pca <- T }else{ show_ca <- T }
title: "r this.title
"
subtitle: "r this.subtitle
"
author: "r me
"
date: "r this.date
"
These analyses were performed on r params$meta.data$data.name
, and included the following variables:
r paste(unlist(colnames(params$data)), collapse="\t")
.
r if(show_pca){"Principal components analysis (PCA) is a multivariate technique for matrices where, typically, rows are observations and columns are measures/variables. PCA is applied, generally, to data assumed to be continuous. PCA produces orthogonal components that are rank-ordered by their explained variance. From the PCA, each observation is assigned a score \\textit{per component} called a component score. Here, we use the singular vectors associated with the observations. Singular vectors have a mean of 0 and a norm of 1 (i.e., unity). Component scores reflect how much variance observations contribute to components (very close to zero is little variance, very far from zero is a lot of variance). The further an observation is from zero, the more likely they contribute a high amount of variance or are an outlier."}
r if(show_ca){"Correspondence analysis (CA) is a multivariate technique akin to principal components analyses, but designed for data that are generally non-negative (i.e., categorical, ordinal, contingency). Here CA is applied to a matrix with observations on the rows and measures/variables on the columns. CA produces orthogonal components that are rank-ordered by their explained variance. From the CA, each observation is assigned a score \\textit{per component} called a component score. Here, we use the singular vectors associated with the observations. Singular vectors have a mean of 0 and a norm of 1 (i.e., unity). Component scores reflect how much variance observations contribute to components (very close to zero is little variance, very far from zero is a lot of variance). The further an observation is from zero, the more likely they contribute a high amount of variance or are an outlier."}
r if(show_ord.error){"ERROR: Unrecognized ordination type."}
Ordination analyses are used as a supplement to the outlier analyses as it has been shown that ordination techniques (e.g., principal components, or correspondence analysis) can highlight outliers in one of two ways: (1) distant individuals on the first two components or (2) distant individuals on the last two components. Simple ordination techniques can help identify individuals and possible reasons why.
These analyses are conducted in three stages to identify outliers and sources of their outlierness as follows:
r if(show_mcd){"1. The minimum covariance determinant (MCD) technique was used to find the subset of observations with the smallest amount of scatter (i.e., smallest determinant) on data that were assumed to be strictly continuous, and thus each column was centered and/or z-scored,"}
r if(show_gmcd){"1. The generalized minimum covariance determinant (GMCD) technique was used to find the subset of observations with the smallest amount of scatter (i.e., smallest determinant) on data that that were not strictly continuous (e.g., categorical, ordinal, counts, or mixtures of these and continuous), and preprocessed under the assumptions of CA (just as above in ordination),"}
r if(show_mcd){"3. The CorrMax transformation to find potential sources (i.e., variables) of outlierness for each outlier; under the same preprocessing as in step 1"}
r if(show_gmcd){"3. The generalized CorrMax transformation to find potential sources (i.e., variables) of outlierness for each outlier; under the same preprocessing as in step 1"}
Through this stages and based on the user selected thresholds, individuals could be identified as: inliers, outliers on standard Mahalanobis distance, outliers on robust Mahalanobis distance, or outliers on both standard and robust Mahalanobis distances. Outliers are the most unique cases and it is essential to look into these individuals. It is strongly recommended to also inspect all types of outliers---especially those for "both".
For additional background on the MCD algorithm and CorrMax procedures see Hubert et al., (2010) and (), respectively. For an overview of the ONDRI outlier process see Sunderland et al., (2019). For the generalized extension of the MCD (with some of the background for generalized CorrMax), as well as bootstrapping see Beaton et al., (2018).
``` {r,echo=FALSE, results="asis"} analyses.description.table <- cbind( paste(params$all.res$data.nrow, params$all.res$data.ncol, sep = ", "), params$meta.data$mcd.iterations, paste0(params$meta.data$mcd.threshold100, "%"), paste0(params$meta.data$corrmax.threshold, "%"), params$meta.data$bootstrap.iterations, paste0(params$meta.data$bootstrap.threshold.robust100, "%"), paste0(params$meta.data$bootstrap.threshold.standard*100, "%") )
colnames(analyses.description.table) <- c("n, p","MCD iterations","MCD threshold","CorrMax threshold","Bootstrap Iterations", "Robust Bootstrap threshold", "Standard Bootstrap threshold") rownames(analyses.description.table) <- c("Analysis")
pandoc.table(analyses.description.table,split.cells=10,keep.line.breaks=T,caption="Multivariate outlier analyses performed")
# Results ## Ordination analyses ```r params$ordination.plot
Ordination analysis results. The first two components explain the most variance and show individuals that contribute a large amount of overall variance. The last two components explain the least variance but importantly show both variables that are generally orthogonal and individuals that are outliers because of those variables.
r if(show_gmcd) {"G"}
MCD analysesif(!is.null(params$primary.plot)){ outliers.design <- params$all.res$mcd.results$outliers.design outliers.sums <- colSums(outliers.design) outliers.sums["Robust"] <- outliers.sums["Robust"] - outliers.sums["Both"] outliers.sums["Standard"] <- outliers.sums["Standard"] - outliers.sums["Both"] liers.table <- cbind(outliers.sums,(outliers.sums/params$all.res$data.nrow)*100) colnames(liers.table) <- c("N","%") pandoc.table(liers.table,split.cells=10,keep.line.breaks=T,caption=sprintf("%s - outlier categories", paste(params$meta.data$data.name)), round=2) }
if (is.null(params$meta.data$primary.plot.skip)) { params$primary.plot }
r if(!is.null(params$meta.data$primary.plot.skip)) {params$meta.data$primary.plot.skip}
r if(show_gmcd) {"G"}
CorrMax analysesif (is.null(params$meta.data$corrmax.plot.skip)) { params$corrmax.plot }
r if(!is.null(params$meta.data$corrmax.plot.skip)) {params$meta.data$corrmax.plot.skip}
if(!preview && !is.null(params$primary.plot)) { all.res <- params$all.res mcd.results <- all.res$mcd.results rob.md <- mcd.results$dists$robust_mahal_dists stand.md <- mcd.results$dists$mahal_dists md.scores <- sqrt(cbind(stand.md, rob.md)) # store mcd outliers rownames(md.scores) <- rownames(params$data, do.NULL = F, prefix = "SUBJECT_") # add rownames if non exist corrmax.res <- all.res$corrmax.res$percentage_contributions rownames(corrmax.res) <- rownames(md.scores) # make rownames consistent with mcd outliers outliers.design.reordered <- outliers.design rownames(outliers.design.reordered) <- rownames(md.scores) # make rownames consistent with mcd outliers # reorder corrmax results in decreasing order of robust values (same for the outlier/variable ranking tables) corrmax.rob_reordered <- corrmax.res[order(md.scores[, "rob.md"],decreasing = T),] corrmax.stand_reordered <- corrmax.res[order(md.scores[, "rob.md"],decreasing = T),] outliers.design.reordered <- outliers.design.reordered[rownames(corrmax.rob_reordered), ] # remove corrmax results under the neither category corrmax.rob_reordered_drop <- corrmax.rob_reordered[-c(which(outliers.design.reordered[rownames(corrmax.rob_reordered),"Neither"]==1)), , drop = F] corrmax.stand_reordered_drop <- corrmax.stand_reordered[-c(which(outliers.design.reordered[rownames(corrmax.stand_reordered),"Neither"]==1)), , drop = F] # fill outlier ranking table and reorder on Both decreasing outlier.ranks <- cbind(round(md.scores[rownames(corrmax.rob_reordered_drop), "rob.md", drop = F],digits=2), round(md.scores[rownames(corrmax.stand_reordered_drop), "stand.md", drop = F],digits=2), outliers.design.reordered[rownames(corrmax.rob_reordered_drop), c("Both", "Robust", "Standard"), drop = F]) outlier.ranks <- outlier.ranks[order(outlier.ranks[,"Both"], decreasing = T), , drop = F] colnames(outlier.ranks)[1] <- "Sqrt Robust MD" colnames(outlier.ranks)[2] <- "Sqrt Standard MD" outlier.ranks[,c("Both", "Robust", "Standard")] <- ifelse(outlier.ranks[,c("Both", "Robust", "Standard")]==1,"YES","") # sort by variable ranking corrmax.rob_reordered_drop <- corrmax.rob_reordered_drop[rownames(outlier.ranks), , drop = F] # Between this line and line 222 are unchecked for bugs when there is only a single outlier top.contributions.per.obs <- plyr::alply(corrmax.rob_reordered_drop,1,function(i){sort(i[which(i >= params$meta.data$corrmax.threshold), drop = F],decreasing=T)}) names(top.contributions.per.obs) <- row.names(corrmax.rob_reordered_drop) # bind variable ranking output.temp <- lapply(top.contributions.per.obs,function(i){data.frame(VARIABLE=names(i),MAXIMUM_PROPORTION=i)}) output.structure <- do.call("rbind",output.temp) output.structure$SUBJECT <- rep(names(output.temp),unlist(lapply(output.temp,nrow))) rownames(output.structure) <- NULL ## cleanup rm(output.temp) output.structure <- output.structure[,c("SUBJECT","MAXIMUM_PROPORTION","VARIABLE")] output.structure$MAXIMUM_PROPORTION <- round(output.structure$MAXIMUM_PROPORTION,digits = 2) output.structure$VARIABLE <- gsub("^\\.","",output.structure$VARIABLE) # output outlier/variable ranking if(write.out){ write.csv(output.structure, paste0(output.dir, "/", params$meta.data$data.name,"_NIBS_OUTL_VariableRanking_",this.date,".csv" ),row.names = F) write.csv(outlier.ranks, paste0(output.dir, "/", params$meta.data$data.name,"_NIBS_OUTL_OutlierRanking_",this.date,".csv" ),row.names = T) } }
# Additional Information #r params$meta.data$custom.descriptor
Beaton, D., Sunderland, K. M., Levine, B., Mandzia, J., Masellis, M., Swartz, R. H., ... & Strother, S. C. (2019). Generalization of the minimum covariance determinant algorithm for categorical and mixed data types. bioRxiv, 333005.
Garthwaite, P. H., & Koch, I. (2016). Evaluating the contributions of individual variables to a quadratic form. Australian & New Zealand journal of statistics, 58(1), 99-119.
Hubert, M., & Debruyne, M. (2010). Minimum covariance determinant. Wiley Interdisciplinary Reviews: Computational Statistics, 2 (1), 36–43.
Sunderland, K. M., Beaton, D., Fraser, J., Kwan, D., McLaughlin, P. M., Montero-Odasso, M., ... & Strother, S. C. (2019). The utility of multivariate outlier detection techniques for data quality evaluation in large studies: an application within the ONDRI project. BMC medical research methodology, 19(1), 102.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.