View source: R/perform.metabolite.qc.R
perform.metabolite.qc | R Documentation |
This function is a wrapper function that performs the key quality controls steps on a metabolomics data set
perform.metabolite.qc( wdata, fmis = 0.2, smis = 0.2, tpa_out_SD = 5, outlier_udist = 5, outlier_treatment = "leave_be", winsorize_quantile = 1, tree_cut_height = 0.5, PC_out_SD = 5, feature_colnames_2_exclude = NA, derived_colnames_2_exclude = NA )
wdata |
the metabolite data matrix. samples in row, metabolites in columns |
fmis |
defaulted at 0.2, this defines the feature missingness cutoff |
smis |
defaulted at 0.2, this defines the sample missingness cutoff |
tpa_out_SD |
defaulted at 5, this defines the number of standard deviation from the mean in which samples will be excluded for total peak area. Pass NA to this paramater to exclude this step. |
outlier_udist |
defaulted at 5, this defines the number of interquartile range units from the median to define a value as an outlier |
outlier_treatment |
defaulted to "leave_be". Choices are "leave_be", "winsorize", or "turn_NA", which defines how to treat outlier values prior to estimating principal components. "leave_be" will do nothing to ourlier values. "turn_NA" will turn outliers into NA and thus be median imputed for the purpose of the PCA. "winsorize" will turn NA values into the "winsorize_quantile" of all remaining (non-outlying) values at a feature. |
winsorize_quantile |
the quantile (0-1) to winzorise outlier values to, if "outlier_treatment" parameter set to "winsorize". Defaulted to 1, or the maximum value of all remaining (non-outlying) values at a feature. |
tree_cut_height |
The height at which to cut the feature|metabolite dendrogram to identify "independent" features. tree_cut_height is 1-absolute(Spearman's Rho) for intra-cluster correlations. |
PC_out_SD |
defaulted at '5', this defines the number of standard deviation from the mean in which samples will be excluded for principle components. NA is NOT an excepted paramater. |
feature_colnames_2_exclude |
names of columns to be excluded from all analysis, such as for Xenobiotics. Pass NA to this parameter to exclude this step. |
derived_colnames_2_exclude |
names of columns to exclude from all sample QC steps, including independent feature identification, which is used to generate the sample PCA. |
a list object of: (1) "wdata" qc'd data matrix, (2) "featuresumstats" a list with a (1:"table") data frame of feature summary statistics and a (2:"tree") hclust object (3) "pca" a list with a (1:"pcs") data frame of the top 10 PCs and a binary for outliers on the top 2 PCs at 3,4, and 5 SD from the mean, a (2:"varexp") vector of the variance explained for each PC, an estimate of the number of 'significant' PCs as determined by (3:"accelerationfactor") an acceleration factor and (4:"nsig_parrallel") a parrallel analysis, and finally (5:"prob_pca") the top 10 PCs derived by a probabilistic PC analysis. (4) "exclusion_data" a matrix of exclusion summary statistics
## define a covariance matrix cmat = matrix(1, 4, 4 ) cmat[1,] = c(1, 0.8, 0.6, 0.2) cmat[2,] = c(0.8, 1, 0.7, 0.5) cmat[3,] = c(0.6, 0.7, 1, 0.6) cmat[4,] = c(0.2, 0.5, 0.6,1) ## simulate some correlated data (multivariable random normal) set.seed(1110) d1 = MASS::mvrnorm(n = 250, mu = c(5, 45, 25, 15), Sigma = cmat ) set.seed(1010) d2 = MASS::mvrnorm(n = 250, mu = c(5, 45, 25, 15), Sigma = cmat ) ## simulate some random data d3 = sapply(1:20, function(x){ rnorm(250, 40, 5) }) ## define the data set ex_data = cbind(d1,d2,d3) rownames(ex_data) = paste0("ind", 1:nrow(ex_data)) colnames(ex_data) = paste0("var", 1:ncol(ex_data)) ## add in some missingness ex_data[sample(1:length(ex_data), 450)] = NA ## add in some technical error to two samples m = apply(ex_data, 2, function(x){ mean(x, na.rm = TRUE) }) ex_data[c(1,10), ] = ex_data[1, ] + (m*0.00001) ## run the quality control example_qc = perform.metabolite.qc(ex_data) ## the filtered data is found in dim(example_qc$wdata) example_qc$wdata[1:5, 1:5] ## a data frame of summary statistics head( example_qc$featuresumstats$table ) ## a hclust dendrogram can be plotted plot( example_qc$featuresumstats$tree , hang = -1) abline(h = 0.5, col = "red", lwd = 1.5) ## (median imputed) PCs for all samples can be plotted pcol = c("blue","red")[ as.factor( example_qc$pca$pcs[, "PC1_5_SD_outlier"] )] plot(example_qc$pca$pcs[,"PC1"], example_qc$pca$pcs[,"PC2"], pch = 21, cex = 1.5, bg = pcol, xlab = paste0( "PC1: Var Exp = " , round(example_qc$pca$varexp[1], d = 4)*100, "%" ) , ylab = paste0( "PC2: Var Exp = " , round(example_qc$pca$varexp[2], d = 4)*100, "%" ) ) ## A Scree plot can be generated by plot( x = 1:length(example_qc$pca$varexp), y = example_qc$pca$varexp, type = "b", pch = 21, cex = 2, bg = "blue", xlab = "PC", ylab = "Variance Explained", main = "Scree Plot") abline(v = example_qc$pca$accelerationfactor, col = "red", lwd = 2) abline(v = example_qc$pca$nsig_parrallel, col = "green", lwd = 2) ## Probablistic PCs for all samples can be plotted plot(example_qc$pca$prob_pca[,1], example_qc$pca$prob_pca[,2], pch = 21, cex = 1.5, bg = pcol, xlab = "PC1", ylab = "PC2" ) ## A summary of the exclusion statistics example_qc$exclusion_data
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.