perform.metabolite.qc: perform metabolomics quality control

View source: R/perform.metabolite.qc.R

perform.metabolite.qcR Documentation

perform metabolomics quality control

Description

This function is a wrapper function that performs the key quality controls steps on a metabolomics data set

Usage

perform.metabolite.qc(
  wdata,
  fmis = 0.2,
  smis = 0.2,
  tpa_out_SD = 5,
  outlier_udist = 5,
  outlier_treatment = "leave_be",
  winsorize_quantile = 1,
  tree_cut_height = 0.5,
  PC_out_SD = 5,
  feature_colnames_2_exclude = NA,
  derived_colnames_2_exclude = NA
)

Arguments

wdata

the metabolite data matrix. samples in row, metabolites in columns

fmis

defaulted at 0.2, this defines the feature missingness cutoff

smis

defaulted at 0.2, this defines the sample missingness cutoff

tpa_out_SD

defaulted at 5, this defines the number of standard deviation from the mean in which samples will be excluded for total peak area. Pass NA to this paramater to exclude this step.

outlier_udist

defaulted at 5, this defines the number of interquartile range units from the median to define a value as an outlier

outlier_treatment

defaulted to "leave_be". Choices are "leave_be", "winsorize", or "turn_NA", which defines how to treat outlier values prior to estimating principal components. "leave_be" will do nothing to ourlier values. "turn_NA" will turn outliers into NA and thus be median imputed for the purpose of the PCA. "winsorize" will turn NA values into the "winsorize_quantile" of all remaining (non-outlying) values at a feature.

winsorize_quantile

the quantile (0-1) to winzorise outlier values to, if "outlier_treatment" parameter set to "winsorize". Defaulted to 1, or the maximum value of all remaining (non-outlying) values at a feature.

tree_cut_height

The height at which to cut the feature|metabolite dendrogram to identify "independent" features. tree_cut_height is 1-absolute(Spearman's Rho) for intra-cluster correlations.

PC_out_SD

defaulted at '5', this defines the number of standard deviation from the mean in which samples will be excluded for principle components. NA is NOT an excepted paramater.

feature_colnames_2_exclude

names of columns to be excluded from all analysis, such as for Xenobiotics. Pass NA to this parameter to exclude this step.

derived_colnames_2_exclude

names of columns to exclude from all sample QC steps, including independent feature identification, which is used to generate the sample PCA.

Value

a list object of: (1) "wdata" qc'd data matrix, (2) "featuresumstats" a list with a (1:"table") data frame of feature summary statistics and a (2:"tree") hclust object (3) "pca" a list with a (1:"pcs") data frame of the top 10 PCs and a binary for outliers on the top 2 PCs at 3,4, and 5 SD from the mean, a (2:"varexp") vector of the variance explained for each PC, an estimate of the number of 'significant' PCs as determined by (3:"accelerationfactor") an acceleration factor and (4:"nsig_parrallel") a parrallel analysis, and finally (5:"prob_pca") the top 10 PCs derived by a probabilistic PC analysis. (4) "exclusion_data" a matrix of exclusion summary statistics

Examples

## define a covariance matrix
cmat = matrix(1, 4, 4 )
cmat[1,] = c(1, 0.8, 0.6, 0.2)
cmat[2,] = c(0.8, 1, 0.7, 0.5)
cmat[3,] = c(0.6, 0.7, 1, 0.6)
cmat[4,] = c(0.2, 0.5, 0.6,1)
## simulate some correlated data (multivariable random normal)
set.seed(1110)
d1 = MASS::mvrnorm(n = 250, mu = c(5, 45, 25, 15), Sigma = cmat )
set.seed(1010)
d2 = MASS::mvrnorm(n = 250, mu = c(5, 45, 25, 15), Sigma = cmat )
## simulate some random data
d3 = sapply(1:20, function(x){ rnorm(250, 40, 5) })
## define the data set
ex_data = cbind(d1,d2,d3)
rownames(ex_data) = paste0("ind", 1:nrow(ex_data))
colnames(ex_data) = paste0("var", 1:ncol(ex_data))
## add in some missingness
ex_data[sample(1:length(ex_data), 450)] = NA
## add in some technical error to two samples
m = apply(ex_data, 2, function(x){ mean(x, na.rm = TRUE) })
ex_data[c(1,10), ] = ex_data[1, ] + (m*0.00001) 

## run the quality control
example_qc = perform.metabolite.qc(ex_data)

## the filtered data is found in 
dim(example_qc$wdata)
example_qc$wdata[1:5, 1:5]
## a data frame of summary statistics
head( example_qc$featuresumstats$table )
## a hclust dendrogram can be plotted
plot( example_qc$featuresumstats$tree , hang = -1)
abline(h = 0.5, col = "red", lwd = 1.5)
## (median imputed) PCs for all samples can be plotted
pcol = c("blue","red")[ as.factor( example_qc$pca$pcs[, "PC1_5_SD_outlier"] )]
plot(example_qc$pca$pcs[,"PC1"], example_qc$pca$pcs[,"PC2"], 
     pch = 21, cex = 1.5, bg = pcol, 
     xlab = paste0( "PC1: Var Exp = " , round(example_qc$pca$varexp[1], d = 4)*100, "%" ) , 
     ylab = paste0( "PC2: Var Exp = " , round(example_qc$pca$varexp[2], d = 4)*100, "%" ) )
## A Scree plot can be generated by
plot( x = 1:length(example_qc$pca$varexp), 
      y = example_qc$pca$varexp, 
      type = "b", pch = 21, cex = 2, bg = "blue",
      xlab = "PC", ylab = "Variance Explained", main = "Scree Plot")
abline(v = example_qc$pca$accelerationfactor, col = "red", lwd = 2)
abline(v = example_qc$pca$nsig_parrallel, col = "green", lwd = 2)
## Probablistic PCs for all samples can be plotted
plot(example_qc$pca$prob_pca[,1], example_qc$pca$prob_pca[,2], 
     pch = 21, cex = 1.5, bg = pcol, 
     xlab = "PC1", 
     ylab = "PC2" )
## A summary of the exclusion statistics
example_qc$exclusion_data


MRCIEU/metaboprep documentation built on Jan. 28, 2023, 7:29 p.m.