Separates Metabolites into Likely True Metabolites and Likely Measurement Artifacts

Share:

Description

Takes a metabolomics data matrix and processes metabolites into likely artifacts versus likely true metabolites. Biological samples should follow a randomized injection order with pooled plasma samples interspersed. Columns of data should be samples and rows are metabolites. Columns must be ordered by injection order. Metabolites are first grouped by missing rate of pooled plasma and then processed based on metrics of blocky structure to identify likely artifacts. Specifically, corr_metric and run_metric are used to quantify the degree to which structure is present in the patterns of missing data. Must pass all thresholds to be considered a true metabolite.

Usage

1
2
3
4
met_proc(df, numsplit = 5, cor_rates = c(0.6, 0.65, 0.65, 0.65, 0.6),
runlengths = c(NA, 15, 15, 15, NA), mincut = 0.02, maxcut = 0.95, scut = 0.5,
ppkey = "PPP", sidkey = "X", missratecut=0.01, histcolors=c('white'), plot=TRUE,
outfile='MetProc_output')

Arguments

df

The metabolomics dataset, ideally read from the read.met function. Each column represents a sample and each row represents a metabolite. Columns should be labeled with some unique prefix denoting whether the column is from a biological sample or pooled plasma sample. For example, all pooled plasma samples may have columns identified by the prefix “PPP” and all biological samples may have columns identified by the prefix “X”. Missing data must be coded as NA. Columns must be ordered by injection order.

numsplit

The number of equal sized sections to divide metabolites into based on missing rate of pooled plasma columns. Divides the range of missing rates between mincut and maxcut into equal sections. Default is 5.

cor_rates

A vector of length equal to numsplit. Each value represents the cutoff of the correlation metric in that section. Any metabolite with a value greater than or equal to the cutoff is considred an artifact and anything less than the cutoff is considered a true metabolite. If any value is set to NA, the correlation metric will not be considered for that group. One cutoff per group. Default is c(.6, .65, .65, .65, .6).

runlengths

A vector of length equal to numsplit. Each values represents the cutoff for the longest run metric in that section. Any metabolite with a run greater than or equal to the cutoff is considered an artifact and anything less than the cutoff is considred a true metabolite. If any value is set to NA, the longest run metric will not be considered for that group. One cutoff per group. Default is c(NA, 15, 15, 15, NA).

mincut

A cutoff to specify that any metabolite with pooled plasma missing rate less than or equal to this value should be retained. Default is 0.02.

maxcut

A cutoff to specify that any metabolite with pooled plasma missing rate greater than this value should be removed. Default is 0.95.

scut

The cutoff of missingness to consider a metabolite as having data present in a given biological sample block. Relevant only to run_metric computation. Default is 0.5.

ppkey

The unique prefix of pooled plasma columns. Default is "PPP".

sidkey

The unique prefix of biological samples columns. Default is "X".

missratecut

A parameter for heatmap plots when plot=TRUE. Only metabolites with missing rates (across pooled plasma and biological samples) equal to or greater than this cutoff will be plotted. Useful to avoid plotting too may metabolites in an effort to save time. If a metabolite has a very small missing rate, plotting is uninformative as all data is present. Default is 0.01.

plot

Indicate whether you would like to obtain plots of missingness patterns and distributions of calculated metrics. Plots will be output as a PDF. Default is TRUE.

histcolors

A vector of length equal to numsplit. Each value represents the color to use for that group in the histograms of the longest run and correlation metrics for each subset of metabolites. If no color is provided, they will be colored white.

outfile

Name and path of the file to store images if plot=TRUE. Do not include file extension in the name. Default is "MetProc_output" which will save a file called MetProc_output.pdf in the current working directory.

Details

The function uses a four step process:

1. Retain all metabolites with pooled plasma missing rate below mincut and remove all metabolites with pooled plasma missing rate above maxcut.
2. Split the remaining metabolites into numsplit groups that are defined by pooled plasma missing rates. The numsplit groups will divide the range of pooled plasma missing rates evenly.
3. For each group of metabolites based on pooled plasma missing rates from step 2, calculate the correlation metric with corr_metric. Any metabolite below the cutoff for that group, defined by cor_rates, will be retained and any metabolite above will be removed.
4. For each group of metabolites based on pooled plasma missing rates from step 2, calculate the longest run metric with run_metric. Any metabolite below the cutoff for that group, defined by runlengths, will be retained and any metabolite above will be removed.

Value

keep

A dataframe of the retained metabolites

remove

A dataframe of the removed metabolites

If plot = True, a PDF file will be saved containing the correspondence between pooled plasma missing rate and sample missing rate, the distribution of the correlation metric and longest run metric in each of the groups based on pooled plasma missing rates, and heatmaps displaying the patterns of present/missing data for both the removed and retained metabolites.

See Also

See run_metric for details on the longest run metric.
See corr_metric for details on the correlation metric.
See MetProc-package for examples of running the full process.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
library(MetProc)

#Read in metabolomics data
metdata <- read.met(system.file("extdata/sampledata.csv", package="MetProc"),
headrow=3, metidcol=1, fvalue=8, sep=",", ppkey="PPP", ippkey="BPP")

#Separate likely artifacts from true signal using default settings
results <- met_proc(metdata,plot=FALSE)

#Separate likely artifacts from true signal using custom cutoffs and criteria
#Uses 5 groups of metabolites based on the pooled plasma missing rate, applies
#custom metric thersholds, sets the minimum pooled plasma missing rate to 0.05,
#sets the maximum pooled plasma missing rate to 0.95, sets the missing rate
#to consider a block of samples present at 0.6
results <- met_proc(metdata, numsplit = 5, cor_rates = c(0.4,.7,.75,.7,.4),
runlengths = c(80, 10, 12, 10, 80), mincut = 0.05, maxcut = 0.95, scut = 0.6,
ppkey = 'PPP', sidkey = 'X', plot = FALSE)

#Uses default criteria for running met_proc, but plots the results
#and saves them in a PDF in the current directory.
#Colors of the histograms set by histcolors.
#Adding plots may substantially increase running time if many
#samples are included
results <- met_proc(metdata, plot = TRUE, missratecut = 0.001, 
histcolors = c('red','yellow','green','blue','purple'))