knitr::opts_chunk$set(
  fig.width = 7,
  fig.height = 5,
  collapse = TRUE,
  comment = "#>"
)

The AlpsLCMS package was written with two purposes in mind: to help data analysts and LC-MS scientists who are not necessarily experienced R programmers, and to help IT pipeline builders who want to assemble automated analysis pipelines.

Functions from this package written for data analysts and LC-MS scientists are prefixed with lcms_, while higher-level functions written for IT pipeline builders are prefixed with pipe_. The main reason why all exported functions have a prefix is to make it easy for the user to discover the functions from the package. By typing lcms_, RStudio will return the list of exported functions. In the R terminal, typing lcms_ followed by the tab key (⇥) twice will have the same effect. Other popular packages follow similar approaches (e.g. forcats: fct_*, stringr: str_*).
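
For instance, the exported functions that follow this naming convention can also be listed programmatically; a minimal sketch, assuming AlpsLCMS is installed:

# List every exported function whose name starts with "lcms_"
grep("^lcms_", getNamespaceExports("AlpsLCMS"), value = TRUE)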

This vignette is written for the first group. It assumes some prior basic knowledge of LC-MS and data analysis, as well as some basic R programming. If you are interested in building pipelines with this package, you may want to open the example pipeline file shipped with the package (run this on your computer to locate it):

pipeline_example <- system.file("pipeline-rmd", "pipeline_example.R", package = "AlpsLCMS")
print(pipeline_example)
library(AlpsLCMS)

NODE 1: Data wrangling

Convert RAW to mzXML

First, we need the spectra, ideally in ".mzXML" or ".cdf" format, to create the lcms_dataset. We can manually convert ".raw" files into ".mzXML" using RawConverter or ProteoWizard externally and save them within the same input directory. For ".cdf" files, we need to set the polarity manually.
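
As an illustration, the external conversion can also be scripted from R. This is only a sketch, assuming ProteoWizard's msconvert is installed and available on the system PATH; the file name and output directory are placeholders:

# Convert a vendor .raw file to .mzXML with ProteoWizard's msconvert
# (placeholder file name and output directory; adjust to your setup)
system2("msconvert", args = c("Sample1.raw", "--mzXML", "-o", "C:/path/to/mzxml"))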

Input

input_dir <- "C:/Users/hgracia/Desktop/IBEC/data/positive"
path <- input_dir
# The format of LC-MS files to list (or list and convert).
file_format <- "mzXML"

# Directory where the RawConverter is.
# We recommend converting files outside R and setting this to NULL.
rawconverter <- "C:/Users/hgrac/AppData/Local/Apps/2.0/0W3LE4AA.3V9/HAK1HQ15.LLV/rawc..tion_0000000000000000_0001.0001_1563e82fa65e7a77/RawConverter.exe"
#rawconverter <- "C:/Users/fmadrid/Documents/IBEC/NESTLE/LCMS/RawConverter_x64/RawConverter.exe"

Code to run: list of samples

samples_mzxml <- lcms_list_mzxml_samples(path, file_format = file_format,
                                         rawconverter = rawconverter)
samples_mzxml <- as.character(samples_mzxml)

Code to run (II): creating the dataset object

# Be careful when setting the mode: "onDisk" keeps the spectra on disk and loads them on demand, reducing memory usage.
dataset <- suppressWarnings(lcms_read_samples(samples_mzxml, mode = "onDisk"))
message("Is the dataset correctly loaded? \n")
cat("\n")
print(dataset)

NODE 2: Append metadata

To merge the metadata, an Excel or CSV file is required. Its first column (called "sampleNames") must contain the same names as the LC-MS files, including the file extension (e.g. Sample1.mzXML).

It also requires a column (called "treatment") containing the class of each sample. Ensure this column is named exactly "treatment".

Caution with metadata: the use of characters such as "-", "/", " " (space), or names starting with numbers leads to problems. Therefore, the function replaces characters matching [\\\"\\s/\\\\,;.:|#@$%&?!*%+-=><^'(){}\\[\\]]+ with "_". Avoid other special characters, replacing them with plain ASCII characters.
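
As an illustration of the expected metadata layout, here is a minimal sketch with made-up sample names and treatment groups:

# Minimal example of the expected metadata layout (made-up values)
metadata_example <- data.frame(
  sampleNames = c("Sample1.mzXML", "Sample2.mzXML", "Sample3.mzXML"),
  treatment = c("control", "control", "treated")
)
metadata_example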

Input parameters

# Create a metadata for this example dataset
metadata <- readxl::read_excel("C:/Users/hgracia/Desktop/IBEC/data/20200128_Sample_Info_adapted.xlsx")

Code to run

dataset <- lcms_meta_add(dataset, metadata, by = "sampleNames")
phData(dataset)

NODE 3: Total ion chromatogram

tics <- lcms_tics(dataset, treatment = "treatment")

lcms_plot_tics(tics,
               treatment = treatment, 
               plot_type = "spec")

lcms_plot_tics(tics, treatment = treatment,
               plot_type = "boxplot")

NODE 4: Filtering

Filter by retention time / m/z

For coherence within the pipeline, time is measured in minutes. The xcms and Autotuner packages work in seconds by default, while CAMERA works in minutes. Under the hood they still do, but we present all our results in minutes.
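
For example, if you ever need to pass one of these values to an xcms function directly, remember to convert minutes back to seconds first; a trivial sketch:

# The pipeline expresses retention times in minutes;
# multiply by 60 when a value in seconds is required by xcms
rt_minutes <- c(1.6, 15.5)
rt_seconds <- rt_minutes * 60
rt_seconds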

Input

# Range of the retention time (minutes) to include in further analyses
rt <- c(1.6, 15.5)
# Range of m/z values to keep
ms <- c(350, 900)

#rt = c(50, 60)
#ms = c(200, 500)

Code to run

dataset_shorter <- lcms_filter_rt_min(dataset, rt = rt)
dataset_shorter <- lcms_filter_mz(dataset_shorter, mz = ms)
tics <- lcms_tics(dataset_shorter, treatment = "treatment")

lcms_plot_tics(tics,
               treatment = treatment, 
               plot_type = "spec")

NODE 5: Optimization of parameters: IPO

TODO: Add Autotuner or a similar tool to optimize the parameters
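
As a starting point for this TODO, here is one possible sketch using the IPO Bioconductor package to optimize centWave peak-picking parameters. It assumes IPO is installed, reuses the samples_mzxml file list created earlier, and can take a long time to run:

# Optimize centWave peak-picking parameters with IPO (can be slow)
library(IPO)
peakpicking_params <- getDefaultXcmsSetStartingParams("centWave")
peakpicking_params$ppm <- 25  # narrow the search around the expected mass accuracy
ipo_result <- optimizeXcmsSet(files = samples_mzxml,
                              params = peakpicking_params,
                              subdir = NULL)
# Best peak-picking settings found by IPO
ipo_result$best_settings$parameters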

Input

prep_parm_p <- list(
  peakwidth = c(20, 80),
  noise = 5000,
  snthresh = 3,
  prefilter = c(6, 5000),
  centerSample = "wMean",
  integrate = 2,
  mzdiff = -0.001,
  profStep = 0.005,
  minFraction = 0.2,
  ppm = 25,
  mzCenterFun = "wMean",
  fitgauss = FALSE,
  verbose.columns = FALSE
)

classes <- dataset@phenoData@data[["treatment"]]

NODE 6: Preprocessing with xcms

Peak detection

peakdet <- find_peaks_cwp(dataset,
                          params = prep_parm_p)

message("Number of detected peaks")
peakdet@msFeatureData[["chromPeakData"]]@nrows
message("")

message("Parameters")
peakdet@.processHistory[[1]]@param

xcms::plotChromPeakImage(peakdet)

Retention Time Correction and Peak Correspondence

Peak correspondence is carried out by the 'groupPeaks' method, with parameters obtained from IPO. Peak correspondence consists of grouping peaks along the retention time axis in order to associate them with spectra on the mass/charge axis. After this stage we finally have a peak table.

It requires a previous grouping.

Correspondence

new_params <- PeakDensityPar(sampleGroups = classes, 
                               binSize = 0.6)

peakgrouped <- groupPeaks(peakdet, 
                         param = new_params)
message("Number of detected peaks")
peakdet@msFeatureData[["chromPeakData"]]@nrows

Alignment and regrouping

pgp <- PeakGroupsPar(minFraction = 0.8,
                       extraPeaks = 1, 
                       smooth = "loess",
                       span = 0.4,
                       family = "gaussian")

## Get the peak groups that would be used for alignment.
xdata_aligned <- adjustRT(peakgrouped, param = pgp)

rt_plot <- lcms_retention_time_alignment_plot(xdata_aligned)
rt_plot

## REGROUPING
new_params <- PeakDensityPar(sampleGroups = classes,
                             bw = 30,
                             minFraction = 0.4)

peakgrouped <- groupPeaks(xdata_aligned, param = new_params)

Image plots of the chromatographic peaks detected per sample, before and after retention time alignment and grouping:

lcms_plot_chrom_peak_image(peakdet, binSize = 5,
                           xlim = NULL,
                           log = FALSE,
                           xlab = "retention time (min)",
                           yaxt = par("yaxt"),
                           main = "Detected Peaks (unprocessed)")

lcms_plot_chrom_peak_image(peakgrouped, binSize = 5,
                           xlim = NULL,
                           log = FALSE,
                           xlab = "retention time (min)",
                           yaxt = par("yaxt"),
                           main = "Detected Peaks (processed)")

Imputation

Finally, in the imputation stage, we integrate the areas of the missing peaks in the peak table that were not detected in the previous steps of the signal preprocessing workflow. This stage is important to facilitate the subsequent statistical and machine learning stages.

message("Missing values found in the processed dataset: ", sum(is.na(featureValues(peakgrouped))))

peakgrouped_imp <- lcms_fill_chrom_peaks(peakgrouped)
cat("Imputing values...\n")

message("Missing values found after fill_chrom_peaks: ", sum(is.na(featureValues(peakgrouped_imp))))

NODE 8: Feature table

The function feature_values creates an intensity matrix with all the features. However, since this is untargeted metabolomics, the column names are simply FT1, FT2, FT3... (FT = feature), and each feature needs to be explored with the feature definition function (which gives its m/z and retention time) once it is found to be significantly different between groups.
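
For instance, once a given feature turns out to be significant, its m/z and retention time can be retrieved from the feature definitions; a minimal sketch, where "FT001" is just an example feature name:

# Look up the m/z and retention time of a single feature of interest
# ("FT001" is only an example feature name)
featureDefinitions(peakgrouped_imp)["FT001", c("mzmed", "rtmed")]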

Merging into a Feature Table

xdata <- feature_values(peakgrouped_imp,
                        method = "maxint",
                        value = "into",
                        filled = TRUE,
                        missing = "rowmin_half")
xdata <- t(xdata)
feature <- featureDefinitions(peakgrouped_imp)
feature <- feature@listData
featNames <- paste0(feature$mzmed, "_", feature$rtmed)
colnames(xdata) <- featNames

message("Missing values in the feature table: ",
sum(is.na(xdata)))
# Get mz and rt columns for the feature table
mz <- colnames(xdata) %>%
  stringr::str_split(.,"\\_") %>% 
  lapply(.,function(x) x[1]) %>% 
  unlist() %>% 
  as.numeric()

# You can also get the retention time (converted to minutes below)
rt <- colnames(xdata) %>%
  stringr::str_split(.,"\\_") %>% 
  lapply(.,function(x) x[2]) %>% 
  unlist() %>% 
  as.numeric()
rt <- rt/60

NODE 9: Data reduction

Input

st <- getRamSt(peakgrouped_imp)
sr <- 0.6

#List of adducts for do.findmain
#adducts_list = c("[M+H-H2O]+")
adducts_list = c()

## Building the defineExperiment manually
## Change for your convenience (e.g. GC-MS)
value <- c(rep("fill", 4), "LC-MS")
design <- as.data.frame(value)
rownme <- c("Experiment", "Species", "Sample",
            "Contributer", "platform")  
rownames(design) <- rownme

value <- c(rep("fill", 13), "1")
instrument <- as.data.frame(value)
rownm <- c("chrominst", "msinst", "column", 
           "solvA", "solvB", "CE1", "CE2", 
           "mstype", "msmode", "ionization", 
           "colgas", "msscanrange", "conevol", 
           "MSlevs")
rownames(instrument) <- rownm

Experiment <- list(design =  design, instrument = instrument)

Code to run

RC <- clustering(xcmsObj = peakgrouped_imp,
                 featdelim = ".",
                 st = st,
                 sr = sr,
                 ExpDes = Experiment,
                 normalize = "TIC",
                 deepSplit = TRUE,
                 sampNameCol = 1,
                 mspout = FALSE,
                 fftempdir = getwd())

RC <- do_findmain(RC,
                  nls = c("[M+H-H2O]+"),
                  mode = "positive",
                  mzabs.error = 0.01,
                  ppm.error = 10,
                  plot.findmain = FALSE,
                  writeMat = FALSE,
                  writeMS = FALSE)
#labeled_adducts <- labelling(RC)
#representative_ions <- labeled_adducts$representative_ions
#xdata_reduced <- feature_reduction(xdata, representative_ions, RC)

reduced_object  <- feature_reduction(RC, xdata, loaded = FALSE)
xdata_reduced <- reduced_object$xdata_reduced

dim(xdata)
dim(xdata_reduced)

NODE 10: Data analysis

TODO: adapt the bootstrap from AlpsNMR to AlpsLCMS

Bootstrap and permutation

y_column = dataset_shorter@phenoData@data[["treatment"]]
#TODO there is an error when computing the model performances, but it is due to the very small number of samples
#TODO try with the Nestlé positive dataset
#TODO try to make perf not use folds
# y_column = dataset@phenoData@data[["treatment"]]
# dataset <- xdata_reduced
# train_index = c(1,2,3,7,8,9)
# ncomp = 3
# nbootstrap = 10
# bp_results <- bp_VIP_analysis(xdata_reduced,
#                             train_index = c(1,2,3,7,8,9),
#                             y_column,
#                             ncomp = 3,
#                             nbootstrap = 10)


# # Bootstrap and permutation method in k-fold cross validation
# bp_results <-
#     bp_kfold_VIP_analysis(xdata_reduced, # Data to be analized
#                           y_column = y_column, # list of labels that indicates the class
#                           k = 4,
#                           ncomp = 2,
#                           nbootstrap = 30)
# #bp_results$vip_score_plot

Univariate analysis

# stat <- function(x){stats::wilcox.test(x ~ classes, xdata_reduced)$p.value}
# abcd <- data.frame(apply(FUN = stat,
#                          MARGIN = 2,
#                          X = xdata_reduced))
# colnames(abcd) <- c("p_Wilc")
# abcd$id <- row.names(abcd)
# result <- abcd
# summary(result$p_Wilc)
# message("\nNumber of features < 0.05 nominal p-value ", 
# sum(result$p_Wilc < 0.05))
# head(result[result$p_Wilc < 0.05, 1])
# 
# #FDR
# fdr.wilcox <- stats::p.adjust(result$p_Wilc, method = "fdr")
# result <- cbind(result, fdr.wilcox)
# message("\nNumber of features fdr-corrected p value of < 0.05 is ", 
# sum(result$fdr.wilcox < 0.05))

Univariate annotation

# # We use the selected vips
# univ_feat <- result[result$fdr.wilcox < 0.05,"id"]
# 
# # Untargeted assignation
# # Creating mz column
# mzr <- univ_feat %>%
#   stringr::str_split(.,"\\_") %>%
#   lapply(.,function(x) x[1]) %>%
#   unlist() %>%
#   as.numeric()
# 
# all.equal(length(univ_feat), length(mzr))
# 
# tdata_reduced_univ <- data.frame(mz = mzr)
# if(dim(tdata_reduced_univ)[1]>0){
#   result_POS_HMDB_univ <- assignation_pos_HMDB(tdata_reduced_univ)
#   head(result_POS_HMDB_univ)
# } else {
#   message("There are no significant features in the Wilcoxon test")
# }

Annotation of all features (positive mode)

# # Untargeted assignation
# # Creating mz column
# 
# mzr <- colnames(as.matrix(xdata)) %>%
#   stringr::str_split(.,"\\_") %>%
#   lapply(.,function(x) x[1]) %>%
#   unlist() %>%
#   as.numeric()
# 
# all.equal(dim(xdata)[2], length(mzr))
# 
# tdata_reduced_all_features <- data.frame(mz = mzr)
# result_POS_HMDB_all_features <- assignation_pos_HMDB(tdata_reduced_all_features)
# head(result_POS_HMDB_all_features)

Final thoughts

This vignette shows many of the features of the package. Some features have room for improvement, others are not fully described, and the reader will need to browse the documentation. Hopefully it is a good starting point for using the package.


