knitr::opts_chunk$set(
  fig.width = 7,
  fig.height = 5,
  collapse = TRUE,
  comment = "#>"
)
The AlpsLCMS package was written with two purposes in mind: functions written for data analysts and LC-MS scientists are prefixed with lcms_, while higher-level functions written for IT pipeline builders are prefixed with pipe_. The main reason why all exported functions have a prefix is to make it easy for the user to discover the functions from the package. By typing lcms_, RStudio will return the list of exported functions. In the R terminal, lcms_ followed by the tab key (⇥) twice has the same effect. Other popular packages follow similar approaches (e.g. forcats: fct_*, stringr: str_*).
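Thanks to this naming convention, the functions intended for each audience can also be listed programmatically. A minimal sketch (it only assumes that AlpsLCMS is installed and attached):

library(AlpsLCMS)
# Exported functions intended for data analysts and LC-MS scientists:
ls("package:AlpsLCMS", pattern = "^lcms_")
# Exported functions intended for IT pipeline builders:
ls("package:AlpsLCMS", pattern = "^pipe_")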
This vignette is written for the first group. It assumes some prior basic knowledge of LC-MS and data analysis, as well as some basic R programming. In case you are interested in building pipelines with this package, you may want to open the example pipeline file shipped with the package (run this on your computer to locate it):
pipeline_example <- system.file("pipeline-rmd", "pipeline_example.R", package = "AlpsLCMS")
print(pipeline_example)
library(AlpsLCMS)
First, we need the spectra, ideally in ".mzXML" or ".cdf" format, to create the lcms_dataset. We can manually convert ".raw" files into ".mzXML" using RawConverter or ProteoWizard externally and save them within the same input directory. In ".cdf" files, we need to set the polarity manually (a sketch for doing so follows the loading step below).
input_dir <- "C:/Users/hgracia/Desktop/IBEC/data/positive"
path <- input_dir

# The format of LC-MS files to list (or list and convert).
file_format <- "mzXML"

# Directory where the RawConverter is.
# We recommend converting files outside R and setting this to NULL.
rawconverter <- "C:/Users/hgrac/AppData/Local/Apps/2.0/0W3LE4AA.3V9/HAK1HQ15.LLV/rawc..tion_0000000000000000_0001.0001_1563e82fa65e7a77/RawConverter.exe"
#rawconverter <- "C:/Users/fmadrid/Documents/IBEC/NESTLE/LCMS/RawConverter_x64/RawConverter.exe"
samples_mzxml <- lcms_list_mzxml_samples(path,
                                         file_format = file_format,
                                         rawconverter = rawconverter)
samples_mzxml <- as.character(samples_mzxml)
# Be careful setting the mode to "onDisk" when you apply this function.
dataset <- suppressWarnings(lcms_read_samples(samples_mzxml, mode = "onDisk"))
message("Is the dataset correctly loaded? \n")
cat("\n")
print(dataset)
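As mentioned above, ".cdf" files do not encode the polarity, so it has to be set by hand on the loaded object. This is only a sketch: it assumes the dataset is an MSnbase-style on-disk object whose feature data exposes a polarity column via fData() (1 = positive, 0 = negative), and that positive mode is the right choice for your data; check your own object before applying it.

# Only needed for ".cdf" files: set the polarity manually (1 = positive, 0 = negative).
# fData(dataset)$polarity <- 1L
# table(fData(dataset)$polarity)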
To merge the metadata, an Excel or CSV file is required. Its first column (called "sampleNames") must contain the same names as the LC-MS files, including the file extension (e.g. Sample1.mzXML). It also requires a column (called "treatment") with the class of each sample. Ensure you use this specific column name, "treatment". Be careful with the metadata content: characters such as "-", "/", " " (space), names starting with numbers, etc. lead to problems. Therefore, the function replaces the pattern [\\\"\\s/\\\\,;.:|#@$%&?!*%+-=><^'(){}\\[\\]]+ with _. Avoid other special characters and replace them with standard ASCII characters.
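As a reference, this is roughly what a minimal metadata table could look like. The file and group names here are hypothetical; you can build the table in Excel/CSV or directly in R:

# Hypothetical minimal metadata table
metadata_example <- data.frame(
  sampleNames = c("Sample1.mzXML", "Sample2.mzXML", "Sample3.mzXML"),
  treatment   = c("control", "control", "treated")
)
metadata_example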
# Create a metadata for this example dataset
metadata <- readxl::read_excel("C:/Users/hgracia/Desktop/IBEC/data/20200128_Sample_Info_adapted.xlsx")
dataset <- lcms_meta_add(dataset, metadata, by = "sampleNames")
phData(dataset)
tics <- lcms_tics(dataset, treatment = "treatment")
lcms_plot_tics(tics, treatment = treatment, plot_type = "spec")
lcms_plot_tics(tics, treatment = treatment, plot_type = "boxplot")
For coherence within the pipeline, time is measured in minutes. The XCMS and Autotuner packages work in seconds by default, while CAMERA works in minutes. Under the hood they still do so, but we present all our results in minutes.
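Before choosing a retention time range for filtering, it can help to check the range actually covered by the data. A small sketch, assuming the dataset is an MSnbase/xcms on-disk object, where rtime() returns retention times in seconds:

# Retention time range of the raw data, converted from seconds to minutes
range(MSnbase::rtime(dataset)) / 60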
# Range of the retention time (minutes) and m/z to include in further analyses
rt <- c(1.6, 15.5)
ms <- c(350, 900)
#rt <- c(50, 60)
#ms <- c(200, 500)
dataset_shorter <- lcms_filter_rt_min(dataset, rt = rt)
dataset_shorter <- lcms_filter_mz(dataset_shorter, mz = ms)
tics <- lcms_tics(dataset_shorter, treatment = "treatment")
lcms_plot_tics(tics, treatment = treatment, plot_type = "spec")
prep_parm_p <- NULL
prep_parm_p$peakwidth <- c(20, 80)
prep_parm_p$noise <- 5000
prep_parm_p$snthresh <- 3
prep_parm_p$prefilter <- c(6, 5000)
prep_parm_p$centerSample <- "wMean"
prep_parm_p$integrate <- 2
prep_parm_p$mzdiff <- -0.001
prep_parm_p$profStep <- 0.005
prep_parm_p$minFraction <- 0.2
prep_parm_p$ppm <- 25
prep_parm_p$mzCenterFun <- "wMean"
prep_parm_p$fitgauss <- FALSE
prep_parm_p$verbose.columns <- FALSE

classes <- dataset@phenoData@data[["treatment"]]
peakdet <- find_peaks_cwp(dataset, params = prep_parm_p)

message("Number of detected peaks")
peakdet@msFeatureData[["chromPeakData"]]@nrows
message("")
message("Parameters")
peakdet@.processHistory[[1]]@param

xcms::plotChromPeakImage(peakdet)
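For readers used to xcms: judging by its name and parameter list, find_peaks_cwp is assumed here to run CentWave peak detection. Purely as an illustration (not the package's own code), a rough direct-xcms sketch with the same settings would be:

# Hypothetical direct-xcms sketch of CentWave detection with the settings above
cwp <- xcms::CentWaveParam(
  ppm         = prep_parm_p$ppm,
  peakwidth   = prep_parm_p$peakwidth,
  snthresh    = prep_parm_p$snthresh,
  prefilter   = prep_parm_p$prefilter,
  mzCenterFun = prep_parm_p$mzCenterFun,
  integrate   = as.integer(prep_parm_p$integrate),
  mzdiff      = prep_parm_p$mzdiff,
  fitgauss    = prep_parm_p$fitgauss,
  noise       = prep_parm_p$noise
)
# peakdet_xcms <- xcms::findChromPeaks(dataset, param = cwp)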
Peak correspondence is carried out by the groupPeaks method, with parameters obtained from IPO. Peak correspondence consists of grouping peaks along the retention time axis in order to associate them with spectra on the mass/charge axis. After this stage we finally have a peak table. Note that the retention time alignment performed below requires a previous grouping.
new_params <- PeakDensityPar(sampleGroups = classes, binSize = 0.6)
peakgrouped <- groupPeaks(peakdet, param = new_params)
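For xcms users, PeakDensityPar and groupPeaks are assumed here to mirror xcms' density-based correspondence. The following is only an illustrative sketch with the same settings, not the package's own code:

# Hypothetical direct-xcms sketch of density-based peak correspondence
pdp <- xcms::PeakDensityParam(sampleGroups = classes, binSize = 0.6)
# peakgrouped_xcms <- xcms::groupChromPeaks(peakdet, param = pdp)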
message("Number of detected peaks") peakdet@msFeatureData[["chromPeakData"]]@nrows
pgp <- PeakGroupsPar(minFraction = 0.8,
                     extraPeaks = 1,
                     smooth = "loess",
                     span = 0.4,
                     family = "gaussian")

## Get the peak groups that would be used for alignment.
xdata_align <- adjustRT(peakgrouped, param = pgp)
rt_plot <- lcms_retention_time_alignment_plot(xdata_align)
rt_plot

## REGROUPING
new_params <- PeakDensityPar(sampleGroups = classes, bw = 30, minFraction = 0.4)
peakgrouped <- groupPeaks(xdata_align, param = new_params)
Image plot of the chromatographic peaks detected per sample, before and after retention time alignment and grouping:
lcms_plot_chrom_peak_image(peakdet,
                           binSize = 5,
                           xlim = NULL,
                           log = FALSE,
                           xlab = "retention time (min)",
                           yaxt = par("yaxt"),
                           main = "Detected Peaks (unprocessed)")

lcms_plot_chrom_peak_image(peakgrouped,
                           binSize = 5,
                           xlim = NULL,
                           log = FALSE,
                           xlab = "retention time (min)",
                           yaxt = par("yaxt"),
                           main = "Detected Peaks (processed)")
Finally, in the imputation stage, we integrate the areas of the peaks that are missing from the peak table because they were not detected in the previous steps of the signal preprocessing workflow. This stage is important to ease the subsequent statistical and machine learning stages.
message("Missing values found in the processed dataset: ", sum(is.na(featureValues(peakgrouped)))) peakgrouped_imp <- lcms_fill_chrom_peaks(peakgrouped) cat("Imputing values...\n") message("Missing values found after fill_chrom_peaks: ", sum(is.na(featureValues(peakgrouped_imp))))
The function feature_values creates an intensity matrix with all the features. However, since this is untargeted metabolomics, the column names are simply FT1, FT2, FT3, ... (FT = feature), and each feature needs to be looked up in the feature definitions (which provide its mass and retention time) once that feature turns out to be significantly different between groups.
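For example, once a feature of interest has been identified, its m/z and retention time can be retrieved from the feature definitions. A small sketch; the feature name "FT001" is just a placeholder, and rtmed is reported in seconds:

# Look up the m/z and retention time of one feature (placeholder name "FT001")
fd <- featureDefinitions(peakgrouped_imp)
fd["FT001", c("mzmed", "rtmed")]   # rtmed is in seconds
fd["FT001", "rtmed"] / 60          # converted to minutes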
xdata <- feature_values(peakgrouped_imp,
                        method = "maxint",
                        value = "into",
                        filled = TRUE,
                        missing = "rowmin_half")
xdata <- t(xdata)

feature <- featureDefinitions(peakgrouped_imp)
feature <- feature@listData
featNames <- paste0(feature$mzmed, "_", feature$rtmed)
colnames(xdata) <- featNames

message("Missing values in the feature table: ", sum(is.na(xdata)))
# Get mz and rt columns for the feature table
mz <- colnames(xdata) %>%
  stringr::str_split(., "\\_") %>%
  lapply(., function(x) x[1]) %>%
  unlist() %>%
  as.numeric()

# You can also get rt
rt <- colnames(xdata) %>%
  stringr::str_split(., "\\_") %>%
  lapply(., function(x) x[2]) %>%
  unlist() %>%
  as.numeric()
rt <- rt / 60
st <- getRamSt(peakgrouped_imp)
sr <- 0.6

# List of adducts for do.findmain
#adducts_list <- c("[M+H-H2O]+")
adducts_list <- c()

## Building the defineExperiment manually
## Change for your convenience (e.g. GC-MS)
value <- c(rep("fill", 4), "LC-MS")
design <- as.data.frame(value)
rownme <- c("Experiment", "Species", "Sample", "Contributer", "platform")
rownames(design) <- rownme

value <- c(rep("fill", 13), "1")
instrument <- as.data.frame(value)
rownm <- c("chrominst", "msinst", "column", "solvA", "solvB", "CE1", "CE2",
           "mstype", "msmode", "ionization", "colgas", "msscanrange",
           "conevol", "MSlevs")
rownames(instrument) <- rownm

Experiment <- list(design = design, instrument = instrument)
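If you prefer not to build the experiment description by hand, RAMClustR itself can generate an equivalent pre-filled skeleton. This is only a sketch and may need adjusting for your RAMClustR version:

# Hypothetical alternative: let RAMClustR create a pre-filled experiment description
# (force.skip = TRUE is meant to avoid the interactive prompts)
# Experiment <- RAMClustR::defineExperiment(force.skip = TRUE)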
RC <- clustering(xcmsObj = peakgrouped_imp,
                 featdelim = ".",
                 st = st,
                 sr = sr,
                 ExpDes = Experiment,
                 normalize = "TIC",
                 deepSplit = TRUE,
                 sampNameCol = 1,
                 mspout = FALSE,
                 fftempdir = getwd())

RC <- do_findmain(RC,
                  nls = c("[M+H-H2O]+"),
                  mode = "positive",
                  mzabs.error = 0.01,
                  ppm.error = 10,
                  plot.findmain = FALSE,
                  writeMat = FALSE,
                  writeMS = FALSE)
#labeled_adducts <- labelling(RC)
#representative_ions <- labeled_adducts$representative_ions
#xdata_reduced <- feature_reduction(xdata, representative_ions, RC)

reduced_object <- feature_reduction(RC, xdata, loaded = FALSE)
xdata_reduced <- reduced_object$xdata_reduced

dim(xdata)
dim(xdata_reduced)
y_column <- dataset_shorter@phenoData@data[["treatment"]]

# TODO: computing the model performances fails, but it is due to the very small number of samples
# TODO: try with the Nestle positive dataset
# TODO: try to make the performance estimation avoid folds

# y_column = dataset@phenoData@data[["treatment"]]
# dataset <- xdata_reduced
# train_index = c(1,2,3,7,8,9)
# ncomp = 3
# nbootstrap = 10

# bp_results <- bp_VIP_analysis(xdata_reduced,
#                               train_index = c(1,2,3,7,8,9),
#                               y_column,
#                               ncomp = 3,
#                               nbootstrap = 10)

# # Bootstrap and permutation method in k-fold cross validation
# bp_results <-
#   bp_kfold_VIP_analysis(xdata_reduced,       # Data to be analyzed
#                         y_column = y_column, # list of labels that indicates the class
#                         k = 4,
#                         ncomp = 2,
#                         nbootstrap = 30)
#
# #bp_results$vip_score_plot
# stat <- function(x){stats::wilcox.test(x ~ classes, xdata_reduced)$p.value}
# abcd <- data.frame(apply(FUN = stat,
#                          MARGIN = 2,
#                          X = xdata_reduced))
# colnames(abcd) <- c("p_Wilc")
# abcd$id <- row.names(abcd)
# result <- abcd
# summary(result$p_Wilc)
# message("\nNumber of features < 0.05 nominal p-value ",
#         sum(result$p_Wilc < 0.05))
# head(result[result$p_Wilc < 0.05, 1])
#
# # FDR
# fdr.wilcox <- stats::p.adjust(result$p_Wilc, method = "fdr")
# result <- cbind(result, fdr.wilcox)
# message("\nNumber of features fdr-corrected p value of < 0.05 is ",
#         sum(result$fdr.wilcox < 0.05))
# # We use the selected vips
# univ_feat <- result[result$fdr.wilcox < 0.05, "id"]
#
# # Untargeted assignation
# # Creating mz column
# mzr <- univ_feat %>%
#   stringr::str_split(., "\\_") %>%
#   lapply(., function(x) x[1]) %>%
#   unlist() %>%
#   as.numeric()
#
# all.equal(length(univ_feat), length(mzr))
#
# tdata_reduced_univ <- data.frame(mz = mzr)
# if (dim(tdata_reduced_univ)[1] > 0) {
#   result_POS_HMDB_univ <- assignation_pos_HMDB(tdata_reduced_univ)
#   head(result_POS_HMDB_univ)
# } else {
#   message("There are no significant features in the Wilcoxon test")
# }
# # Untargeted assignation
# # Creating mz column
#
# mzr <- colnames(as.matrix(xdata)) %>%
#   stringr::str_split(., "\\_") %>%
#   lapply(., function(x) x[1]) %>%
#   unlist() %>%
#   as.numeric()
#
# all.equal(dim(xdata)[2], length(mzr))
#
# tdata_reduced_all_features <- data.frame(mz = mzr)
# result_POS_HMDB_all_features <- assignation_pos_HMDB(tdata_reduced_all_features)
# head(result_POS_HMDB_all_features)
This vignette shows many of the features of the package. Some features have room for improvement, others are not fully described, and the reader will need to browse the documentation. Hopefully, it is a good starting point for using the package.