knitr::opts_chunk$set(fig.width = 8, fig.height = 8, fig.path = 'figures/temp/')
Analysis of a phenomics (e.g. metabolomics) data set (i.e. samples times variables table of peak or bucket intensities generated by preprocessing tools such as XCMS) is aimed at mining the data (e.g. trends and outliers) and detecting features of predictive value (biomarker discovery). It comprises multiple steps including:
Post-processing (quality control, normalization and/or transformation of intensities)
Statistical analysis (univariate hypothesis testing, multivariate modeling, feature selection)
Annotation (physico-chemical properties, biological pathways)
The phenomis package addresses the two first steps, and can be combined with other packages for multivariate modeling, such as the ropls and biosigner Bioconductor packages, as described below.
Input (i.e. preprocessed) data consists of a 'samples times variables' matrix of intensities (datMatrix numeric matrix), in addition to sample and variable metadata (sampleMetadata and variableMetadata data frames). Theses 3 tables can be conveniently imported to/exported from R as tabular files:
Text and graphics can be handled with the phenomis methods by setting the two arguments:
report.c: if set to 'interactive' [default], messages are displayed interactively; if set to a character corresponding the a filename (with the '.txt' extension), messages are diverted to this file by using the sink function internally (the same file can be appended by successive methods); if set to 'none', messages are suppressed
figure.c: if set to 'interactive' [default], graphics are displayed interactively; if set to a character corresponding the a filename (with the '.pdf' extension), a pdf file with the plot is generated instead; if set to 'none', graphics are suppressed
The phenomis package can be installed from github with:
devtools::install_github("odisce/phenomis")
As an example, we will use the phenomis package to study the sacurine human cohort. The study is aimed at characterizing the physiological variations of the human urine metabolome with age, body mass index (BMI), and gender [\@thevenot_analysis_2015]. Urine samples from 184 volunteers were analyzed by reversed-phase (C18) ultrahigh performance liquid chromatography (UPLC) coupled to high-resolution mass spectrometry (LTQ-Orbitrap). Raw data are publicly available on the MetaboLights repository (MTBLS404).
This vignette describes the statistical analysis of the data set from the negative ionization mode (113 identified metabolites at MSI levels 1 or 2):
correcting: Correction of the signal drift by local regression (loess) modeling of the intensity trend in pool samples [\@dunn_procedures_2011]; Adjustment of offset differences between the two analytical batches by using the average of the pool intensities in each batch [\@van_der_kloet_analytical_2009]
Variable quality control by discarding features with a coefficient of variation above 30% in pool samples
Normalization by the sample osmolality
transforming: log10 transformation
inspecting: Computing metrics to filter out outlier samples according to the Weighted Hotellings'T2 distance [\@tenenhaus_approche_1999], the Z-score of one of the intensity distribution deciles [\@alonso_astream:_2011], and the Z-score of the number of missing values [\@alonso_astream:_2011]. A 0.001 threshold for all p-values results in the HU_096 sample being discarded
hypotesting: Univariate hypothesis testing of significant variations with age, BMI, or between genders (Student's T test with Benjamini Hochberg correction)
PCA exploration of the data set; ropls Bioconductor package [\@thevenot_analysis_2015]
clustering: Hierarchical clustering
OPLS(-DA) modeling of age, BMI and gender; ropls Bioconductor package [\@thevenot_analysis_2015]
Selection of the features which significantly contributes to the discrimination between gender with PLS-DA, Random Forest, or Support Vector Machines classifiers; biosigner Bioconductor package [\@rinaudo_biosigner:_2016]
A Galaxy version of this analysis is available W4M00001 'Sacurine-statistics' on the Workflow4metabolomics.org online infrastructure [\@guitton_create_2017]
The reading function reads the data sets and builds the ExpressionSet object. For additional information about ExpressionSet class, see the "An introduction to Biobase and ExpressionSets" documentation from the Biobase package.
sacurine.se <- reading(system.file("extdata/W4M00001_Sacurine-statistics", package = "phenomis"))
sacurine.se <- inspecting(sacurine.se)
sacurine.se <- correcting(sacurine.se, reference.vc = "pool", col_batch.c = "batch", col_injectionOrder.c = "injectionOrder", col_sampleType.c = "sampleType")
sacurine.se <- inspecting(sacurine.se)
sacurine.se <- sacurine.se[rowData(sacurine.se)[, "pool_CV"] <= 0.3, ]
sacurine.se <- sacurine.se[, colData(sacurine.se)[, "sampleType"] != "pool"] print(sacurine.se)
assay(sacurine.se) <- sweep(assay(sacurine.se), 2, colData(sacurine.se)[, "osmolality"], "/")
sacurine.se <- transforming(sacurine.se, method.c = "log10")
sacurine.se <- inspecting(sacurine.se) sacurine.se <- sacurine.se[, colData(sacurine.se)[, "hotel_pval"] >= 0.001 & colData(sacurine.se)[, "miss_pval"] >= 0.001 & colData(sacurine.se)[, "deci_pval"] >= 0.001]
Final visual check of the data before performing the statistics
phenomis::inspecting(sacurine.se)
sacurine.se <- hypotesting(sacurine.se, test.c = "ttest", factor_names.vc = "gender", adjust.c = "BH", adjust_thresh.n = 0.05)
ropls Bioconductor package (already loaded as a dependance from phenomis) [\@thevenot_analysis_2015]
sacPca <- ropls::opls(sacurine.se, info.txt = NA) ropls::plot(sacPca, parAsColFcVn = colData(sacurine.se)[, "gender"], typeVc = "x-score") ropls::plot(sacPca, parAsColFcVn = colData(sacurine.se)[, "age"], typeVc = "x-score") ropls::plot(sacPca, parAsColFcVn = colData(sacurine.se)[, "bmi"], typeVc = "x-score")
sacurine.se <- ropls::getEset(sacPca)
sacurine.se <- clustering(sacurine.se, correl.c = "spearman", clusters.vi = c(5, 3))
With the ropls Bioconductor package [\@thevenot_analysis_2015]:
sacPlsda <- ropls::opls(sacurine.se, "gender")
sacurine.se <- ropls::getEset(sacPlsda)
With the biosigner Bioconductor package [\@rinaudo_biosigner:_2016]:
sacurine.biosign <- biosigner::biosign(sacurine.se, "gender", seedI = 123) sacurine.se <- biosigner::getEset(sacurine.biosign)
This method is based on the biodb and biodbChebi R packages available on github.
Viewing the parameters from the annotating method and their default values:
phenomis::annotating_parameters()
sacurine.se <- annotating(sacurine.se, database.c = "chebi", param.ls = list(query.type = "mz", query.col = "mass_to_charge", ms.mode = "neg", mz.tol = 10, mz.tol.unit = "ppm", max.results = 3, prefix = "chebiMZ."), report.c = "none") knitr::kable(head(rowData(sacurine.se)[, grep("chebiMZ", colnames(rowData(sacurine.se)))]))
sacurine.se <- annotating(sacurine.se, database.c = "chebi", param.ls = list(query.type = "chebi.id", query.col = "database_identifier", prefix = "chebiID.")) knitr::kable(head(rowData(sacurine.se)[, grep("chebiID", colnames(rowData(sacurine.se)))]))
Loading a local (example) MS database:
msdbDF <- read.table(system.file("extdata/local_ms_db.tsv", package = "phenomis"), header = TRUE, sep = "\t", stringsAsFactors = FALSE)
Querying the local database
sacurine.se <- annotating(sacurine.se, database.c = "local.ms", param.ls = list(query.type = "mz", query.col = "mass_to_charge", ms.mode = "neg", mz.tol = 5, mz.tol.unit = "ppm", local.ms.db = msdbDF, prefix = "localMS."), report.c = "none") knitr::kable(rowData(sacurine.se)[!is.na(rowData(sacurine.se)[, "localMS.accession"]), grep("localMS.", colnames(rowData(sacurine.se)), fixed = TRUE)])
phenomis::writing(sacurine.se, dir.c = getwd())
phenomis::writing(sacurine.se, dir.c = 'figures/temp', overwrite.l = TRUE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.