knitr::opts_chunk$set(fig.width=6, fig.height=6, fig.path='figures/')

Context

The objective of the w4m2bioc package is to facilitate the transfer of preprocessed data and metadata (i.e., after the XCMS and CAMERA steps in metabolomics) between the Galaxy-based Workflow4metabolomics infrastructure [@Giacomoni2015] and the R environment [@RCoreTeam2016]. The Galaxy modules of the Workflow4metabolomics infrastructure handle preprocessed data and metadata as three tabulated .tsv files. Within R, such data and metadata can be conveniently stored in an ExpressionSet object from the Biobase Bioconductor package [@Hubert2015]. The w4m2bioc package thus provides functions and methods to import the three .tsv files into an ExpressionSet object and to export them back from it.

The w4m2bioc package

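Preprocessed data and metadata within the Workflow4metabolomics infrastructure consist of three tabulated .tsv files: the data matrix of intensities (dataMatrix), the sample metadata (sampleMetadata), and the variable metadata (variableMetadata).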

There is no constraint regarding the content of the sampleMetadata and variableMetadata columns, to allow maximum flexibility with different types of omic data sets. The only constraints are that all three tables have row and column names (without duplicated or missing values) and that there is an exact match between the row names of the dataMatrix and sampleMetadata (sample names) on one hand, and between the column names of the dataMatrix and the row names of the variableMetadata (variable names) on the other hand.
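The constraints above can be sketched in base R on toy tables. The values, the object names, and the helper tables_are_consistent are hypothetical and are not part of w4m2bioc; the sketch also assumes that an "exact match" means identical names in the same order.

```r
# Toy tables (hypothetical values), laid out as described above:
# samples are the rows of the data matrix, variables are its columns.
dataMN <- matrix(c(1.2, 3.4, 5.6, 7.8, 9.1, 2.3),
                 nrow = 2,
                 dimnames = list(c("s1", "s2"), c("v1", "v2", "v3")))
samDF <- data.frame(gender = c("F", "M"), row.names = c("s1", "s2"))
varDF <- data.frame(mz = c(100.1, 200.2, 300.3),
                    row.names = c("v1", "v2", "v3"))

# Hypothetical helper: TRUE when names are present, non-empty, unique,
# and identical (same order) across the three tables.
tables_are_consistent <- function(dataMN, samDF, varDF) {
  names_ok <- function(x) {
    !is.null(x) && !anyNA(x) && !any(x == "") && !anyDuplicated(x)
  }
  names_ok(rownames(dataMN)) && names_ok(colnames(dataMN)) &&
    names_ok(colnames(samDF)) && names_ok(colnames(varDF)) &&
    identical(rownames(dataMN), rownames(samDF)) &&
    identical(colnames(dataMN), rownames(varDF))
}

tables_are_consistent(dataMN, samDF, varDF)
```

Reordering or renaming the rows of either metadata table makes the helper return FALSE.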

The ExpressionSet class from Bioconductor includes three slots which can be used to store these tables: the assayData, the phenoData, and the featureData.
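As an illustration of this mapping, the sketch below builds an ExpressionSet by hand from three toy tables with the Biobase constructors. The toy values and object names are hypothetical, and this is not necessarily how w4m2bioc builds the object internally.

```r
library(Biobase)

# Hypothetical toy tables: samples as rows of the data matrix.
dataMN <- matrix(c(1.2, 3.4, 5.6, 7.8, 9.1, 2.3),
                 nrow = 2,
                 dimnames = list(c("s1", "s2"), c("v1", "v2", "v3")))
samDF <- data.frame(gender = c("F", "M"), row.names = c("s1", "s2"))
varDF <- data.frame(mz = c(100.1, 200.2, 300.3),
                    row.names = c("v1", "v2", "v3"))

# assayData expects variables (features) as rows and samples as columns,
# hence the transposition of the toy matrix.
toySet <- ExpressionSet(assayData = t(dataMN),
                        phenoData = AnnotatedDataFrame(data = samDF),
                        featureData = AnnotatedDataFrame(data = varDF))

dim(exprs(toySet))  # variables x samples
pData(toySet)       # sample metadata
fData(toySet)       # variable metadata
```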

Hands-on

Package loading

Let us first load the package:

library(w4m2bioc)

Import from W4M to the ExpressionSet class

We can then build the sacSet object by reading the 3 tables containing the data intensities (dataMatrix), and the sample and variable metadata (sampleMetadata and variableMetadata, respectively).

These tabular files are located in the extdata folder of the installed w4m2bioc package, so you can have a look at them with a spreadsheet program such as Excel.

We use the readw4m function to build the object which will contain the 3 tables (one matrix of numerics and two data frames):

sacSet <- readw4m(file.path(path.package("w4m2bioc"), "extdata"))
sacSet

Notes:

  1. A warning message is printed when some variable (or sample) names in the initial tables are not syntactically valid in R (here the variable names in the dataMatrix.tsv and variableMetadata.tsv files of the package have already been formatted with the make.names function). The warning message can be hidden with the verboseL = FALSE argument. If duplicates are present, the call to readw4m generates an error.

  2. The sample and variable names can be accessed and modified by using the sampleNames and featureNames accessors of the ExpressionSet object:

library(Biobase)
varNamesVc <- featureNames(sacSet)
featureNames(sacSet) <- make.names(varNamesVc)
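To anticipate the situation described in the first note, names can be inspected with base R before import. The sketch below uses hypothetical variable names; it only illustrates make.names and anyDuplicated and makes no claim about how readw4m performs its own checks.

```r
# Hypothetical variable names as they might appear in a raw table.
rawNamesVc <- c("1-methyl", "M+H", "ok")

# make.names replaces invalid characters with "." and prefixes "X"
# to names that would otherwise start with a digit.
cleanNamesVc <- make.names(rawNamesVc)
cleanNamesVc

# Names that are modified by make.names, i.e. not syntactically valid.
rawNamesVc[rawNamesVc != cleanNamesVc]

# anyDuplicated returns 0 when all names are unique, and otherwise
# the index of the first duplicated name.
anyDuplicated(c("v1", "v2", "v1"))
```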

Check the format of the 3 tables within the ExpressionSet object

checkw4m(sacSet)

Access the 3 tables within the ExpressionSet object

We can access the dataMatrix, sampleMetadata and variableMetadata from the ExpressionSet by using the exprs, pData, and fData methods, respectively. Suppose for instance that we want to transform the intensities back to the arithmetic scale (they have been log10 transformed in the 'dataMatrix.tsv' file):

sacDataMN <- exprs(sacSet)
sacArithDataMN <- 10^sacDataMN
sacArithSet <- sacSet
exprs(sacArithSet) <- sacArithDataMN
checkw4m(sacArithSet)

Notes:

  1. In the data matrix exported from the ExpressionSet, the samples are stored as columns.

  2. The compatibility of the dimensions and sample/variable names of the new data matrix is not automatically checked during the replacement: hence we check the integrity of the object afterwards.
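The kind of check mentioned in the second note can be sketched in base R on a toy matrix. The values and the helper same_layout are hypothetical; the sketch only compares dimensions and names and is not a substitute for checkw4m.

```r
# Hypothetical log10 intensities: variables as rows, samples as columns.
logMN <- matrix(c(2, 3, 4, 5, 6, 7), nrow = 3,
                dimnames = list(c("v1", "v2", "v3"), c("s1", "s2")))

# Back to the arithmetic scale.
arithMN <- 10^logMN

# Hypothetical helper: same dimensions and same row/column names.
same_layout <- function(oldMN, newMN) {
  identical(dim(oldMN), dim(newMN)) &&
    identical(dimnames(oldMN), dimnames(newMN))
}

same_layout(logMN, arithMN)     # an element-wise transform keeps the layout
same_layout(logMN, t(arithMN))  # a transposed matrix does not
```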

Use the ExpressionSet methods (e.g. for subsetting)

We can also subset the samples and/or variables. Suppose that we would like to restrict the dataset to the female volunteers:

sacSamDF <- pData(sacSet)
sacGenderVc <- sacSamDF[, "gender"]
table(sacGenderVc)
femaleVl <- sacGenderVc == "F"
sacFemaleSet <- sacSet[, femaleVl]
sacFemaleSet
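For comparison, the same selection done by hand on separate tables requires applying the logical vector to each table. The base R sketch below uses hypothetical toy tables to show the bookkeeping that the ExpressionSet subsetting performs in a single step.

```r
# Hypothetical toy tables: variables as rows, samples as columns.
dataMN <- matrix(1:6, nrow = 2,
                 dimnames = list(c("v1", "v2"), c("s1", "s2", "s3")))
samDF <- data.frame(gender = c("F", "M", "F"),
                    row.names = c("s1", "s2", "s3"))

femaleVl <- samDF[, "gender"] == "F"

# The same logical vector must be applied to the columns of the data
# matrix and to the rows of the sample metadata.
femaleDataMN <- dataMN[, femaleVl, drop = FALSE]
femaleSamDF <- samDF[femaleVl, , drop = FALSE]

colnames(femaleDataMN)
rownames(femaleSamDF)
```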

Multivariate analysis

Multivariate analysis (e.g. PLS-DA) can be performed on our ExpressionSet object by using the ropls Bioconductor package [@Thevenot2015] and indicating the name of the sample metadata column to be used as the response:

library(ropls)
# Fit the PLS-DA model with the default graphical output
sacGenderPlsda <- opls(sacSet, "gender")
# Refit without plotting, then draw four summary plots in a 2 x 2 layout
sacGenderPlsda <- opls(sacSet, "gender", plotL = FALSE)
layout(matrix(1:4, nrow = 2, byrow = TRUE))
for (typeC in c("overview", "outlier", "x-score", "x-loading")) {
  plot(sacGenderPlsda, typeVc = typeC, parDevNewL = FALSE)
}

Export to W4M tabulated file format

To export our ExpressionSet object back to the three W4M tabulated files, we use the writew4m method:

writew4m(sacFemaleSet, filePrefixC = file.path(getwd(), "sacFemale_"))
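To give an idea of what a tabulated export involves, the base R sketch below writes a toy matrix to a tab-separated file and reads it back. The values and the file name are hypothetical, and the exact layout and file names produced by writew4m may differ.

```r
# Hypothetical toy table and file name (not the names used by writew4m).
dataMN <- matrix(c(1.5, 2.25, 3, 4.75), nrow = 2,
                 dimnames = list(c("v1", "v2"), c("s1", "s2")))
tsvFileC <- file.path(tempdir(), "toy_dataMatrix.tsv")

# col.names = NA writes an empty header cell above the row names,
# so that the header and the data rows have the same number of fields.
write.table(dataMN, file = tsvFileC, sep = "\t",
            quote = FALSE, col.names = NA)

# Read the table back, using the first column as row names.
backMN <- as.matrix(read.table(tsvFileC, header = TRUE, sep = "\t",
                               row.names = 1, check.names = FALSE))

all.equal(backMN, dataMN)
```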

Session info

Here is the output of sessionInfo() on the system on which this document was compiled:

sessionInfo()

References



ethevenot/r-w4m2bioc documentation built on May 16, 2019, 9:06 a.m.