These packages are created to make your life easier. However, life have to become more complicated sometimes to become easier at the end. Please, be patient. So am I.
I would be grateful for any feedback (bug reports, advices what to implement, etc.)
Step 0. Preparations.
Step 1. Converting QI output to a nice table.
Step 2. Uploading the data.
Step 3. Pre-processing.
Step 4. Exporting to MApckg format.
Step 5. Statistical analysis.
Step 6. Plotting.
source("http://bioconductor.org/biocLite.R") biocLite(c( "netcdf", "imagemagick", "graphviz", "Rserve", "ellipse", "scatterplot3d","pls", "caret", "multicore", "lattice", "Cairo", "randomForest", "e1071","gplots", "som", "xtable", "RColorBrewer", "xcms", "impute", "pcaMethods","siggenes", "globaltest", "GlobalAncova", "Rgraphviz","KEGGgraph", "preprocessCore", "genefilter", "pheatmap", "igraph", "RJSONIO", "SSPA", "caTools", "ROCR", "pROC", "devtools", "abind", "vsn", ask = FALSE)) devtools::install_version("pbkrtest", version="0.4-5")
devtools::install_github("flajole/MSdata") devtools::install_github("flajole/MApckg")
require(MApckg) require(MSdata)
| Use PQI_to_MA
function for proteome QI data.
| Use MQI_to_MA
function for metabolome or lipidome QI data.
You can open the particular function documentation containing the description of all the parameters and examples of usage by this command:
?MQI_to_MA
The auto-recognition of grouping factors is implemented now. For this matter group labels should be standardized in this way:
`GroupingFactor1_GroupingFactor2_GroupingFactor3_etc...`
For example:
`WT_C_0; WT_T_0; WT_C_1; WT_T_1; Mut_C_0; Mut_T_0; Mut_C_1; Mut_T_1`
Where the first factor is phenotype (wild type or mutant); the second one is controlor treatment; the third one is whatever else marker 1 or 0.
The program doesn't care about the actual meaning of all these labels! So just keep the order of factors the same through all group labels.
There could be any number of factors. But if you have a lot of important sample information (for instance, more than 3 factors) it could be inconvenient to type them all. In that case you may just ignore grouping labels and use another option - three separate tables with intensity matrix, sample and peak data (see below).
By default if there is only one factor, it is automatically called "Group"
. In all the other cases factors are called just "Factor1"
, "Factor2"
, etc. You can always change the factor names manually in the resulting table, or just set them in the function call:
MQI_to_MA("C:/QI_data", facNames = c("Species", "Phenotype", "Treatment", "Time"))
(For this example grouping labels should look like Species_Phenotype_Treatment_Time
)
Please always revise the resulting table in case something went wrong!
As for now, there are two options.
| | | | | | | | | | | | | | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | Sample | S1T0 | S1T1 | S1T2 | S2T0 | S2T1 | S2T2 | S3T0 | S3T1 | S3T2 | S4T0 | S4T1 | S4T2 | | Phenotype | WT | WT | WT | WT | WT | WT | MT | MT | MT | MT | MT | MT | | Time | 0 | 1 | 2 | 0 | 1 | 2 | 0 | 1 | 2 | 0 | 1 | 2 | | 0.4425/385 | 126.39 | 234.11 | 162.68 | 172.44 | 171.53 | 180.31 | 140.23 | 157.38 | 124.62 | 147.23 | 165.29 | 125.85 | | 0.4498/625 | 108.9 | 133.53 | 128.66 | 121.48 | 112.23 | 142.97 | 120.1 | 69.85 | 65.75 | 121.22 | 70.04 | 76.06 | | 0.4711/463 | 124.1 | 166.84 | 156.18 | 158.4 | 190.09 | 152.89 | 102.94 | 67.11 | 83.99 | 110.09 | 98.61 | 77.91 | | 0.4715/659 | 688.13 | 739.41 | 529.62 | 828.48 | 660.76 | 545.32 | 289.67 | 359.43 | 240.13 | 314.71 | 338.43 | 203.29 | | 0.4766/757 | 117.8 | 110.31 | 114.56 | 144.39 | 155.33 | 114.47 | 80.53 | 84.3 | 42.33 | 73.74 | 97.78 | 58 | | 0.4772/675 | 182.72 | 209.42 | 144.61 | 255.9 | 199.4 | 179.54 | 112.55 | 112 | 68.97 | 93.07 | 114.11 | 67.45 |
In this case you can use MSupload
function, setting up sampleDataLines
argument to the number of rows with information about samples (3 for the table above). Change orientation of intensity matrix to "SamplesInRow"
, if necessary.
msdata <- MSupload("qi_to_ma_table.csv", sampleDataLines = 3, orientation = "SamplesInCol"))
| | S1T0 | S1T1 | S1T2 | S2T0 | S2T1 | S2T2 | S3T0 | S3T1 | S3T2 | S4T0 | S4T1 | S4T2 | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | 3.8425/227 | 126.39 | 234.11 | 162.68 | 172.44 | 171.53 | 180.31 | 140.23 | 157.38 | 124.62 | 147.23 | 165.29 | 125.85 | | 4.6982/255 | 108.9 | 133.53 | 128.66 | 121.48 | 112.23 | 142.97 | 120.1 | 69.85 | 65.75 | 121.22 | 70.04 | 76.06 | | 4.0353/253 | 124.1 | 166.84 | 156.18 | 158.4 | 190.09 | 152.89 | 102.94 | 67.11 | 83.99 | 110.09 | 98.61 | 77.91 | | 4.8775/281 | 688.13 | 739.41 | 529.62 | 828.48 | 660.76 | 545.32 | 289.67 | 359.43 | 240.13 | 314.71 | 338.43 | 203.29 | | 4.2972/279 | 117.8 | 110.31 | 114.56 | 144.39 | 155.33 | 114.47 | 80.53 | 84.3 | 42.33 | 73.74 | 97.78 | 58 | | 7.9686/339 | 182.72 | 209.42 | 144.61 | 255.9 | 199.4 | 179.54 | 112.55 | 112 | 68.97 | 93.07 | 114.11 | 67.45 |
| Compound | m/z | Retention time (min) | Accepted Compound ID | Mass Error (ppm) | Retention Time Error (mins) | | --- | --- | --- | --- | --- | --- | | 3.8425/227 | 227.2032461 | 3.842533333 | FA 14:0 | 6.978613272 | -0.007466667 | | 4.6982/255 | 255.2317043 | 4.6982 | FA 16:0 | -4.87551823 | -0.0518 | | 4.0353/253 | 253.2197448 | 4.035366667 | FA 16:1 | 9.602793153 | -0.064633333 | | 4.8775/281 | 281.2483506 | 4.877583333 | FA 18:1 | -0.896403039 | -0.022416667 | | 4.2972/279 | 279.232409 | 4.297216667 | FA 18:2 | -1.943224345 | -0.102783333 | | 7.9686/339 | 339.3294283 | 7.968666667 | FA 22:0 | 7.565338681 | 0.018666667 |
| Sample | Phenotype | Time | |-------|----------|-------| | S1T0 | WT | 0 | | S1T1 | WT | 1 | | S1T2 | WT | 2 | | S2T0 | WT | 0 | | S2T1 | WT | 1 | | S2T2 | WT | 2 | | S3T0 | MT | 0 | | S3T1 | MT | 1 | | S3T2 | MT | 2 | | S4T0 | MT | 0 | | S4T1 | MT | 1 | | S4T2 | MT | 2 |
In this case you can use MSupload
function in this way, changing file paths to your own. Change orientation of intensity matrix to "SamplesInRow"
, if necessary. If there is no sample data file or peak data file, just set the file path to empty string sampleFile = ""
.
msdata <- MSupload(object = list(intFile = "intensity_matrix.csv", sampleFile = "sample_list.csv", peakFile = "peak_list.csv"), orientation = "SamplesInCol")
So, now you have object \code{msdata}. You can look through what's inside using these functions:
intMatrix(msdata) # Look at the intensity table. sampleData(msdata) # Look at the samples metadata. peakData(msdata) # Look at the peaks metadata.
Now you can apply different types of normalization/filtering/retention indexing/etc.
All the MSdata
package functions with their description, list of all possible arguments and default values of these arguments are listed in MSdata PDF manual
Remember, that you can use the functions below in different order!
There are several functions for different normalizations.
BiomassNorm
performs normalisation by the list of biomasses.msdata <- BiomassNorm(msdata, "biomass_list.txt")
StandNorm
performs normalisation by the sum intensity of certain standards. The list of the standards is provided as a file with three columns: compound, m/z, retention time. By these MZ and RT values corresponding peaks in dataset are determined and their intensities are used for normalisation.msdata <- StandNorm(msdata, "standards_list.txt")
DataNorm
performs other sample-wise normalisations (that makes): by concentration of one compound, by sum or median concentration of all compounds, etc.msdata <- DataNorm(msdata, method = "median")
DataTransform
performs different transformations (log, cube root).msdata <- DataTransform(msdata, method = "log2")
DataScaling
performs mean-centering and data rescaling to unify values and variance.msdata <- DataScaling(msdata, method = "pareto")
There are two functions.
BasicFilter
is identical to MetaboAnalyst filter. It performs ranking of all the compounds by different statistics and removes certain percent of them (depending on the total number) from dataset.msdata <- BasicFilter(msdata, method = "iqr")
PeakFilter
performs other types of filtering, described in ?PeakFilter
documentation. The default usage of this function is to remove compounds with more than certain number of NAs (40% in exaple below).msdata <- PeakFilter(msdata, min.nonNApercent = 0.4)
Methods of missing values imputations are identical to MetaboAnalyst and described in ?EvalMissVal
documentation.
msdata <- EvalMissVal(msdata, method = "ppca")
You should use MSdata_to_MA
function. Examples of usage:
dataSet <- MSdata_to_MA(msdata, designType = "regular", facA = "Phenotype")
dataSet <- MSdata_to_MA(msdata, designType = "regular", facA = "Phenotype", facB = "Treatment")
dataSet <- MSdata_to_MA(msdata, designType = "time", facA = "Phenotype", facB = "Time")
Instead of "Phenotype"
or "Treatment"
, of course, there could be any grouping factors from your sample metadata. (Kind reminder: use sampleData(msdata)
command to check sample metadata.)
Also, create an empty list for storing the analysis results:
analSet <- list()
REMEMBER: you always can read about different statistics, their outputs and etc. in detail in step-by-step MetaboAnalyst tutorial. Just open the file and use Ctrl+F.
All the MApckg
functions with their description, list of all possible arguments and default values of these arguments are listed in MApckg PDF manual
Also you can see the same documentation for particular function in R using the command \code{'?FunctionName'} like:
?ANOVA.Anal
Try it. You will see something like this:
ANOVA.Anal( dataSet, analSet, nonpar = FALSE, thresh = 0.05, post.hoc = "fisher")
That's the function with the list of arguments and their default values. Below you will see detailed description, what each parameter means:
|
nonpar
IfFALSE
- use classical ANOVA; ifTRUE
- Kruskal Wallis Test |thresh
Threshold of significance. |post.hoc
Post-hoc statistics:"tukey"
or"fisher"
All the function calls listed below use the default arguments, and they look like:
analSet <- ANOVA.Anal(dataSet, analSet)
But in case if you would like to use function with non-default parameters, just list these parameters too:
analSet <- ANOVA.Anal(dataSet, analSet, nonpar = TRUE, thresh = 0.01, post.hoc = "tukey")
By the way, you can always check the list of all statistical stuff you have done by the moment using this command:
str(analSet, max.level = 1)
analSet <- PCA.Anal(dataSet, analSet) analSet <- PCA.Loadings(dataSet, analSet) PlotPCA2DScore(dataSet, analSet) PlotPCABiplot(dataSet, analSet) PlotPCALoadings(dataSet, analSet)
analSet <- PLS.Anal(dataSet, analSet) analSet <- PLS.Loadings(dataSet, analSet) analSet <- PLSDA.CV(dataSet, analSet) analSet <- PLSDA.Permut(dataSet, analSet) PlotPLS2DScore(dataSet, analSet) PlotPLS.Classification(dataSet, analSet) PlotPLS.Permutation(dataSet, analSet) PlotPLSLoading(dataSet, analSet) PlotPLS.Imp(dataSet, analSet)
PlotCorrHeatMap(dataSet, analSet)
# Here you have to specify the name of the feature or pattern analSet <- FeatureCorrelation(dataSet, analSet, varName = "") analSet <- Match.Pattern(dataSet, analSet, pattern = "1-2-3-4") PlotCorr(dataSet, analSet)
analSet <- Kmeans.Anal(dataSet, analSet) PlotKmeans(dataSet, analSet)
analSet <- SOM.Anal(dataSet, analSet) PlotSOM(dataSet, analSet)
analSet <- PlotHCTree(dataSet, analSet)
analSet <- PlotSubHeatMap(dataSet, analSet)
analSet <- RF.Anal(dataSet, analSet) PlotRF.Classification(dataSet, analSet) PlotRF.VIP(dataSet, analSet) PlotRF.Outlier(dataSet, analSet)
analSet <- SAM.Anal(dataSet, analSet) analSet <- SetSAMSigMat(dataSet, analSet) PlotSAM.FDR(dataSet, analSet) PlotSAM.Cmpd(dataSet, analSet)
analSet <- Ttests.Anal(dataSet, analSet) PlotTT(dataSet, analSet)
# NB: don't perform FC on transformed or # rescaled data sets, it's senseless! analSet <- FC.Anal(dataSet, analSet) PlotFC(dataSet, analSet)
# NB: don't perform Volcano on transformed or # rescaled data sets, it's senseless! analSet <- Volcano.Anal(dataSet, analSet) PlotVolcano(dataSet, analSet)
analSet <- RSVM.Anal(dataSet, analSet) PlotRSVM.Classification(dataSet, analSet) PlotRSVM.Cmpd(dataSet, analSet)
analSet <- EBAM.A0.Anal(dataSet, analSet) analSet <- EBAM.Cmpd.Anal(dataSet, analSet) analSet <- SetEBAMSigMat(dataSet, analSet) PlotEBAM.A0(dataSet, analSet) PlotEBAM.Cmpd(dataSet, analSet)
analSet <- ANOVA.Anal(dataSet, analSet) PlotANOVA(dataSet, analSet)
analSet <- ANOVA2.Anal(dataSet, analSet) PlotANOVA2(dataSet, analSet)
analSet <- PlotHeatMap2(dataSet, analSet)
Plotting functions are meant above just after corresponding analytical function. However, actually, you can perform all the planned statistical analysis at once, not one by one. Just copy-paste necessary lines from the set of scripts.
Then, if you are interested in looking at particular plots, you can use corresponding functions one by one.
Otherwise, if you don't need to look at images in real-time, you could just copy-paste plotting functions too, and all the image files will be created at the same time.
See attached Template script
The best way is to look through the script and create your own one by choosing options, deleting unnecessary lines and setting all the parameters up.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.