clustering: Clustering


View source: R/clustering.R

Description

clustering is a wrapper around RAMClustR::ramclustR from the RAMClustR package. It clusters features using a given sigma for retention time similarity (st) and for correlation similarity (sr). Note that, in addition to sr, setting deepSplit = TRUE may be critical to avoid grouping several metabolites into a single cluster.
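The role of st and sr can be illustrated with a minimal base-R sketch. It assumes the Gaussian-shaped similarity described in the RAMClust paper listed under References (a retention time term and a correlation term, multiplied together); the toy data, the chosen sigma values, and the use of cutree() at a fixed height are assumptions made for illustration and are not the package's own code.

```r
# Toy data: retention times (s) and intensities over 6 samples.
# All values are invented for illustration.
rt <- c(f1 = 100.0, f2 = 100.5, f3 = 160.0, f4 = 160.4)
base_a <- c(10, 20, 30, 40, 50, 60)
base_b <- c(60, 10, 45, 20, 55, 15)
int <- cbind(
  f1 = base_a,
  f2 = base_a * 2 + c(1, -1, 1, -1, 1, -1),
  f3 = base_b,
  f4 = base_b * 0.5 + c(-1, 1, -1, 1, -1, 1)
)

st <- 2    # sigma t: retention time similarity decay (assumed value)
sr <- 0.5  # sigma r: correlation similarity decay (assumed value)

# Pairwise retention time differences and Pearson correlations
dt <- abs(outer(rt, rt, "-"))
r  <- cor(int, method = "pearson")

# Gaussian-shaped similarities, multiplied into one combined score
sim_t <- exp(-(dt^2) / (2 * st^2))
sim_r <- exp(-((1 - r)^2) / (2 * sr^2))
sim   <- sim_t * sim_r

# Average-linkage clustering on the dissimilarity 1 - sim.
# cutree() at a fixed height is a simplification swapped in here;
# the arguments below point to cutreeDynamicTree for the real tree cut.
tree <- hclust(as.dist(1 - sim), method = "average")
featclus <- cutree(tree, h = 0.3)
print(featclus)
```

In this toy case, features that co-elute and co-vary (f1 with f2, f3 with f4) fall into the same cluster, while features that elute far apart are separated regardless of their correlation. Smaller st or sr values make the corresponding term decay faster and therefore split clusters more readily.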

Usage

clustering(
  xcmsObj = NULL,
  ms = NULL,
  idmsms = NULL,
  taglocation = "filepaths",
  MStag = NULL,
  idMSMStag = NULL,
  featdelim = "_",
  timepos = 2,
  st = NULL,
  sr = NULL,
  maxt = NULL,
  deepSplit = FALSE,
  blocksize = 2000,
  mult = 5,
  hmax = NULL,
  sampNameCol = 1,
  collapse = TRUE,
  usePheno = TRUE,
  mspout = TRUE,
  ExpDes = NULL,
  normalize = "TIC",
  qc.inj.range = 20,
  order = NULL,
  batch = NULL,
  qc = NULL,
  minModuleSize = 2,
  linkage = "average",
  mzdec = 3,
  cor.method = "pearson",
  rt.only.low.n = TRUE,
  fftempdir = NULL,
  replace.zeros = TRUE
)

Arguments

...

Arguments passed on to RAMClustR::ramclustR

xcmsObj

xcmsObject: containing grouped feature data for clustering by ramclustR

ms

filepath: optional csv input. Features as columns, rows as samples. Column header mz_rt

idmsms

filepath: optional idMSMS / MSe csv data. The same dimensions and names as ms are required

taglocation

character: "filepaths" by default; "phenoData[,1]" is another option. Refers to the xcms slot

MStag

character: character string in 'taglocation' to designate MS / MSe files, e.g. "01.cdf"

idMSMStag

character: character string in 'taglocation' to designate idMSMS / MSe files, e.g. "02.cdf"

featdelim

character: how feature mz and rt are delimited in the csv import column header, e.g. "-"

timepos

integer: which position in delimited column header represents the retention time (csv only)

st

numeric: sigma t - time similarity decay value

sr

numeric: sigma r - correlational similarity decay value

maxt

numeric: maximum time difference for which retention time similarity is calculated; all values beyond this are assigned a similarity of zero

deepSplit

logical: controls how aggressively the HCA tree is cut - see ?cutreeDynamicTree

blocksize

integer: number of features (scans?) processed in one block. The usage above sets blocksize = 2000.

mult

numeric: internal value, can be used to influence processing speed/ram usage

hmax

numeric: precut the tree at this height, default 0.3 - see ?cutreeDynamicTree

sampNameCol

integer: which column from the csv file contains sample names?

collapse

logical: reduce feature intensities to spectrum intensities?

usePheno

logical: transfer phenotype data from the XCMS object to the SpecAbund dataset?

mspout

logical: write msp formatted spectra to file?

ExpDes

ExpDes object: an R object holding the experimental design; the data are used for record keeping and labelling msp spectral output

normalize

character: either "none", "TIC", "quantile", or "batch.qc" normalization of feature intensities. See the batch.qc overview in Details.

qc.inj.range

integer: how many injections around each injection are scanned for the presence of QC samples when using batch.qc normalization? A good rule of thumb is between 1 and 3 times the typical injection span between QC injections, i.e. if you inject a QC every 7 samples, set this to between 7 and 21. Smaller values provide more local precision but make normalization sensitive to individual poor outliers (though these are first removed using the boxplot function outlier detection), while wider values provide less local precision in normalization but better stability to individual peak areas.

order

integer vector with length equal to number of injections in xset or csv file

batch

integer vector with length equal to number of injections in xset or csv file

qc

logical vector with length equal to number of injections in xset or csv file.

minModuleSize

integer: how many features must be part of a cluster to be returned? default = 2

linkage

character: hierarchical clustering linkage method - see ?hclust

mzdec

integer: number of decimal places used in printing m/z values

cor.method

character: which correlation method is used to calculate 'r' - see ?cor

rt.only.low.n

logical: default = TRUE. At low injection numbers, correlational relationships of peak intensities may be unreliable. By default ramclustR will simply ignore the correlational r value and cluster on retention time alone. If you wish to use correlation at n < 5, set this value to FALSE.

fftempdir

valid path: if there are file size limitations on the default ff package temp directory (getOption('fftempdir')), you can change the directory used as the fftempdir with this option.

replace.zeros

logical: TRUE by default. NA, NaN, and Inf values are replaced with zero, and zero values are sometimes returned from peak picking. When TRUE, zero values are replaced with a small amount of noise, with the noise level set based on the detected signal intensities for that feature.
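The featdelim, timepos, and mzdec arguments describe how csv column headers of the form mz_rt are interpreted. The following base-R sketch shows one way such headers could be split; the header strings are hypothetical, and the parsing shown is an illustration of the convention rather than the code ramclustR runs internally.

```r
# Hypothetical csv column headers in "mz_rt" form
headers <- c("180.0634_312.4", "203.0526_312.9", "365.1054_95.2")

featdelim <- "_"  # delimiter between mz and rt in the header
timepos   <- 2    # position of the retention time in the split header
mzdec     <- 3    # decimal places used when printing m/z

# Split every header on the delimiter (fixed = TRUE: no regex)
parts <- strsplit(headers, featdelim, fixed = TRUE)

# Retention time sits at timepos; with two fields, m/z is the other one
mzpos <- if (timepos == 2) 1 else 2
rt <- as.numeric(vapply(parts, `[`, character(1), timepos))
mz <- as.numeric(vapply(parts, `[`, character(1), mzpos))

features <- data.frame(mz = round(mz, mzdec), rt = rt)
print(features)
```

If the headers instead use rt first (for example "312.4-180.0634"), the same sketch applies with featdelim = "-" and timepos = 1.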

Details

Main clustering function output; see the citation for the algorithm description or vignette('RAMClustR') for a walkthrough. batch.qc normalization requires three input vectors: (1) batch, (2) order, (3) qc. This is a feature-centric normalization approach that adjusts signal intensities in two steps, one feature at a time. First, the batch median QC signal intensity of each feature is compared with the full-dataset median to correct for systematic batch effects. Second, a local QC median versus global median correction is applied to correct for run-order effects.
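The two-step batch.qc idea can be sketched in base R for a single feature. This is a simplified reading of the description above, not RAMClustR's implementation: the intensities are invented, run_order stands in for the order argument, outlier removal is omitted, and the choice to restrict nearby QC injections to the same batch is an assumption.

```r
# Invented intensities for ONE feature over 12 injections in two batches
intensity <- c(100, 102,  98, 101,  99, 100,   # batch 1
               150, 153, 147, 151, 149, 150)   # batch 2 (systematic shift)
batch     <- rep(1:2, each = 6)
run_order <- 1:12                               # plays the role of 'order'
qc        <- rep(c(TRUE, FALSE, FALSE), times = 4)
qc.inj.range <- 3                               # QC every 3 injections

# Step 1: batch correction. Scale each batch so that its QC median
# matches the QC median of the full dataset.
global_qc_median <- median(intensity[qc])
step1 <- intensity
for (b in unique(batch)) {
  in_batch <- batch == b
  batch_qc_median <- median(intensity[in_batch & qc])
  step1[in_batch] <- intensity[in_batch] * global_qc_median / batch_qc_median
}

# Step 2: run-order correction. For each injection, compare the median of
# nearby QC injections (within qc.inj.range, same batch) with the global
# QC median of the batch-corrected data.
global_step1 <- median(step1[qc])
step2 <- step1
for (i in seq_along(step1)) {
  near <- qc & batch == batch[i] &
    abs(run_order - run_order[i]) <= qc.inj.range
  if (any(near)) {
    step2[i] <- step1[i] * global_step1 / median(step1[near])
  }
}

print(round(step2, 1))
```

After step 1 the QC medians of both batches coincide, which removes the systematic offset between batches; step 2 then makes smaller local adjustments within each batch.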

Value

$featclus: integer vector of cluster membership for each feature

$frt: feature retention time, in whatever units were fed in (xcms uses seconds, by default)

$fmz: feature m/z, reported to the number of decimal places selected in the ramclustR function

$xcmsOrd: the original XCMS (or csv) feature order for cross referencing, if need be

$clrt: cluster retention time

$clrtsd: retention time standard deviation of all the features that comprise that cluster

$nfeat: number of features in the cluster

$nsing: number of 'singletons' - that is the number of features which clustered with no other feature

$ExpDes: the experimental design object used when running ramclustR. List of two dataframes.

$cmpd: compound name. C#### are assigned in order of output by dynamicTreeCut. Compound with the most features is classified as C0001...

$ann: annotation. By default, annotation names are identical to 'cmpd' names. This slot is a placeholder for when annotations are provided

$MSdata: the MSdataset provided by either xcms or csv input

$MSMSdata: the (optional) MSe/idMSMS dataset provided by either xcms or csv input

$SpecAbund: the cluster intensities after collapsing features to clusters

$SpecAbundAve: the cluster intensities after averaging all samples with identical sample names

- A 'spectra' directory is created in the working directory. In this directory a .msp file is (optionally) created, which contains the spectra for all compounds in the dataset following clustering. If MSe/idMSMS data are provided, they are listed with the same compound name as the MS spectrum, with the collision energy from the ExpDes object used to distinguish low from high CE spectra.
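The relationship between $SpecAbund and $SpecAbundAve (averaging all samples with identical sample names) can be illustrated with a small base-R sketch. The matrix below is invented and only mimics the assumed layout of rows as injections and columns as compounds; it is not output from ramclustR.

```r
# Invented cluster intensities: rows are injections, columns are compounds
SpecAbund <- matrix(
  c( 10,  12, 30, 34,    # C0001
    100, 104, 50, 54),   # C0002
  nrow = 4,
  dimnames = list(c("sampleA", "sampleA", "sampleB", "sampleB"),
                  c("C0001", "C0002"))
)

# Sum rows that share a sample name, then divide by the number of
# injections per name to obtain the per-sample average
sums   <- rowsum(SpecAbund, group = rownames(SpecAbund))
counts <- as.vector(table(rownames(SpecAbund))[rownames(sums)])
SpecAbundAve <- sums / counts

print(SpecAbundAve)
```

The result has one row per unique sample name, so replicate injections of the same sample are collapsed into a single averaged row.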

Author(s)

Corey Broeckling

References

Broeckling CD, Afsar FA, Neumann S, Ben-Hur A, Prenni JE. RAMClust: a novel feature clustering method enables spectral-matching-based annotation for metabolomics data. Anal Chem. 2014 Jul 15;86(14):6812-7. doi: 10.1021/ac501530d. Epub 2014 Jun 26. PubMed PMID: 24927477.

Broeckling CD, Ganna A, Layer M, Brown K, Sutton B, Ingelsson E, Peers G, Prenni JE. Enabling Efficient and Confident Annotation of LC-MS Metabolomics Data through MS1 Spectrum and Time Prediction. Anal Chem. 2016 Sep 20;88(18):9226-34. doi: 10.1021/acs.analchem.6b02479. Epub 2016 Sep 8. PubMed PMID: 7560453.


sipss/AlpsLCMS documentation built on May 13, 2021, 6:18 p.m.