Harman is a PCA and constrained optimisation based technique that maximises the removal of batch effects from datasets, with the constraint that the probability of overcorrection (i.e. removing genuine biological signal along with batch noise) is kept to a fraction which is set by the end-user (Oytam et al, 2016; http://dx.doi.org/10.1186/s12859-016-1212-5).
Harman expects unbounded data. For example, with HumanMethylation450 arrays, do not use the Beta statistic (whose values are constrained between 0 and 1); instead, use the logit-transformed M-values.
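As a minimal sketch of that transformation in base R, assuming the Beta values are held in a numeric matrix: the M-value is conventionally the base-2 logit of Beta. The helper name `beta_to_m` and the clamping offset `eps` are illustrative assumptions and are not part of Harman.

```r
# Hypothetical helper (not part of Harman): convert Beta values in [0, 1]
# to unbounded M-values using the base-2 logit.
beta_to_m <- function(beta, eps = 1e-6) {
  # Clamp away from exactly 0 and 1 so the logit stays finite;
  # the value of eps is an assumption chosen for illustration only.
  beta <- pmin(pmax(beta, eps), 1 - eps)
  log2(beta / (1 - beta))
}

# Toy Beta matrix: probes in rows, samples in columns.
betas <- matrix(c(0.5, 0.8, 0.2, 0.9), nrow = 2,
                dimnames = list(c("cg1", "cg2"), c("s1", "s2")))
mvals <- beta_to_m(betas)
```

The resulting `mvals` matrix keeps the probes-by-samples layout and could then be passed on in place of the Beta values.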
datamatrix: matrix or data.frame, the data values to correct, with samples in columns and data values in rows. Internally, a data.frame will be coerced to a matrix. Matrices need to be of type integer or numeric.
expt: vector or factor with the experimental variable of interest (variance to be kept).
batch: vector or factor with the batch variable (variance to be removed).
limit: numeric, confidence limit. Indicates the limit of confidence at which to stop removing a batch effect. Must be between 0 and 1.
numrepeats: integer, the number of repeats for which to run the simulated batch mean distribution estimator using the random selection algorithm. (N.B. 32-bit Windows versions may have an upper limit of 300000 before catastrophic failure.)
randseed: integer, the seed for random number generation.
forceRand: logical, forces Harman to use the random selection algorithm to compute corrections. The simulated batch mean is then created by random selection of scores rather than by full explicit calculation from all permutations.
printInfo: logical, whether to print information during computation or not.
The datamatrix needs to be of type integer or numeric, or alternatively a data.frame that can be coerced into one using as.matrix. The matrix is to be constructed with data values (typically microarray probes or sequencing counts) in rows and samples in columns, much like the 'assayData' slot in the canonical Bioconductor eSet object, or any object which inherits from it. The data should have normalisation and any other global adjustment for noise reduction (such as background correction) applied prior to using Harman.
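The layout requirements above can be checked before calling Harman. The following is a minimal sketch in base R; the function name `prepare_datamatrix` and the toy data are hypothetical and are not part of the Harman package.

```r
# Hypothetical pre-flight check (not part of Harman): coerce the input to a
# matrix and confirm it matches the layout described above.
prepare_datamatrix <- function(datamatrix, expt, batch) {
  # A data.frame is coerced with as.matrix, as the text describes.
  if (is.data.frame(datamatrix)) datamatrix <- as.matrix(datamatrix)
  # is.numeric() is TRUE for both integer and double matrices.
  stopifnot(is.matrix(datamatrix), is.numeric(datamatrix))
  # Samples are in columns, so each column needs one expt and one batch label.
  stopifnot(ncol(datamatrix) == length(expt),
            ncol(datamatrix) == length(batch))
  datamatrix
}

# Toy data: 3 probes (rows) by 4 samples (columns).
df <- data.frame(s1 = c(1.2, 0.4, 2.1), s2 = c(1.0, 0.5, 2.3),
                 s3 = c(1.9, 0.3, 2.0), s4 = c(1.7, 0.6, 2.2),
                 row.names = c("p1", "p2", "p3"))
m <- prepare_datamatrix(df,
                        expt  = c("ctrl", "treat", "ctrl", "treat"),
                        batch = c(1, 1, 2, 2))
```

A data.frame containing a non-numeric column would coerce to a character matrix and fail the `is.numeric()` check, which is usually a sign that annotation columns still need to be separated from the data values.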
For convergence, the number of simulations (the numrepeats parameter) should probably be at least 100,000. The underlying principle of Harman rests upon PCA, which is a parametric technique. This implies Harman should be optimal when the data is normally distributed. However, PCA is known to be rather robust to very non-normal data.
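To illustrate why PCA is a natural basis for detecting batch effects, here is a small sketch using base R's `prcomp` on simulated data. It is plain PCA only and is not Harman's algorithm; the simulated matrix, the batch labels, and the size of the shift are all assumptions made for illustration.

```r
# Illustrative only: plain PCA on simulated data, not Harman's algorithm.
set.seed(1)
n_probes <- 100
batch <- c(1, 1, 1, 1, 2, 2, 2, 2)

# Probes in rows, samples in columns, matching the layout Harman expects.
x <- matrix(rnorm(n_probes * length(batch)), nrow = n_probes)

# Add a constant shift to every probe in batch 2 to mimic a batch effect.
x[, batch == 2] <- x[, batch == 2] + 3

# prcomp() wants observations (samples) in rows, hence the transpose.
pca <- prcomp(t(x))
pc1 <- pca$x[, 1]

# Mean PC1 score per batch; a large gap indicates batch-driven variance.
batch_means <- tapply(pc1, batch, mean)
```

In this toy setting the first principal component separates the two batches, which is the kind of structure a PCA-based correction works on.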
A harmanresults S3 object.
Oytam et al (2016) BMC Bioinformatics 17:1. DOI: 10.1186/s12859-016-1212-5
harman, reconstructData, pcaPlot, arrowPlot
library(Harman)
library(HarmanData)
data(OLF)
expt <- olf.info$Treatment
batch <- olf.info$Batch
olf.harman <- harman(olf.data, expt, batch)
plot(olf.harman)
olf.data.corrected <- reconstructData(olf.harman)

## Reading from a csv file
datafile <- system.file("extdata", "NPM_data_first_1000_rows.csv.gz",
                        package="Harman")
infofile <- system.file("extdata", "NPM_info.csv.gz", package="Harman")
datamatrix <- read.table(datafile, header=TRUE, sep=",", row.names="probeID")
batches <- read.table(infofile, header=TRUE, sep=",", row.names="Sample")
res <- harman(datamatrix, expt=batches$Treatment, batch=batches$Batch)
arrowPlot(res, 1, 3)