Description Usage Arguments Details Value Author(s) References See Also Examples
These functions parallelize several segmentation algorithms and make their calling use the same conventions as for other methods.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 | pSegmentDNAcopy(cghRDataName, chromRDataName, merging = "MAD",
mad.threshold = 3, smooth = TRUE,
alpha=0.01, nperm=10000,
p.method = "hybrid",
min.width = 2,
kmax=25, nmin=200,
eta = 0.05,trim = 0.025,
undo.splits = "none",
undo.prune=0.05, undo.SD=3,
typeParall = "fork",
mc.cores = detectCores(),
certain_noNA = FALSE,
loadBalance = TRUE,
...)
pSegmentHaarSeg(cghRDataName, chromRDataName,
merging = "MAD", mad.threshold = 3,
W = vector(),
rawI = vector(),
breaksFdrQ = 0.001,
haarStartLevel = 1,
haarEndLevel = 5,
typeParall = "fork",
mc.cores = detectCores(),
certain_noNA = FALSE,
loadBalance = FALSE,
...)
pSegmentHMM(cghRDataName, chromRDataName,
merging = "mergeLevels", mad.threshold = 3,
aic.or.bic = "AIC",
typeParall = "fork",
mc.cores = detectCores(),
certain_noNA = FALSE,
loadBalance = TRUE,
...)
pSegmentBioHMM(cghRDataName, chromRDataName, posRDataName,
merging = "mergeLevels", mad.threshold = 3,
aic.or.bic = "AIC",
typeParall = "fork",
mc.cores = detectCores(),
certain_noNA = FALSE,
loadBalance = TRUE,
...)
pSegmentCGHseg(cghRDataName, chromRDataName, CGHseg.thres = -0.05,
merging = "MAD", mad.threshold = 3,
typeParall = "fork",
mc.cores = detectCores(),
certain_noNA = FALSE,
loadBalance = TRUE,
...)
pSegmentGLAD(cghRDataName, chromRDataName,
deltaN = 0.10,
forceGL = c(-0.15, 0.15),
deletion = -5,
amplicon = 1,
typeParall = "fork",
mc.cores = detectCores(),
certain_noNA = FALSE,
GLADdetails = FALSE,
loadBalance = TRUE,
...)
pSegmentWavelets(cghRDataName, chromRDataName, merging = "MAD",
mad.threshold = 3,
minDiff = 0.25,
minMergeDiff = 0.05,
thrLvl = 3, initClusterLevels = 10,
typeParall = "fork",
mc.cores = detectCores(),
certain_noNA = FALSE,
loadBalance = TRUE,
...)
|
cghRDataName |
The Rdata file name that contains the
If this is an Note that the type of object in |
chromRDataName |
The RData file name with the ff (short integer)
vector with the chromosome indicator, or the name of the in-memory
RAM R object with the data. Function
|
posRDataName |
The RData file name with the ff double vector with
the location (e.g., position in kbases) of each probe in the
chromosome, or the name of the in-memory RAM R object with the
data. Function |
merging |
Merging method; for most methods one of "MAD" or "mergeLevels". For CBS (pSegmentDNAcopy), GGHseg (pSegmentCGHseg), HaarSeg (pSegmentHaarSeg), and Wavelets (as in Hsu et al. —pSegmentWavelets) also "none". This option does not apply to GLAD (which has its own merging-like approach). See details. |
mad.threshold |
If using |
typeParall |
One of "fork" or "cluster". "fork" is unavailable in
Windows, and will lead to sequential execution. "cluster" requires
having set up a cluster before, with appropriate calls to
Using "fork" and "cluster" will lead to different schemes for parallelization. See details and the vignette. |
mc.cores |
The number of cores used if |
.
certain_noNA |
Are you certain, absolutely sure, your data contain no missing values? (Default is FALSE). If you are, you can achieve considerable speed ups by setting it to TRUE. But if you set it to TRUE and you are wrong, some methods will fail (some with harder to understand error messages) and, even worse, other methods might appear to work (but give incorrect results). You've been warned. |
loadBalance |
If TRUE (the default for all methods except HaarSeg)
use load balancing with MPI (use
|
smooth |
For DNAcopy only. If TRUE (default) carry out smoothing as
explained in |
alpha |
For DNAcopy only. See |
nperm |
For DNAcopy only. See |
p.method |
For DNAcopy only. See |
min.width |
For DNAcopy only. See |
kmax |
For DNAcopy only. See |
nmin |
For DNAcopy only. See |
eta |
For DNAcopy only. See |
trim |
For DNAcopy only. See |
undo.splits |
For DNAcopy only. See |
undo.prune |
For DNAcopy only. See |
undo.SD |
For DNAcopy only. See |
W |
For HaarSeg: Weight matrix, corresponding to quality of measurment. Insert 1/(sigma**2) as weights if your platform output sigma as the quality of measurment. W must have the same size as I. |
rawI |
For HaarSeg. Mininum of the raw red and raw green measurment, before the log. rawI is used for the non-stationary variance compensation. rawI must have the same size as I. |
breaksFdrQ |
For HaarSeg. The FDR q parameter. Common used values are 0.05, 0.01, 0.001. Default value is 0.001. |
haarStartLevel |
For HaarSeg. The detail subband from which we start to detect peaks. The higher this value is, the less sensitive we are to short segments. The default is value is 1, corresponding to segments of 2 probes. |
haarEndLevel |
For HaarSeg. The detail subband until which we use to detect peaks. The higher this value is, the more sensitive we re to large trends in the data. This value DOES NOT indicate the largest possible segment that can be detected. The default is value is 5, corresponding to step of 32 probes in each direction. |
aic.or.bic |
For HMM and BioHMM. One of "AIC" or "BIC". See
criteria in |
CGHseg.thres |
The threshold for the adaptive penalization in Picard et al.'s CGHseg. See p. 13 of the original paper. Must be a negative number. The default value used in the original reference is -0.5. However, our experience with the simulated data in Willenbrock and Fridlyand (2005) indicates that for those data values around -0.005 are more appropriate. We use here -0.05 as default. |
deltaN |
Only for GLAD. See deltaN in |
forceGL |
Only for GLAD. See forceGL in |
deletion |
Only for GLAD. See deletion in |
amplicon |
Only for GLAD. See amplicon in
|
GLADdetails |
Only for GLAD. If set to TRUE the function returns verbose output about where it is along the execution. This option (setting it to FALSE) is likely to become hard-coded in the future. |
minMergeDiff |
Used only when doing merging in the wavelet
method of Hsu et al.. The finall call as to which segments go together is done by
a |
minDiff |
For Wavelets (Hsu et al.). Minimum (absolute) difference between the medians of two adjacent clusters for them to be considered truly different. Clusters "closer" together than this are collapsed together to form a single cluster. |
thrLvl |
The level used for the wavelet thresholding in Hsu et al. |
initClusterLevels |
For Wavelets (Hsu et al.). The initial number of clusters to form. |
... |
Additional arguments; not used. |
In most cases, these are wrappers to the original code, with
modifications for parallelization and for using ff
objects, if appropriate.
Using option typeParall = "fork"
will, as it says, use the
forking mechanism available in package parallel
. The objects used
can be either ff
objects or regular R objects. Using
typeParall = "cluster"
will use a pre-existing cluster, and the
objects used must be ff
ones, since we only pass pointers to the
objects, not the objects themselves, to try to minimize communication
and memory usage. To put it the other way around: if you use RAM
objects, you must use typeParall = "fork"
; with ff
objects
you can use both typeParall = "cluster"
and typeParall =
"fork"
. Further details are provided in the vignette.
For HMM, BioHMM, CGHseg, and Wavelets, the first part of the analysis is conducted parallelizing over array by chromosome (because the methods are slow and/or very memory consuming). The final step (merging), however, is carried out over array (it is a step that must be carried array-wise). For all other methods, we have parallelized over arrays: the extra communication overheads of the much finer-grained parallelization of array by chromosome are rarely justified with these methods and, in the case of GLAD, would require modifying the original C code.
CGHseg has been implemented here following the original authors description. Note that several publications incorrectly claim that they use the CGHseg approach when, actually, they are only using the "segment" function in the "tilingArray" package, but they are missing the key step of choosing the optimal number of segments (see p. 13 in Picard et al, 2005). We implement the author's method in our (internal, so use "ADaCGH2:::piccardsKO" to see it) function "piccardsKO".
When using GLAD, we use the HaarSeg approach. This is the same as using
the daglad
function with argument smoothfunc = "haarseg"
.
For BioHMM and HMM the smoothed results are merged, by default by the mergeLevels algorithm, as recommended in Willenbrock and Fridlyand, 2005. For DNAcopy the default used to be mergeLevels, following the above recommendations, but we are now using MAD by default, as it is much faster and it is unclear that mergeLevels is the right approach with the type of data available today. Your mileage might vary and you probably will want to try both on some test data and check which makes more sense.
Merging is also done in GLAD (with GLAD's own merging algorithm). For HaarSeg, calling/merging is carried out using MAD, following page i141 of Ben-Yaacov and Eldar, section 2.3, "Determining aberrant intervals": a MAD (per their definition) is computed and any segment with absolute value larger than mad.threshold * MAD is considered aberrant. Merging is also performed for CGHseg (the default, however, is MAD, not mergeLevels). Merging (using either of "mergeLevels" or "MAD") can also be used with the wavelet-based method of Hsu et al.; please note that the later is an experimental feature implemented by us, and there is no study of its performance.
In summary, for all segmentation methods (except GLAD) merging is available as either "mergeLevels" or "MAD". For DNAcopy, CGHseg, HaarSeg, and wavelets as in Hsu et al., you can also choose no merging, though this will rarely be what you want (we offer this option to allow using the original authors' choices in their first descriptions of methods).
When using mergeLevels, we map the results to states of "Alteration", so that we categorize each probe as taking one, and only one, of three possible values, -1 (loss of genomic DNA), 0 (no change in DNA content), +1 (gain of genomic DNA). We have made the assumption, in this mapping, that the "no change" class is the one that has the absolute value closest to zero, and any other classes are either gains or losses. When the data are normalized, the "no change" class should be the most common one. When using MAD this step is implicit in the procedure ( any segment with absolute value larger than mad.threshold * MAD is considered aberrant).
Note that "mergeLevels", in addition to being used for calling gains and losses, results in a decrease in the number of distinct smoothed values, since it can merge two or more adjacent smoothed levels. "MAD", in contrast, performs no merging as such, but only calling.
A list of two components (the components will be either
ff
or regular, in-memory R objects, depending on the input):
outSmoothed |
The smoothed values, as either a
|
outState |
The calls for each probe, as either an
|
If the output uses ffdf
, rows and columns of each
element can be accessed in the usual way for ffdf
objects, but accept also most of the usual R operations for data
frames.
The code for DNAcopy, HMM, BioHMM, and GLAD are basically wrappers
around the original functions by their corresponding authors, with some
modiffications for parallelization and usage of ff objects. The original
packages are: DNAcopy
, aCGH
, snapCGH
, cgh
,
GLAD
, respectively. The CGHseg method uses package
tilingArray
.
HaarSeg has been turned into an R package, available from https://r-forge.r-project.org/projects/haarseg/. That package uses, at its core, the same R and C code as we do, from Ben-Yaacov and Eldar. We have not used the available R package for historical reasons (we used Eldar and Ben-Yaacov's C and R code in the former ADaCGH package, before a proper R package was available).
For the wavelet-based method we have only wrapped the code that was kindly provided by L. Hsu and D. Grove, and parallelized a few calls. Their original code is included in the sources of the package.
Parallelization and modifications for using ff and additions are by Ramon Diaz-Uriarte rdiaz02@gmail.com
Diaz-Uriarte, R. (2014). ADaCGH2: parallelized analysis of (big) CNA data. Bioinformatics, 30: 1759–1761.
Carro A, Rico D, Rueda O M, Diaz-Uriarte R, and Pisano DG. (2010). waviCGH: a web application for the analysis and visualization of genomic copy number alterations. Nucleic Acids Research, 38 Suppl:W182–187.
Fridlyand, Jane and Snijders, Antoine M. and Pinkel, Dan and Albertson, Donna G. (2004). Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis, 90: 132–153.
Hsu L, Self SG, Grove D, Randolph T, Wang K, Delrow JJ, Loo L, Porter P. (2005) Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics, 6:211-26.
Hupe, P. and Stransky, N. and Thiery, J. P. and Radvanyi, F. and Barillot, E. (2004). Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics, 20: 3413–3422.
Lingjaerde OC, Baumbusch LO, Liestol K, Glad I, Borresen-Dale AL. (2005). CGH-Explorer: a program for analysis of CGH-data. Bioinformatics, 21: 821–822.
Marioni, J. C. and Thorne, N. P. and Tavare, S. (2006). BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics, 22: 1144–1146.
Olshen, A. B. and Venkatraman, E. S. and Lucito, R. and Wigler, M. (2004) Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 4, 557–572. http://www.mskcc.org/biostat/~olshena/research.
Picard, F. and Robin, S. and Lavielle, M. and Vaisse, C. and Daudin, J. J. (2005). A statistical approach for array CGH data analysis. BMC Bioinformatics, 6, 27. http://dx.doi.org/10.1186/1471-2105-6-27.
Price TS, Regan R, Mott R, Hedman A, Honey B, Daniels RJ, Smith L, Greenfield A, Tiganescu A, Buckle V, Ventress N, Ayyub H, Salhan A, Pedraza-Diaz S, Broxholme J, Ragoussis J, Higgs DR, Flint J, Knight SJ. (2005) SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Res., 33:3455-64.
Willenbrock, H. and Fridlyand, J. (2005). A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics, 21, 4084–4091.
Diaz-Uriarte, R. and Rueda, O.M. (2007). ADaCGH: A parallelized web-based application and R package for the analysis of aCGH data, PLoS ONE, 2: e737.
Ben-Yaacov, E. and Eldar, Y.C. (2008). A Fast and Flexible Method for the Segmentation of aCGH Data, Bioinformatics, 24: i139-i145.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 | #####################################################
###
### Using forking with RAM objects
###
#####################################################
### Note to windows users: under Windows, this will
### result in sequential execution, as forking is not
### available.
## Get example input data and create data objects
data(inputEx)
## (this is not necessary, but is convenient;
## you could do the subsetting in the call themselves)
cgh.dat <- inputEx[, -c(1, 2, 3)]
chrom.dat <- as.integer(inputEx[, 2])
pos.dat <- inputEx[, 3]
## Segment with HaarSeg
## You might want to adapt mc.cores to your hardware
haar.RAM.fork <- pSegmentHaarSeg(cgh.dat, chrom.dat,
merging = "MAD",
mc.cores = 2)
## What does the output look like?
lapply(haar.RAM.fork, head)
## Where and what length are segments in first sample?
rle(haar.RAM.fork$outSmoothed[, 1])
## Repeat, without load-balancing
## You might want to adapt mc.cores to your hardware
haar.RAM.fork.nlb <- pSegmentHaarSeg(cgh.dat, chrom.dat,
merging = "MAD",
loadBalance = FALSE,
mc.cores = 2)
if(.Platform$OS.type != "windows") {
## We do not want this to run in Windows the automated tests since
## issues with I/O. It should work, though, in interactive usage
#####################################################
###
### Using forking with ff objects
###
#####################################################
### Note to windows users: under Windows, this will
### result in sequential execution, as forking is not
### available.
## Create a temp dir for storing output and ff objects.
## (Not needed, but cleaner).
dir.create("ADaCGH2_example_tmp_dir")
originalDir <- getwd()
setwd("ADaCGH2_example_tmp_dir")
## Get input data in ff format
## (we loaded the RData above, but we need to find the full path
## to use it in the call to inputToADaCGH)
fname <- list.files(path = system.file("data", package = "ADaCGH2"),
full.names = TRUE, pattern = "inputEx.RData")
inputToADaCGH(ff.or.RAM = "ff",
RDatafilename = fname)
## Segment with HaarSeg
## You might want to adapt mc.cores to your hardware
haar.ff.fork <- pSegmentHaarSeg("cghData.RData",
"chromData.RData",
merging = "MAD",
mc.cores = 2)
## What does the output look like?
haar.ff.fork
## Note the warnings; we will be gentler in next example.
#####################################################
###
### Using a cluster with ff objects
###
#####################################################
## Start a socket cluster. Change the appropriate number of CPUs
## for your hardware and use other types of clusters (e.g., MPI)
## if you want.
cl2 <- parallel::makeCluster(2,"PSOCK")
parallel::clusterSetRNGStream(cl2)
parallel::setDefaultCluster(cl2)
parallel::clusterEvalQ(NULL, library("ADaCGH2"))
## The following is not really needed if you create the cluster AFTER
## changing directories. But better to be explicit.
wdir <- getwd()
parallel::clusterExport(NULL, "wdir")
parallel::clusterEvalQ(NULL, setwd(wdir))
## Segment with HaarSeg
haar.ff.cluster <- pSegmentHaarSeg("cghData.RData",
"chromData.RData",
merging = "MAD",
typeParall= "cluster")
## Avoid warnings by opening the objects
names(haar.ff.cluster)
open(haar.ff.cluster$outSmoothed)
open(haar.ff.cluster$outState)
## Alternatively, we can open the two ffdfs with lapply
## lapply(haar.ff.cluster, open)
##########################################
###
### Compare output (should be identical)
###
##########################################
all.equal(haar.ff.cluster$outSmoothed[ , ],
haar.ff.fork$outSmoothed[ , ])
all.equal(haar.ff.cluster$outSmoothed[ , ],
haar.RAM.fork$outSmoothed[ , ])
identical(haar.ff.cluster$outState[ , ],
haar.ff.fork$outState[ , ])
identical(haar.ff.cluster$outState[ , ],
haar.RAM.fork$outState[ , ])
#####################################################################
####
#### Clean up actions
####
#### (These are not needed. They are convenient here, to prevent
#### leaving garbage in your hard drive. In "real life" you will
#### have to decide what to delete and what to store).
#####################################################################
### Explicitly stop cluster
parallel::stopCluster(cl2)
### All objects (RData and ff) are left in this directory
getwd()
### We will clean it up, and do it step-by-step
### BEWARE: DO NOT do this with objects you want to keep!!!
## Remove ff and RData for the data
load("chromData.RData")
load("posData.RData")
load("cghData.RData")
delete(cghData); rm(cghData)
delete(posData); rm(posData)
delete(chromData); rm(chromData)
unlink("chromData.RData")
unlink("posData.RData")
unlink("cghData.RData")
unlink("probeNames.RData")
## Remove ff and R objects with segmentation results
lapply(haar.ff.fork, delete)
rm(haar.ff.fork)
lapply(haar.ff.cluster, delete)
rm(haar.ff.cluster)
### Try to prevent problems in R CMD check
## Sys.sleep(2)
### Delete temp dir
setwd(originalDir)
## Sys.sleep(2)
unlink("ADaCGH2_example_tmp_dir", recursive = TRUE)
## Sys.sleep(2)
}
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.