inputToADaCGH: Convert CGH data to ff or RAM objects for use with ADaCGH2

Description Usage Arguments Details Value Author(s) See Also Examples

View source: R/ADaCGH-2.R

Description

Input data with CGH data are converted to several ff files and data checked for potential errors and location duplications.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
inputToADaCGH(ff.or.RAM = "RAM",
                      robjnames = c("cgh.dat", "chrom.dat",
                                   "pos.dat", "probenames.dat"),
                      ffpattern = paste(getwd(), "/", sep = ""),
                      MAList = NULL,
                      cloneinfo = NULL,
                      RDatafilename = NULL,
                      textfilename = NULL,
                      dataframe = NULL,
                      path = NULL,
                      excludefiles = NULL,
                      cloneinfosep = "\t",
                      cloneinfoquote = "\"",
                      minNumPerChrom = 10,
                      verbose = FALSE,
                      mc.cores = floor(detectCores()/2))

Arguments

ff.or.RAM

Whether you want to store the output as ff or RAM objects ("usual" R objects, such as data frames and vectors).

robjnames

Name of the objects that will be created in you use ff.or.RAM = "RAM". If the names existing in the environment from where the function is called, they will not be overwritten, and the function will abort.

ffpattern

See argument pattern in ff. The default is to create the "ff" files in the current working directory.

MAList

The name of an object of class MAList (as.MAList) or SegList (e.g., dim.SegList). See vignnettes for these packages (limma and snapCGH) for details about these objects.

You have to specify one, and only one, of MAList, RDatafilename, textfilename, path.

cloneinfo

A character vector with the full path to a file that conforms to the characteristics of file in function read.clonesinfo (see details in the vignette) or the name of a data frame with at least a column named "Chr" (with chromosomal informtaion) and "Position".

This is only needed if you use MAList and your MAList object does not have Position and Chr columns.

RDatafilename

Name of data RData file that contains the data frame with original, non-ff, data. Note: this is the name of the RData file (possibly including path), NOT the name of the data frame. (For that, look at dataframe).

The first three columns of the data frame are the IDs of the probes, the chromosome number, and the position, and all remaining columns contain the data for the arrays, one column per array. The names of the first three column do not matter, but the order does. Names of the remaining columns will be used if existing; otherwise, fake array names will be created.

You have to specify one, and only one, of MAList, RDatafilename, textfilename, path.

textfilename

The name of a text file with the data. It should be a tab separated file, with a header. The first three columns of the data frame are the IDs of the probes, the chromosome number, and the position, and all remaining columns contain the data for the arrays, one column per array. The names of the first three column do not matter, but the order does. Names of the remaining columns will be used if existing; otherwise, fake array names will be created.

You have to specify one, and only one, of MAList, RDatafilename, textfilename, path.

dataframe

The name of a data frame with the data. The first three columns of the data frame are the IDs of the probes, the chromosome number, and the position, and all remaining columns contain the data for the arrays, one column per array. The names of the first three column do not matter, but the order does. Names of the remaining columns will be used if existing; otherwise, fake array names will be created.

path

The name of the directory (the full path) to where each of the individual, one-column, files are. We will read ALL of the files in this directory, except for those listed under excludefiles. One file has to be named "ID.txt", another "Chrom.txt", and a third "Pos.txt". The rest of the files can be named any way you want and those are the files that contain the CGH data.

All of the files are expected to be one-column text files, with a first row with a header. The header will not be used for "ID.txt", "Chrom.txt", or "Pos.txt", but the header will be used as the name of the array/subject for the CGH data files.

You have to specify one, and only one, of MAList, RDatafilename, textfilename, path.

excludefiles

If you have specified path, names of files not to be read. A vector of strings. These should be the names of the files, without path information (as all of the files are in the same directory, as specified by path).

cloneinfosep

Argument to read.table if reading a cloneinfo file. Note: this is NOT used if reading a text file given in textfilename.

cloneinfoquote

Argument to read.table if reading a cloneinfo file. Note: this is NOT used if reading a text file given in textfilename.

minNumPerChrom

If any chromosome has fewer observations than minNumPerChrom the function will fail. This can help detect upstream pre-processing errors.

verbose

If TRUE, provide additional information that can be useful to debug problems. Right now it provides the list of files that will be read if using a directory. The default is FALSE.

mc.cores

The number of cores to use when reading files. This is always 1 in Windows. See details about the number of cores in mclapply and detectCores. Contention problems in I/O might be minimized by making this number smaller or much smaller than what is returned by detectCores. For long running jobs, please do some initial benchmarking. See comments and discussion in file "benchmark.pdf". In general, if you have a single SATA disk make this a small number (say, 2 or 3 or 6); in contrast, if you have many fast SAS disks in a RAID0 or RAID10 array, you can increase the number quite a bit (but generally always well below what detectCores gives).

Details

If there are identical positions (in the same chromosome) a small random uniform variate is added to get unique locations.

We carry out several checks (e.g., no duplicated positions), but note that we DO NOT check for extremely large or small values, and this includes NOT CHECKING for infinite values.

Missing values are allowed in the data columns. However, we do not check for missing values in the ID, chromosome, or position columns, except if you are using as input an RData file or MA list. You better not have any missing values there; otherwise, things will break in strange ways. Why this inconsistency? Checking for missing values can consume a lot of resources (CPU and memory). If your are really huge, they will probably be stored as text files, and you are expected to use the appropriate tools there to filter (e.g., sed, awk, whatever). If they exist as an MA list or an RData file, they once fitted in RAM, so checking for these NAs is probably reasonable.

If you provide a text file as input (textfilename), the reading operation is carried out using read.table.ffdf, to allow for reading very large files. Using this option, however, does not force you to produce as output ff objects.

Commented examples of reading objects from limma and snapCGH are provided in the vignnette.

Value

This function is used mainly for its side effects: writing either several ff files to the current working directory, or several RAM objects (the usual, in memory, local, R objects). The actual names are printed out.

Author(s)

Ramon Diaz-Uriarte rdiaz02@gmail.com

See Also

cutFile for obtaining files in the format needed if you read from a directory.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
## Create a temp dir for storing output.
## (Not needed, but cleaner).
dir.create("ADaCGH2_example_input_dir")
originalDir <- getwd()
setwd("ADaCGH2_example_input_dir")
## Sys.sleep(1)

## Get location (and full filename) of example data file
fnameRData <- list.files(path = system.file("data", package = "ADaCGH2"),
                         full.names = TRUE, pattern = "inputEx.RData")

fnametxt <- list.files(path = system.file("data", package = "ADaCGH2"),
                         full.names = TRUE, pattern = "inputEx.txt")

namepath <- system.file("example-datadir", package = "ADaCGH2")

## Read from RData and write to ff
inputToADaCGH(ff.or.RAM = "ff",
              RDatafilename = fnameRData)

## Read from text file and write to ff
##    You might want to adapt mc.cores to your hardware
inputToADaCGH(ff.or.RAM = "ff",
              textfilename = fnametxt,
              mc.cores = 2)


## Read from text file and write to RAM
##    You might want to adapt mc.cores to your hardware
inputToADaCGH(ff.or.RAM = "RAM",
              textfilename = fnametxt,
              mc.cores = 2)

## Read from a directory and write to ff
##    You might want to adapt mc.cores to your hardware
inputToADaCGH(ff.or.RAM = "ff",
              path = namepath,
              mc.cores = 2)

### Clean up (DO NOT do this with objects you want to keep!!!)
load("chromData.RData")
load("posData.RData")
load("cghData.RData")

delete(cghData); rm(cghData)
delete(posData); rm(posData)
delete(chromData); rm(chromData)
unlink("chromData.RData")
unlink("posData.RData")
unlink("cghData.RData")
unlink("probeNames.RData")


### Running in a separate process. Only makes sense
### if returning ff objects (ff.or.RAM = "ff")
### This example will not work on Windows

## Not run: 
mcparallel(inputToADaCGH(ff.or.RAM = "ff",
                                 RDatafilename = fnameRData),
                                 silent = FALSE)
  tableChromArray <- mccollect()
  if(inherits(tableChromArray, "try-error")) {
    stop("ERROR in input data conversion")
  }
### Clean up (DO NOT do this with objects you want to keep!!!)
load("chromData.RData")
load("posData.RData")
load("cghData.RData")

delete(cghData); rm(cghData)
delete(posData); rm(posData)
delete(chromData); rm(chromData)
unlink("chromData.RData")
unlink("posData.RData")
unlink("cghData.RData")
unlink("probeNames.RData")

## End(Not run)

### Try to prevent problems in R CMD check
## Sys.sleep(2)

### Delete temp dir
setwd(originalDir)
## Sys.sleep(2)
unlink("ADaCGH2_example_input_dir", recursive = TRUE)
## Sys.sleep(2)

ADaCGH2 documentation built on Nov. 8, 2020, 4:57 p.m.