process: cghRA array processing
In cghRA: Array CGH Data Analysis and Visualization


These functions implement the cghRA workflow as a sequence of process subfunction calls. Each of them relies on cghRA.array and cghRA.regions methods, so custom processing can easily be achieved by calling these methods directly if the steps argument is not flexible enough for your purpose.

Custom steps can also be added on the model of the existing ones, by defining a function called process.NAME and adding "NAME" to the steps vector in the call to process. Step functions must accept at least an input argument, which receives the value returned by the previous step, thus forming a pipeline.
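To illustrate the naming convention and the way each step feeds the next one, here is a minimal sketch in base R. The dispatcher `run_steps` and the two steps `process.center` and `process.clip` are hypothetical names introduced for this example only; this is not cghRA's own dispatch code, and it works on a plain numeric vector rather than a cghRA.array object.

```r
# Hypothetical custom step: center values on their median
process.center <- function(input, ...) {
  input - stats::median(input, na.rm = TRUE)
}

# Hypothetical custom step: clip values to [-limit, limit]
process.clip <- function(input, limit = 2, ...) {
  pmin(pmax(input, -limit), limit)
}

# Toy dispatcher (an assumption, not the cghRA implementation):
# looks up "process.<step>" by name and chains the outputs
run_steps <- function(input, steps, ...) {
  for (step in steps) {
    stepFun <- get(paste0("process.", step), mode = "function")
    input <- stepFun(input, ...)
  }
  input
}

result <- run_steps(c(-3, 0, 1, 5), steps = c("center", "clip"))
print(result)
```

Each step accepts `...` so that arguments meant for other steps can pass through without raising an error, which is presumably why every process subfunction in the usage section does the same.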

The tk.process function is a wrapper for process, built around a Tcl-Tk interface for more user-friendliness.

The process function is a multi-core command line interface that will dispatch its arguments to individual process.core calls, and should be the preferred entry point even on single-core computers. process.log is a wrapper to process.core which captures warnings and errors into a log file.
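The general idea of a wrapper that captures warnings and errors into a log file can be sketched in base R as follows. The helper `with_log` and the failing `noisy_step` function are hypothetical and only illustrate the technique; the actual behavior and log format of process.log may differ.

```r
# Hypothetical wrapper: run fun(...) and write warnings and errors to logFile
with_log <- function(fun, ..., logFile) {
  cat("", file = logFile)  # start from an empty log file
  logLine <- function(level, condition) {
    cat(
      sprintf("[%s] %s\n", level, conditionMessage(condition)),
      file = logFile, append = TRUE
    )
  }
  tryCatch(
    withCallingHandlers(
      fun(...),
      warning = function(w) {
        logLine("WARNING", w)
        invokeRestart("muffleWarning")  # log it, then resume execution
      }
    ),
    error = function(e) {
      logLine("ERROR", e)  # log it and stop this call only
      invisible(NULL)
    }
  )
}

# Hypothetical step that warns, then fails
noisy_step <- function(x) {
  warning("negative values found")
  stop("cannot continue")
}

logFile <- tempfile(fileext = ".log")
with_log(noisy_step, 1, logFile = logFile)
logged <- readLines(logFile)
print(logged)
```

Warnings are handled with a calling handler so that processing resumes after logging, while errors are caught at the outer level so that one failing input does not abort the others.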

The process.default function is a common way for process and tk.process to obtain default values for complex arguments like 'segmentArgs' and 'modelizeArgs'. It can also be used to retrieve the profiles offered by tk.process for use in process.

  process(inputs, logFile = "process.log", cluster = NA, ...)
  process.log(..., logFile)
  process.core(input, inputName, steps = c("parse", "mask", "replicates", "waca",
    "export", "spatial", "segment", "fill", "modelize", "export", "fittest", "export",
    "applyModel", "export"), ...)
  process.parse(input, design, probeParser = Agilent.probes, probeArgs = list(), ...)
  process.probes(input, design, ...)
  process.regions(input, ...)
  process.mask(input, ...)
  process.replicates(input, replicateFun = stats::median, ...)
  process.waca(input, ...)
  process.spatial(input, outDirectory, ...)
  process.segment(input, segmentArgs = process.default("segmentArgs"), ...)
  process.fill(input, ...)
  process.modelize(input, modelizeArgs = process.default("modelizeArgs"), ...)
  process.applyModel(input, ...)
  process.fittest(input, ...)
  process.export(input, outDirectory, ...)
  tk.process(globalTopLevel, localTopLevel)
  process.default(argName, profileName)

`inputs`	List of `input` to dispatch to each node (preferably named). The default workflow expects it to be a character vector naming raw data files to be parsed.
`logFile`	Single character value, the path to the log file to produce with messages, warnings and errors. If the file already exists, it will be emptied first. The behavior when `logFile` is set to `NA` or "" depends on `cluster`: if `cluster` is `FALSE` (unparallelized mode), messages and errors are passed to the R console rather than logged in a file; for any other value of `cluster`, they are silently ignored.
`cluster`	Arguments to be passed to `makeCluster` as a list, for parallel processing (requires the optional `parallel` package). Remote machines are not handled properly in the current version of `process`, so you should only provide "spec", an integer value defining how many processors can be used on the local machine. The `FALSE` value forces an unparallelized mode, slower but more suitable for error tracking. The `NA` default value tries to detect the CPU count on the local machine if `parallel` is installed, and otherwise switches to unparallelized mode.
`...`	Further arguments to be passed to `process` sub-functions, depending on the `steps` chosen (see below). The default workflow expects at least `design` and `outDirectory` to be provided.
`input`	A single input to process on one node. The default workflow expects it to be a single character value naming a raw data file to be parsed.
`inputName`	Single character value, the name of the input currently processed (for logging only).
`steps`	Ordered character vector, naming the processing steps to apply. Custom steps can be named as well, as long as a function named "process.[step]" exists in the global environment. Each step will take as input the output of the previous step, the first step taking the value of the `input` argument as input.
`probeParser`	The function to parse `probeFiles` into `cghRA.probes` objects, such as `Agilent.probes` for Agilent FeatureExtraction arrays.
`probeArgs`	A list of arguments to pass to `probeParser` (apart from 'file' which is always provided).
`design`	Single character value, the path and name of the RDT design file, as produced by `tk.design`.
`replicateFun`	The function to apply to replicate groups, if the "replicates" step is to be applied. This function must take a vector of numeric values (logRatios) as input, and return a single representative value (typically `median` or `mean`).
`outDirectory`	Single character value, the directory in which to produce the output files.
`segmentArgs`	Character vector, the arguments to be passed to the `DNAcopy` method of the `cghRA.array` class. Arguments are defined as a character string that will be parsed, multiple values define multiple segmentation profiles to apply sequentially.
`modelizeArgs`	Single character value, the arguments to be passed to the `model.auto` method of the `cghRA.array` class. Arguments are defined as a character string that will be parsed.
`argName`	Single character value, 'segmentArgs' or 'modelizeArgs', the argument to get the default value for. If missing, the list of profiles and arguments handled is returned.
`profileName`	Single character value, altering the default values returned. If missing, the default profile is returned.
`globalTopLevel`	This argument should be filled only when embedding this Tcl-Tk interface in another. It is the top level of the embedding interface, generally a call to `tktoplevel`.
`localTopLevel`	This argument should be filled only when embedding this Tcl-Tk interface in another. It is the local top level to use to build this interface, generally a `tkframe` or `ttkframe`.
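As an illustration of how these arguments fit together, the following sketch prepares a call to `process` for two samples in unparallelized mode. All file and directory names are hypothetical placeholders, and the call is only attempted when the cghRA package is installed and the input files exist, so the block can be run as-is without side effects.

```r
# Hypothetical raw data files, named so that logs can identify each input
inputs <- c(sampleA = "raw/sampleA.txt", sampleB = "raw/sampleB.txt")

callArgs <- list(
  inputs = inputs,
  logFile = "process.log",   # emptied first if it already exists
  cluster = FALSE,           # unparallelized mode, easier for error tracking
  design = "design.rdt",     # hypothetical design file, as produced by tk.design
  outDirectory = "results"   # hypothetical output directory
)

# Only run when the package and the (hypothetical) input files are available
if (requireNamespace("cghRA", quietly = TRUE) && all(file.exists(inputs))) {
  do.call(cghRA::process, callArgs)
}
```

`design` and `outDirectory` are not formal arguments of `process`; they travel through `...` to the steps that need them.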

Only process.default returns something: if argName is provided it returns the default value for the queried argument, else a list of profiles available for each handled argument. When several profiles are available, the first value in the list is the default one (returned when profileName is missing).

The complete workflow involves the following steps:

parse: Read a raw data file and return a cghRA.array object.
probes: Read a cghRA.probes object stored in a RDT file and return a cghRA.array object.
regions: Read one or many cghRA.regions file(s) stored in RDT file(s).
mask: Discard flagged probes (saturated, high background ...) in a cghRA.array object. Any TRUE value in a column whose name begins with "flag_" is enough to discard a probe (turn its logRatio into NA). See the cghRA.array$maskByFlag() method for further details.
replicates: Replace replicated probe groups (same "name") by a single representative value (all logRatios are turned to NA except the first one, which will hold the representative value). See the cghRA.array$replicates() method for further details.
waca: Apply the WACA algorithm to the logRatios. See the cghRA.array$WACA() method for further details.
spatial: Produce a PNG file to visually check spatial biases. See the cghRA.array$spatial() method for further details.
segment: Compute regions with similar logRatios along the genome, using the CBS algorithm. See the cghRA.array$DNAcopy() method for further details.
fill: Extend segments to the right to join consecutive segments. See the cghRA.regions$fillGaps() method for further details.
modelize: Fit a copy number model to segments, in order to convert logRatios to true copy numbers. If segmentArgs contains multiple values, each segmentation profile will lead to distinct "copies" and "regions" files numbered according to its position in segmentArgs. See the cghRA.regions$model.auto() method for further details.
applyModel: Convert a modelized cghRA.regions object into cghRA.copies.
fittest: If multiple segmentation profiles have been used, select the fittest model ("copies" and "regions" files duplicated without number). For further details on the STM score used for fittest model selection, see the model.auto function of the cghRA.copies package.
clean: Erase "copies" and "regions" files of the different segmentation profiles tested, as "fittest" should have saved the best.
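
The logic described for the "mask" and "replicates" steps can be sketched on a toy data frame in base R. This is not the code of cghRA.array$maskByFlag() or cghRA.array$replicates(); the column layout, the helper `collapse_replicates` and the choice to ignore NA values within a replicate group are assumptions made for the example.

```r
# Toy probe table (hypothetical layout): p1 and p3 are replicated probes
probes <- data.frame(
  name = c("p1", "p1", "p2", "p3", "p3", "p3"),
  logRatio = c(0.10, 0.30, -0.50, 1.00, 1.20, 5.00),
  flag_saturated = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE),
  flag_background = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# "mask": any TRUE in a column named "flag_*" turns the logRatio into NA
flagColumns <- grep("^flag_", colnames(probes), value = TRUE)
flagged <- rowSums(probes[, flagColumns, drop = FALSE]) > 0
probes$logRatio[flagged] <- NA

# "replicates": the first probe of each name holds the representative value,
# all other probes of the group are turned to NA
collapse_replicates <- function(name, logRatio, replicateFun = stats::median) {
  out <- rep(NA_real_, length(logRatio))
  for (group in unique(name)) {
    members <- which(name == group)
    values <- logRatio[members]
    values <- values[!is.na(values)]  # assumption: masked probes are ignored
    if (length(values) > 0) out[members[1]] <- replicateFun(values)
  }
  out
}

probes$logRatio <- collapse_replicates(probes$name, probes$logRatio)
print(probes[, c("name", "logRatio")])
```

Running "mask" before "replicates", as in the default steps vector, keeps the saturated value of p3 out of its group median.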