model.auto: Automatic generation of copy number model
In cghRA: Array CGH Data Analysis and Visualization


This function computes a copy number model, as needed by model.apply to translate logRatios into copy numbers.

  model.auto(segLogRatios, segChroms, segLengths = rep(1, length(segLogRatios)),
    from = 0.02, to = 0.5, by = 0.001, precision = 512, maxPeaks = 8, minWidth = 0.15,
    maxWidth = 0.9, minDensity = 0.001, peakFrom = -2, peakTo = 1.3, ploidy = 0,
    discreet = FALSE, method = c("stm", "sdd", "ptm"), exclude = c("X", "Y", "Xp", "Xq",
    "Yp", "Yq"))

`segLogRatios`	Double vector, the log ratios of the CGH segments to model.
`segChroms`	Vector, the chromosomes holding the CGH segments to model.
`segLengths`	Double vector, the lengths of the CGH segments to model. Probe counts are preferred when available, but nucleotide lengths or no length at all (the default, which gives every segment a length of 1) can also be used.
`from`	Single double value, the minimal bandwidth to test for `density`.
`to`	Single double value, the maximal bandwidth to test for `density`.
`by`	Single double value, the step between successive bandwidths to test for `density`.
`precision`	Single integer value, the number of points to compute for `density`. As the `density` help page suggests, values greater than 512 should be powers of 2.
`maxPeaks`	Single integer value, the maximal number of peaks in the density distribution for a model to be considered valid.
`minWidth`	Single double value, minimal value allowed for the `width` model parameter (thus for tumoral cell proportion in the sample).
`maxWidth`	Single double value, maximal value allowed for the `width` model parameter (thus for tumoral cell proportion in the sample).
`minDensity`	Single double value, minimal density for a peak to be detected.
`peakFrom`	Single double value, minimal logRatio for a peak to be detected. Use `NA` for no lower limit. For a more precise model, only the peaks corresponding to 1, 2 and 3 copies should be considered.
`peakTo`	Single double value, maximal logRatio for a peak to be detected. Use `NA` for no upper limit. For a more precise model, only the peaks corresponding to 1, 2 and 3 copies should be considered.
`ploidy`	Single numeric value, copy number supposed to be the most common within the analyzed genome.
`discreet`	Single logical value. If `FALSE`, a modeling failure raises an error; if `TRUE`, an `NA`-filled model is returned instead.
`method`	Single character value, the statistic to minimize ("stm" is the default). See 'Details'.
`exclude`	Vector, the chromosomes to exclude from the density computation and to plot with distinct symbols (use `NULL` to disable this feature). Sex chromosomes should be excluded when the DNA source is heterogeneous, as their imbalance (two 'X' and no 'Y' in women) affects normal cells as well as tumoral ones.
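
The `from`, `to`, `by`, `precision` and `minDensity` arguments describe a scan over `density` bandwidths followed by peak detection. The sketch below illustrates that idea with base R only; it is not the cghRA implementation. The simulated data, the `countPeaks` helper and the coarse 0.02 step are assumptions made for illustration.

```r
# Illustrative sketch only, NOT the code used by model.auto
set.seed(1)

# Hypothetical segment log ratios clustered around three levels
x <- c(rnorm(20, -0.6, 0.05), rnorm(100, 0, 0.05), rnorm(50, 0.4, 0.05))

# Equal weights here; density() expects weights that sum to 1
w <- rep(1, length(x)) / length(x)

# Hypothetical helper: count local maxima whose height reaches minDensity
countPeaks <- function(d, minDensity = 0.001) {
  turns <- diff(sign(diff(d$y)))      # -2 marks a rise followed by a fall
  idx <- which(turns == -2) + 1L      # positions of the local maxima
  sum(d$y[idx] >= minDensity)         # ignore negligible bumps
}

# Scan a coarse grid of bandwidths (model.auto defaults to a 0.001 step)
bandwidths <- seq(from = 0.02, to = 0.5, by = 0.02)
peakCounts <- sapply(bandwidths, function(bw) {
  countPeaks(density(x, bw = bw, n = 512, weights = w))
})

print(data.frame(bw = bandwidths, peaks = peakCounts))
```

Small bandwidths tend to split the distribution into many peaks, while large ones merge everything into a single peak; `maxPeaks` presumably discards the over-fragmented end of that range.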

More details about the cghRA copy number model and modelization can be found in the vignette associated with this package, as well as in the related publication. Once the parameters of a model (width and center) are set, three scores can be computed to assess how well it fits the data:

STM is the "Segment To Model" score, computed at the segment level as the average of the residuals weighted by the segment size (in probe counts). Residuals are computed as the absolute difference between exact copy numbers (see the copies function) and their rounding, assuming that copy numbers should be integers and that decimal parts are noise in the model. This is the recommended score to use with cghRA.
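
As a minimal sketch of the STM computation described above, the following base R code takes a vector of exact copy numbers as given. The values in `exactCopies` and `segLengths` are made up for illustration; in practice the exact copy numbers would come from the model (see the copies function), whose internals are not reproduced here.

```r
# Hypothetical exact (non-rounded) copy numbers for five segments
exactCopies <- c(1.05, 1.98, 2.10, 2.95, 2.02)

# Hypothetical segment sizes, in probe counts
segLengths <- c(10, 200, 50, 30, 120)

# Residual: distance between each exact copy number and the nearest integer
residuals <- abs(exactCopies - round(exactCopies))

# STM: average of the residuals weighted by segment size
stm <- weighted.mean(residuals, w = segLengths)
print(stm)
```

Weighting by probe counts lets long, well-supported segments dominate the score, so a short noisy segment barely affects it.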

PTM is the "Peak To Model" score, computed at the peak level as the average of the residuals. Residuals are computed as the absolute difference between exact copy numbers (see the copies function) and their rounding, assuming that copy numbers should be integers and that decimal parts are noise in the model.
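
A comparable sketch for PTM, assuming the exact copy numbers of the detected density peaks are already known. The `peakCopies` values are hypothetical.

```r
# Hypothetical exact copy numbers at three detected density peaks
peakCopies <- c(1.04, 2.00, 2.91)

# Residual of each peak with respect to the nearest integer copy number
peakResiduals <- abs(peakCopies - round(peakCopies))

# PTM: plain (unweighted) average over peaks
ptm <- mean(peakResiduals)
print(ptm)
```

Each peak counts equally here, whatever the number or size of the segments behind it.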

SDD is the "Standard Deviation of peak Differences" score. As its name suggests, it is computed as the standard deviation (`sd`) of the differences between consecutive peaks, considering that good models should show very regularly spaced density peaks.

Returns a double vector, with the following values:
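
The SDD definition maps directly onto two base R calls. In the sketch below, the peak positions are hypothetical, and the scale on which cghRA measures them is not stated here, so treat the numbers as illustrative only.

```r
# Hypothetical positions of four consecutive density peaks
peaks <- c(-0.62, -0.01, 0.41, 0.70)

# Differences between consecutive peaks, then their standard deviation
sdd <- sd(diff(peaks))
print(sdd)

# Perfectly regular spacing yields a score of zero
sddRegular <- sd(diff(c(-1, 0, 1)))
print(sddRegular)
```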

`bw`	Bandwidth used for `density` computation.
`peaks`	Amount of peaks considered in the model.
`peakFrom`	See the `peakFrom` argument.
`peakTo`	See the `peakTo` argument.
`center`	Center parameter of the model.
`width`	Width parameter of the model.
`ploidy`	Ploidy parameter of the model, as provided.
`sdd`	Quality statistic, see 'Details'.
`ptm`	Quality statistic, see 'Details'.
`stm`	Quality statistic, see 'Details'.
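
A short sketch of how such a result might be inspected, assuming the returned vector is named after the entries listed above. The `model` vector below is a hand-built stand-in, not real `model.auto` output: it mimics the `NA`-filled model that `discreet = TRUE` is documented to return when modeling fails.

```r
# Hand-built stand-in for a failed model (assumed layout, hypothetical values)
model <- c(
  bw = NA, peaks = NA, peakFrom = -2, peakTo = 1.3,
  center = NA, width = NA, ploidy = 0,
  sdd = NA, ptm = NA, stm = NA
)

# A model is usable only if both of its parameters were estimated
modelFailed <- anyNA(model[c("center", "width")])

if (modelFailed) {
  message("No valid model found, skipping this sample")
} else {
  print(model[c("center", "width", "stm")])
}
```

Checking for `NA` this way allows a batch of samples to be processed without wrapping every call in `tryCatch`.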

Sylvain Mareschal

model.test, model.apply

  # Generating random segmentation results
  ## with 30% normal cells contamination
  ## with +10% for normal DNA labelling
  segLogRatios <- c(
    rnorm(
      sample(5:20, 1),
      mean = log((1*0.7 + 2*0.3)/(2*1.1), 2),   # One deletion
      sd = 0.08
    ),
    rnorm(
      sample(80:120, 1),
      mean = log(2/(2*1.1), 2),                 # No alteration
      sd = 0.08
    ),
    rnorm(
      sample(40:60, 1),
      mean = log((3*0.7 + 2*0.3)/(2*1.1), 2),   # One more copy
      sd = 0.08
    )
  )
  segLogRatios <- sample(segLogRatios)
  segLengths <- as.integer(3 + round(rchisq(length(segLogRatios), 1)*100))
  segEnds <- cumsum(segLengths)
  segStarts <- c(1L, head(segEnds, -1) + 1L)   # Each segment starts right after the previous one ends
  segChroms <- rep("chr1", length(segEnds))
  
  # Generated genome
  genome <- data.frame(
    segChroms,
    segStarts,
    segEnds,
    segLogRatios,
    segLengths
  )
  print(genome)
  
  # Automatic modelization
  model <- model.auto(
    segLogRatios = segLogRatios,
    segChroms = segChroms,
    segLengths = segLengths
  )
  print(model)
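
The three `mean` values used in the simulation above follow a single formula, which can be factored into a helper. The function name `expectedLogRatio` and its argument names are hypothetical; the arithmetic is the one written in the example (70% tumoral cells, 30% normal cells carrying 2 copies, and a +10% labelling bias on the normal DNA).

```r
# Hypothetical helper reproducing the means used in the example above
expectedLogRatio <- function(copies, tumorFraction = 0.7, labelBias = 1.1) {
  # Average copy number in the sample: tumoral cells carry 'copies',
  # contaminating normal cells carry 2
  sampleCopies <- copies * tumorFraction + 2 * (1 - tumorFraction)
  # The reference is normal DNA (2 copies), over-labelled by 'labelBias'
  log2(sampleCopies / (2 * labelBias))
}

# Expected log ratios for one deletion, no alteration and one extra copy
expected <- expectedLogRatio(1:3)
print(expected)
```

Because of the normal cell contamination and the labelling bias, the unaltered peak does not sit at 0 and the gaps between peaks are narrower than in a pure sample, which is the kind of shift and compression the `center` and `width` parameters are presumably meant to capture.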