call.abnormal.cov: Call abnormal bins


View source: R/call.abnormal.cov.R

Description

Detect abnormal bins from the Z-score distribution. A normal distribution is first fitted to the Z-score distribution. P-values are computed from this estimated null distribution and corrected for multiple testing. Finally, consecutive bins with abnormal read counts can be merged.

Usage

call.abnormal.cov(files.df, samp, out.pdf = NULL, FDR.th = 0.05,
  merge.cons.bins = c("stitch", "zscores", "cbs", "no"), stitch.dist = NULL,
  max.gap.size = 1e+05, z.th = c("sdest", "consbins", "sdest2N"),
  norm.stats = NULL, min.normal.prop = 0.9, aneu.chrs = NULL,
  gc.df = NULL, sub.z = NULL, outfile.pv = NULL)

Arguments

files.df

a data.frame with the paths to the different sample files (bin counts, Z-scores, ...). Here columns 'z' and 'fc' are used to retrieve Z-scores and fold changes.

samp

the name of the sample to analyze.

out.pdf

the name of the output pdf file.

FDR.th

the False Discovery Rate to use for the calls.

merge.cons.bins

how the bins should be merged. Default is 'stitch'. 'zscores' is another approach (see Details); 'no' means no bin merging.

stitch.dist

the maximal distance between two calls to be merged into one (if 'merge.cons.bins="stitch"'). If NULL (default), the bin size + 1 is used.

max.gap.size

the maximum gap between bins allowed for CBS. Default is 100 kb. Calls will not span gaps larger than this (e.g. centromere).

z.th

how the threshold for abnormal Z-scores is chosen. Default is 'sdest', which also uses the 'FDR.th' parameter. 'consbins' looks at the number of consecutive bins, see Details.

norm.stats

the name of the file with the normalization statistics ('norm.stats' in 'tn.norm' function) or directly a 'norm.stats' data.frame.

min.normal.prop

the minimum proportion of the regions expected to be normal. Default is 0.9. For cancers with many large aberrations, this number can be lowered. The maximum accepted value is 0.98.

aneu.chrs

the names of the chromosomes to remove because they were flagged as aneuploid. If NULL (default), all chromosomes are analyzed.

gc.df

a data.frame with the GC content in each bin, for the Z-score normalization. Columns required: chr, start, end, GCcontent. If NULL (default), no normalization is performed.

sub.z

if non-NULL, the number of bins in a sub-segment for Z-score null distribution estimation. Default is NULL. For highly rearranged genomes (e.g. cancer), try '1e4'.

outfile.pv

if non-NULL, the name of the file to write all the P-values (for all bins). Used in some analyses (e.g. annotate.with.parents).

Details

Three approaches can be used to decide whether a bin is abnormal. By default ('sdest'), the standard deviation of the null Normal distribution is estimated by sequentially trimming the Z-score distribution and using an estimator for censored values. Once the Z-scores corresponding to the abnormal bins are trimmed out, the estimator reaches a plateau, which is used as the estimate of the null standard deviation. Using this parameter, P-values and Q-values are computed; abnormal bins are then defined by a user-defined FDR threshold on the Q-values.

An alternative approach, 'consbins', looks at the distribution of consecutive bins to define the best threshold on the Z-scores. A wide range of thresholds is explored. For each threshold, selected bins are stitched together if directly consecutive, and the proportion of single and pair bins is computed. With a loose threshold (many selected bins), pairs of consecutive bins happen by chance. More stringent thresholds decrease the proportion of pairs and increase the number of single bins, until the selection reaches true calls, which are more likely to be consecutive. The Z-score threshold is defined as the changepoint between the random and true call distributions.

Finally, another version of 'sdest' is implemented, this time fitting two Gaussian distributions (both centered on 0). This approach, 'sdest2N', is better suited when the tested sample is suspected not to be completely comparable to the reference samples. With two Gaussian distributions, a longer tail can be integrated into the null distribution, reducing potential false calls in the presence of a long tail.
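The last step of the default 'sdest' approach, going from a null standard deviation to P-values and Q-values, can be sketched in base R. This is only an illustration under stated assumptions, not PopSV's implementation: the null standard deviation is approximated with 'mad()' (a swapped-in robust estimator, not the trimming-based estimator for censored values described above), the test is assumed to be two-sided, and Q-values are assumed to come from a Benjamini-Hochberg correction via 'p.adjust()'. The simulated Z-scores and the names 'sd.null' and 'abnormal' are hypothetical.

```r
set.seed(1)
# Simulated Z-scores: 990 normal bins followed by 10 bins with a strong signal
z <- c(rnorm(990), rep(c(-8, 8), 5))

# Robust estimate of the null standard deviation
# (assumption: stand-in for the trimming-based estimator used by 'sdest')
sd.null <- mad(z)

# Two-sided P-values under a Normal null distribution centered on 0
pv <- 2 * pnorm(-abs(z), mean = 0, sd = sd.null)

# Multiple-testing correction (assumption: Benjamini-Hochberg)
qv <- p.adjust(pv, method = "fdr")

# Indices of the bins called abnormal at the chosen FDR threshold
FDR.th <- 0.05
abnormal <- which(qv <= FDR.th)
```

Because the robust estimator is barely affected by the few extreme Z-scores, the null standard deviation stays close to 1 in this simulation and the ten simulated aberrant bins pass the FDR threshold.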

Two approaches are available to merge bins with abnormal read coverage. 'stitch' simply stitches bins passing a user-defined significance threshold. In this approach, the stitching distance specifies the maximum distance between two bins that will be merged. By default the bin size is used, i.e. two abnormal bins will be merged if separated by at most one bin. The 'zscores' approach looks at the Z-scores of two consecutive bins: if the minimum (maximum) is significantly higher (lower) than a simulated null distribution, these two bins will be merged to create a larger duplication (deletion).
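The 'stitch' idea can be sketched in base R as follows. 'stitch.bins' is a hypothetical helper written for this illustration and is not part of PopSV; it assumes 1-based inclusive coordinates, a non-empty input, and that the distance between two bins is measured from the end of one to the start of the next.

```r
# Hypothetical helper (not a PopSV function): merge abnormal bins that are on
# the same chromosome and separated by at most 'stitch.dist' bp.
# Assumes 'bins.df' has at least one row and columns chr, start, end.
stitch.bins <- function(bins.df, stitch.dist) {
  bins.df$chr <- as.character(bins.df$chr)
  bins.df <- bins.df[order(bins.df$chr, bins.df$start), ]
  n <- nrow(bins.df)
  # Distance between the end of a bin and the start of the next one
  gap <- bins.df$start[-1] - bins.df$end[-n]
  # A new segment starts when the chromosome changes or the gap is too large
  new.seg <- c(TRUE, bins.df$chr[-1] != bins.df$chr[-n] | gap > stitch.dist)
  seg <- cumsum(new.seg)
  data.frame(
    chr = as.vector(tapply(bins.df$chr, seg, function(x) x[1])),
    start = as.vector(tapply(bins.df$start, seg, min)),
    end = as.vector(tapply(bins.df$end, seg, max)),
    nb.bin.cons = as.vector(table(seg)),
    stringsAsFactors = FALSE
  )
}

# Made-up abnormal bins of 1 kb
bins <- data.frame(
  chr = c("1", "1", "1", "1", "2"),
  start = c(1, 1001, 3001, 8001, 1),
  end = c(1000, 2000, 4000, 9000, 1000),
  stringsAsFactors = FALSE
)
calls <- stitch.bins(bins, stitch.dist = 1001)
```

With a stitching distance of 1001 bp, the first three bins on chromosome 1 are merged into one call spanning 1 to 4000, even though one normal bin separates the second and third. The bin starting at 8001 and the bin on chromosome 2 stay separate.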

For cancer samples, 'min.normal.prop' can be reduced, e.g. to 0.6. Aneuploid chromosomes can also be removed with 'aneu.chrs'. The function 'aneuploidy.flag' can help flag aneuploid chromosomes.

Value

a data.frame with columns

chr, start, end

the genomic region definition.

z

the Z-score.

pv, qv

the P-value and Q-value (~FDR).

fc

the copy number estimate (if 'fc' was not NULL).

nb.bin.cons

the number of consecutive bins (if the bins were merged, i.e. 'merge.cons.bins!="no"').

cn2.dev

Copy number deviation from the reference.
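As a rough illustration of how the returned columns might be used downstream, the sketch below filters a made-up data.frame on 'qv' and labels each call from the sign of 'z'. The values are invented for the example, 'res.df', 'sig.df' and 'type' are hypothetical names, and reading negative Z-scores as deletions and positive ones as duplications is an assumption based on the Details section.

```r
# Made-up calls, only to illustrate the documented columns
res.df <- data.frame(
  chr = c("1", "1", "2"),
  start = c(1, 50001, 10001),
  end = c(5000, 55000, 15000),
  z = c(-6.2, 7.5, 2.1),
  qv = c(1e-4, 1e-6, 0.2),
  stringsAsFactors = FALSE
)

# Keep the calls passing the FDR threshold
FDR.th <- 0.05
sig.df <- res.df[res.df$qv <= FDR.th, ]

# Assumption: negative Z-scores are deletions, positive ones duplications
sig.df$type <- ifelse(sig.df$z < 0, "deletion", "duplication")
```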

Author(s)

Jean Monlong


jmonlong/PopSV documentation built on Sept. 15, 2019, 9:29 p.m.