preProcSample: Pre-process a sample
In mskcc/facets: Cellular Fraction and Copy Numbers from Tumor Sequencing


Processes a SNP read count matrix and generates a segmentation tree.


  preProcSample(rcmat, ndepth=35, het.thresh=0.25, snp.nbhd=250, cval=25,
       deltaCN=0, gbuild=c("hg19", "hg38", "hg18", "mm9", "mm10", "udef"),
       ugcpct = NULL, hetscale=TRUE, unmatched=FALSE, ndepthmax=1000)

`rcmat`	data frame with 6 required columns: `Chrom`, `Pos`, `NOR.DP`, `NOR.RD`, `TUM.DP` and `TUM.RD`. Additional variables are ignored.
`ndepth`	minimum normal sample depth to keep
`het.thresh`	variant allele frequency (VAF) threshold to call a SNP heterozygous
`snp.nbhd`	window size; only a single locus is sampled from each interval of this length
`cval`	critical value for segmentation
`deltaCN`	minimum detectable difference in CN from diploid state
`gbuild`	genome build used for the alignment. The default is human genome build hg19. Other possibilities are hg38 and hg18 for human, and mm9 and mm10 for mouse. Chromosomes used for analysis are `1-22, X` for human and `1-19` for mouse. The option udef can be used to analyze other genomes.
`ugcpct`	If udef is chosen for gbuild, the appropriate GC percentage data must be provided through this option. This is a list of numeric vectors giving the GC percentage in windows of width 1000 bases in steps of 100, i.e. 1-1000, 101-1100 etc., for the autosomes and the X chromosome.
`hetscale`	logical variable indicating whether logOR should get more weight in the test statistics for segmentation and clustering. Usually only 10% of SNPs are hets, and hetscale sets the logOR contribution to T-square to 0.25/proportion of hets.
`unmatched`	indicator of whether the normal sample is unmatched. When this is TRUE hets are called using tumor reads only and logOR calculations are different. Use het.thresh = 0.1 or lower when TRUE.
`ndepthmax`	loci for which normal coverage exceeds this number (default is 1000) will be discarded as PCR duplicates. For high-coverage samples, increase this and ndepth commensurately.
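
As a rough illustration of how `ndepth`, `ndepthmax`, `het.thresh` and `hetscale` are described above, the base R sketch below applies them to a made-up table. The read counts are invented, and the het-calling rule (normal-sample VAF between `het.thresh` and `1 - het.thresh`) is an assumption made for this example; the package's internal code may differ in detail.

```r
# Toy read-count table with the six required columns (all values are made up).
rcmat <- data.frame(
  Chrom  = c("1", "1", "1", "1", "2"),
  Pos    = c(1000L, 1100L, 5000L, 9000L, 2000L),
  NOR.DP = c(40L, 20L, 60L, 1500L, 50L),
  NOR.RD = c(21L, 10L, 60L, 700L, 2L),
  TUM.DP = c(80L, 35L, 90L, 2000L, 70L),
  TUM.RD = c(50L, 20L, 90L, 900L, 1L)
)

# Check that the required columns are present; extra columns would be ignored.
required <- c("Chrom", "Pos", "NOR.DP", "NOR.RD", "TUM.DP", "TUM.RD")
stopifnot(length(setdiff(required, names(rcmat))) == 0)

ndepth <- 35
ndepthmax <- 1000
het.thresh <- 0.25

# Keep loci whose normal depth is at least ndepth and does not exceed ndepthmax.
keep <- rcmat$NOR.DP >= ndepth & rcmat$NOR.DP <= ndepthmax
kept <- rcmat[keep, ]

# Assumed het rule: normal VAF is at least het.thresh away from both 0 and 1.
vaf_normal <- kept$NOR.RD / kept$NOR.DP
het <- pmin(vaf_normal, 1 - vaf_normal) > het.thresh

# hetscale arithmetic from the argument description: with 10% hets,
# the logOR contribution to T-square is 0.25 / 0.10.
logor_weight <- 0.25 / 0.10
```

In this toy table the second locus falls below `ndepth` and the fourth exceeds `ndepthmax`, so three loci remain, of which only the first would be called heterozygous under the assumed rule.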

The SNPs in a genome are not evenly spaced, and some regions have multiple SNPs in a small neighborhood. Using all loci would therefore induce serial correlation in the data. To avoid this, loci are sampled so that only a single locus is used in any interval of length snp.nbhd. Because this sampling is random, call set.seed beforehand to fix the random number generator seed and obtain reproducible results.
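
The thinning idea can be sketched in base R as follows. This is a simplified illustration, not the package's actual sampling code: `thin_loci` is a hypothetical helper that splits positions into fixed windows of width `nbhd` and keeps one randomly chosen locus per window. It mainly shows why fixing the seed matters.

```r
# Hypothetical helper: keep one randomly chosen locus per window of width nbhd.
thin_loci <- function(pos, nbhd = 250) {
  window <- (pos - 1) %/% nbhd          # window index for each position
  picked <- tapply(seq_along(pos), window, function(idx) {
    idx[sample.int(length(idx), 1)]     # safe even when a window has one locus
  })
  sort(as.integer(picked))              # indices of the retained loci
}

# Made-up positions on a single chromosome.
pos <- c(10, 40, 120, 300, 310, 900, 1200, 1210, 1240)

# The same seed gives the same retained loci.
set.seed(42)
a <- thin_loci(pos)
set.seed(42)
b <- thin_loci(pos)
identical(a, b)
```

Without the `set.seed` calls, two runs may retain different loci from the crowded windows, which is the source of run-to-run variation the paragraph above warns about.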

A list consisting of four elements:

`pmat`	Read counts and other elements of all the loci
`seg.tree`	a list of matrices, one for each chromosome. Each matrix gives the tree structure of the splits. Each row corresponds to a segment and holds the parent row as the first element, followed by the start-1 and end index of the segment and the maximal T^2 statistic. The first row is the whole chromosome, and its parent row is by definition 0.
`jointseg`	The data that were segmented. Only the loci that were sampled within each snp.nbhd window are present. Segmentation results are included.
`hscl`	scaling factor for logOR data.
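
A minimal usage sketch follows, assuming the facets package is installed and ships the example file `stomach.csv.gz` under `extdata`, and that `readSnpMatrix` reads it (as in the package vignette). If either assumption does not hold, the block skips the call rather than failing.

```r
# Guard: run only when facets and its example data are available.
has_facets <- requireNamespace("facets", quietly = TRUE)
datafile <- if (has_facets) {
  system.file("extdata", "stomach.csv.gz", package = "facets")
} else {
  ""
}

if (nzchar(datafile)) {
  set.seed(1234)                               # fix the seed for reproducibility
  rcmat <- facets::readSnpMatrix(datafile)     # read the SNP read count matrix
  xx <- facets::preProcSample(rcmat, gbuild = "hg19")

  print(names(xx))           # elements of the returned list
  str(xx$seg.tree[[1]])      # split tree for the first chromosome
  print(head(xx$jointseg))   # sampled loci that were segmented
  print(xx$hscl)             # scaling factor for logOR
}
```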