Using the SICtools Package

Introduction to SICtools

High-through sequencing has become fundamental for deciphering sequence variants between two paired-samples in different conditions, which has been vastly used in detecting somatic variants between tumor and matched normal samples in current research of oncogenesis.The SICtools package is designed to find SNV/Indel differences between two bam files with near relationship in a way of pairwise comparison thourgh parsing the allele frequency of genotypes (single nucleotide and short indel) at each base position across the genome region of interest. The difference is inferred by two complementary measurements, fisher exact test p.value and euclidean distance d.value. For SNV comparison, the internal input is the base count (A,T,G,C) in a given position, parsed from pileup output from the two bam files; for indel comparison, reads for different indel alleles that span no less than 2bp on both sides of extended indel region (e.g. homopolymer region) are counted as internal input.The candidate variants with relatively lower p.value and higher d.value can thus be easily identified from the output of SICtools.

Getting started with SICtools

Getting started with SICtools for inspection of SNV/Indel difference between two bam files is quite easy. Two critical functions (snpDiff and indelDiff) will be available in R session with loading the package.

library(SICtools)

Function snpDiff()

Input

The essential capability provided by SICtools is its input. The two bam files to be compared should be aligned by same aligner of the same version (important!) and the same reference genome. In theory, the two bam files should be in near relationship, which means that the SNV/Indel differences are not expected too many. The input coordinate for region of interest should be the same format as the reference genome. The argument list of function snpDiff is below,

snpDiff(bam1, bam2, refFsa, regChr, regStart, regEnd, minBaseQuality = 13, minMapQuality = 0, nCores = 1, pValueCutOff = 0.05, baseDistCutOff = 0.1, verbose = TRUE)

Except the three main paramters (bam1,bam2 and refFsa), the region coordinate arguments (regChr,regStart and regEnd) are necessary to restrict the genome region of interest, since the comparison would be time consuming if the chromosome units of the species are very long, for example, human. Meanwhile, these coordinate arguments make the parallel calculation possible. For instance, the long chromosome could be chunked into pieces and compared in parallel, and the final output would be combined together. Even in the genome region of interest, a parameter nCores is to set the threads for parallel calculation, which will greatly short the time in the case of long region input.

In order to control the input quality of reads, two quality filter arguments (minBaseQuality and minMapQuality) are provided. The minBaseQuality score is stranded Phred score of Sanger format for each base. The 'minMapQuality' score is based on the aligner used, which means differnt threshods would be adopted to control the mapping quality of the whole reads for different aligners.

Two output control parameters (pValueCutOff and baseDistCutOff) are used to filter the output. Since most of genomic positions to compare are the same in theory, p.value of exact fisher test would be 1 and d.value of Euclidean distance would be 0 in thousands, even millons of positions, which is not expected as output, and will be excluded from the final output. The default pValueCutOff = 0.05, baseDistCutOff = 0.1 is proper for comparison between germline samples. Lower baseDistCutOff and higher pValueCutOff is probably needed for somatic samples.

If verbose = TRUE, the progress information of genomic positions will be showed on screen

Output

The output of SNV/Indel comparisons is a data.frame. It will report the base count/read count for each allele, p.value (from fisher exact test) and d.value (from euclidean distance) filtered by pre-defined cutoff of p.value and d.value. If nothing difference, NULL will be returned.

Example

The example will detect SNV differences between two bam files in the region "chr04:962501-1026983". Setting "pValueCutOff=1,baseDistCutOff=0" will detect tiny differences, while the exact same genotype positions will be excluded from output by default.

bam1 <- system.file(package='SICtools','extdata','example1.bam')
bam2 <- system.file(package='SICtools','extdata','example2.bam')
refFsa <- system.file(package='SICtools','extdata','example.ref.fasta')

snpDiffDf <- snpDiff(bam1,bam2,refFsa,'chr04',962501,1026983,pValueCutOff=1,baseDistCutOff=0)
snpDiffDf

A simple scatter plot will show the most different candidates locating at top-right.

plot(-log10(snpDiffDf$p.value),snpDiffDf$d.value,col='brown')

For more complex situation with hundreds of outputs, sorting the data frame by p.value and d.value would be very helpful to set custom cutoffs after mannually check.

snpDiffDfSort <- snpDiffDf[order(snpDiffDf$p.value,snpDiffDf$d.value),]
snpDiffDfSort

Function indelDiff()

Detecting indel differences between two bam files is usually mislead by finding two indel lists seperately and then overlap them. Instead, indelDiff will firstly extract the read counts of each indel genotypes in two bam files at the same genomic position, and then calculate p.value of fisher exact test and d.value of Euclidean distance for this position. In case that the reads can't span the long indel, only reads that cover more than 2bp adjacent the given indel region are taken into consideration.

Input and Output

The input of indelDiff is the same as snpDiff, however, the output genotype names of indelDiff are different. For function snpDiff, 'A', 'T','G' and 'C' are four informative base, while for the output of indelDiff, the three genotypes are 'ref','altGt1' and 'altGt2', which means two alternative indel genotypes will be considered in the given position in both bam files.

Example

A simple input example and its output is

indelDiffDf <- indelDiff(bam1,bam2,refFsa,'chr07',828514,828914,pValueCutOff=1,gtDistCutOff=0)
indelDiffDfSort <- indelDiffDf[order(indelDiffDf$p.value,indelDiffDf$d.value),]
indelDiffDfSort

Conclusion

Detecting SNV/Indel difference between two bam files is frequently used in many research fields of high-throughput sequencing. We thus provide these two simple R functions to make the comparison easy and accurate. Though the cutoff of p.value and d.value of SICtools are usually determined by cutsom data, a scatter plot -log10(p.value) vs. d.value is very helpful to achieve it based on our own expericence.

Session Info

The following package and versions were used in the production of this vignette.

sessionInfo()


Try the SICtools package in your browser

Any scripts or data that you put into this service are public.

SICtools documentation built on Nov. 8, 2020, 5:12 p.m.