The following steps are used to run the conta toolset.
First, install the conta library. From outside the conta folder, run:
R CMD INSTALL --preclean --no-multiarch --with-keep.source conta
The full dbSNP file may be downloaded from: ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/common_all_20180423.vcf.gz
A TSV or pileup file (containing allele counts for each SNP) is used as input, along with a dbSNP reference VCF file, to call contamination events. The analysis reports contamination calls, levels, and plots. It also displays CNV metrics and bincounts for the Y chromosome if the corresponding files are provided.

Input files:
- The dbSNP file must contain the CAF info field and rsid.
- TSV files must contain chr, pos, and counts for each allele. Two pileup formats are supported; see the example inputs under the test folders.
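To make the CAF requirement above concrete, here is a minimal base R sketch that extracts the CAF values from a VCF INFO string. The helper name `parse_caf` and the example INFO string are hypothetical and are not part of conta; conta's own parsing may differ.

```r
# Hypothetical helper (not part of conta): pull the CAF values out of a
# dbSNP INFO string such as "RS=123;CAF=0.9,0.1;COMMON=1".
parse_caf <- function(info) {
  # Split INFO into key=value fields and find the one that starts with "CAF="
  fields <- strsplit(info, ";", fixed = TRUE)[[1]]
  caf_field <- fields[startsWith(fields, "CAF=")]
  if (length(caf_field) == 0) {
    return(NA_real_)  # record lacks CAF
  }
  # Drop the "CAF=" prefix and split the comma-separated frequencies
  values <- strsplit(sub("^CAF=", "", caf_field[1]), ",", fixed = TRUE)[[1]]
  # Missing frequencies are written as "."; as.numeric turns them into NA
  suppressWarnings(as.numeric(values))
}

caf <- parse_caf("RS=123;CAF=0.9,0.1;COMMON=1")
```

A record for which this helper returns NA has no CAF field and would not satisfy the input requirement.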
This mode requires a set of samples that were already run with the conta analysis. It uses the genotypes that conta calculated for each sample to find samples that have a likelihood higher than the general likelihood calculated with the population allele frequencies.
Samples that are sequenced from the same genetic donor should have the same genotypes across SNPs. Conta provides a genotype concordance function to assist in sample swap analyses. The output of the concordance function is a value between 0 and 1. Samples with concordance values close to 1 (above 0.7 in cases where one of the samples may be contaminated) are considered to come from the same genetic donor.
Expand upon the following code to perform pairwise genotype concordance analyses:
```r
conta_gt1 <- load_conta_file("s3:/conta_runs/conta_1/conta_1.gt.tsv")
conta_gt2 <- load_conta_file("s3:/conta_runs/conta_2/conta_2.gt.tsv")
concordance <- genotype_concordance(conta_gt1, conta_gt2)
```
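The idea behind a concordance score can be sketched in base R as the fraction of shared SNPs whose genotype calls match. This is an illustrative toy, not conta's `genotype_concordance` implementation; the function `toy_concordance`, the column names `rsid` and `gt`, and the data are all assumptions made for this example.

```r
# Toy illustration only: fraction of shared SNPs whose genotype calls match.
# Column names (rsid, gt) and the data are made up for this sketch.
toy_concordance <- function(gt1, gt2) {
  # Keep only SNPs genotyped in both samples
  shared <- merge(gt1, gt2, by = "rsid", suffixes = c("_1", "_2"))
  if (nrow(shared) == 0) {
    return(NA_real_)
  }
  mean(shared$gt_1 == shared$gt_2)
}

sample_a <- data.frame(rsid = c("rs1", "rs2", "rs3", "rs4"),
                       gt = c("0/0", "0/1", "1/1", "0/1"),
                       stringsAsFactors = FALSE)
sample_b <- data.frame(rsid = c("rs1", "rs2", "rs3", "rs5"),
                       gt = c("0/0", "0/1", "0/1", "1/1"),
                       stringsAsFactors = FALSE)

score <- toy_concordance(sample_a, sample_b)  # 2 of 3 shared SNPs match
```

In this toy data the score falls below the 0.7 guideline mentioned above, so the two samples would not be treated as the same donor.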
The blackswan term is a threshold on the minimum probability that a given event (SNP) may contribute to the overall likelihood. Extremely rare events may get very low probabilities, and this threshold prevents one or a few artifactual signals from causing contamination calls. In other words, blackswan controls the depth of signal for each SNP.
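One way to picture such a threshold is as a floor applied to per-SNP probabilities before they are combined into a log-likelihood. The sketch below is a toy under that assumption and does not reproduce conta's internal calculation; the function `floored_log_lik`, the value 0.05, and the probabilities are hypothetical.

```r
# Toy illustration of a probability floor; not conta's internal code.
# 'blackswan' here is a hypothetical minimum per-SNP probability.
floored_log_lik <- function(probs, blackswan) {
  # pmax() raises any probability below the floor up to the floor
  sum(log(pmax(probs, blackswan)))
}

snp_probs <- c(0.4, 0.6, 1e-12)  # the last SNP looks like an artifact
raw <- sum(log(snp_probs))       # dominated by the single extreme SNP
capped <- floored_log_lik(snp_probs, blackswan = 0.05)
```

With the floor in place, the single extreme SNP contributes at most `log(0.05)` to the total instead of dominating it.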
A baseline error model (error rate for each locus) may optionally be provided; otherwise, the default is to calculate a generic per-sample substitution error model.
To detect contamination in bisulfite-converted data, one may use A>T and T>A SNPs as input (pre-filter the dbSNP file), which are unaffected by bisulfite conversion in CpG contexts. Strand-specific counts are also allowed, where each SNP is counted on a specific strand. See tests for an example.
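The pre-filtering step described above could look roughly like the following base R sketch, which keeps only A>T and T>A records from a table of SNPs. The data frame is made up, and its ID, REF, and ALT columns simply mirror the standard VCF column names; how the filtered set is then written back to a VCF for conta is not shown.

```r
# Sketch: keep only A>T and T>A SNPs from a table of dbSNP records.
# The records are invented; ID/REF/ALT mirror standard VCF columns.
snps <- data.frame(
  ID  = c("rs1", "rs2", "rs3", "rs4"),
  REF = c("A", "C", "T", "G"),
  ALT = c("T", "T", "A", "A"),
  stringsAsFactors = FALSE
)

# TRUE for rows whose REF/ALT pair is A>T or T>A
is_at_ta <- (snps$REF == "A" & snps$ALT == "T") |
  (snps$REF == "T" & snps$ALT == "A")
bisulfite_safe <- snps[is_at_ta, ]
```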
The current pregnancy metric can only detect a pregnancy with a male fetus (in a female host), by considering the partial presence of the Y chromosome. Y chromosome counts are provided by the biometrics tool. In their absence, this metric will be NA.