README.md
In grailbio/conta: Detect cross-contamination

The following scripts are used to run the conta toolset:

First, install the conta library (run from outside the conta folder): R CMD INSTALL --preclean --no-multiarch --with-keep.source conta

The full dbSNP file may be downloaded from: ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/common_all_20180423.vcf.gz

A TSV or pileup file (containing allele counts for each SNP) is used as input, along with a dbSNP reference VCF file, to call contamination events. The analysis reports contamination calls, levels, and plots. It also displays CNV metrics and bin counts for the Y chromosome if the corresponding files are provided.

Input files:
- The dbSNP file must contain the CAF info field and rsid.
- TSV files must contain chr, pos, and counts for each allele. Two pileup formats are supported; see the example inputs under the test folders.
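
The CAF requirement can be checked before a run. The following is a minimal base-R sketch and not part of conta: `extract_caf` is a hypothetical helper that pulls the comma-separated allele frequencies out of a VCF INFO string, and the INFO strings are made up for illustration. It assumes the dbSNP convention that CAF lists the reference allele frequency first and may use `.` for a missing value.

```r
# Hypothetical helper (not a conta function): extract CAF values from one
# VCF INFO string. Returns NULL when the record has no CAF field.
extract_caf <- function(info) {
  fields <- strsplit(info, ";", fixed = TRUE)[[1]]
  caf <- fields[startsWith(fields, "CAF=")]
  if (length(caf) == 0) {
    return(NULL)
  }
  values <- strsplit(sub("^CAF=", "", caf[1]), ",", fixed = TRUE)[[1]]
  # "." marks a missing frequency; map it to NA instead of failing
  values[values == "."] <- NA
  as.numeric(values)
}

# Made-up INFO strings for illustration
with_caf <- "RS=123;VC=SNV;CAF=0.75,0.25;COMMON=1"
without_caf <- "RS=456;VC=SNV"

caf_values <- extract_caf(with_caf)             # reference first, then alternate
has_no_caf <- is.null(extract_caf(without_caf)) # TRUE: record lacks CAF
```

Records for which the helper returns NULL would be candidates to drop when pre-filtering a dbSNP file.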

This mode requires a set of samples that were already run with the conta analysis. It uses the genotypes that conta calculated for each sample to find samples that have a likelihood higher than the general likelihood calculated with the population allele frequencies.

Samples that are sequenced from the same genetic donor should have the same genotypes across SNPs. Conta provides a genotype concordance function to assist in sample swap analyses. The output of the concordance function is a value between 0 and 1. Pairs with concordance values close to 1 (above 0.7 in cases where one of the samples may be contaminated) are considered to come from the same genetic donor.

Expand upon the following code to perform pairwise genotype concordance analyses:

conta_gt1 <- load_conta_file("s3:/conta_runs/conta_1/conta_1.gt.tsv")
conta_gt2 <- load_conta_file("s3:/conta_runs/conta_2/conta_2.gt.tsv")
concordance <- genotype_concordance(conta_gt1, conta_gt2)
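
One possible way to expand this to more than two samples is to fill a matrix with the concordance of every pair. The sketch below is self-contained base R under stated assumptions: `toy_concordance` is a made-up stand-in and not conta's `genotype_concordance`, the `rsid` and `gt` column names and the genotype encoding are assumptions, and the data are invented.

```r
# Toy stand-in for a concordance measure (NOT conta's genotype_concordance):
# fraction of shared SNPs, matched on an assumed `rsid` column, whose
# assumed `gt` genotype column is identical in both samples.
toy_concordance <- function(gt1, gt2) {
  merged <- merge(gt1, gt2, by = "rsid", suffixes = c("_1", "_2"))
  if (nrow(merged) == 0) {
    return(NA_real_)
  }
  mean(merged$gt_1 == merged$gt_2)
}

# Apply a concordance function to every pair in a named list of samples
pairwise_concordance <- function(samples, concordance_fn) {
  n <- length(samples)
  result <- matrix(NA_real_, n, n,
                   dimnames = list(names(samples), names(samples)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      result[i, j] <- concordance_fn(samples[[i]], samples[[j]])
    }
  }
  result
}

# Made-up genotypes for three samples; s1 and s2 mimic the same donor
rsids <- c("rs1", "rs2", "rs3", "rs4")
samples <- list(
  s1 = data.frame(rsid = rsids, gt = c("0/0", "0/1", "1/1", "0/1"),
                  stringsAsFactors = FALSE),
  s2 = data.frame(rsid = rsids, gt = c("0/0", "0/1", "1/1", "0/1"),
                  stringsAsFactors = FALSE),
  s3 = data.frame(rsid = rsids, gt = c("0/1", "0/0", "1/1", "1/1"),
                  stringsAsFactors = FALSE)
)

concordance_matrix <- pairwise_concordance(samples, toy_concordance)

# Flag pairs above the 0.7 threshold as candidate same-donor pairs
same_donor <- concordance_matrix > 0.7
```

With conta installed, the same loop should in principle accept `genotype_concordance` as `concordance_fn`, with each list element loaded through `load_conta_file`.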

| Output file | Description |
| --- | --- |
| `.conta.tsv` | Contamination quantification main output |
| `.bin.lr.png` | Likelihood ratios for each chromosomal region |
| `.bin.lr.loh.png` | Likelihood ratios for each chromosomal region (non-LOH) |
| `.depth.png` | Depth plot before SNP filtering |
| `.filtered.depth.png` | Depth plot after SNP filtering |
| `.error.tsv` | Substitution error model |
| `.gt.tsv` | Genotype calls |
| `.gt.loh.tsv` | Genotype calls with LOH regions removed |
| `.likelihood.png` | Conta maximum likelihood curve |
| `.log.txt` | Log file |
| `.loh_regions.tsv` | LOH stats for each chromosomal region |
| `.per_bin.tsv` | Stats for each chromosomal region |
| `.per_bin.loh.tsv` | Stats for each chromosomal region (non-LOH) |
| `.per_chr.tsv` | Stats for each chromosome |
| `.per_chr.loh.tsv` | Stats for each chromosome (non-LOH) |
| `.vfn.cp.png` | Variant frequency (negated) vs. contamination |
| `.vr.png` | Variant ratios (across sorted locations to visualize LOH) |

| Column | Description |
| --- | --- |
| `conta_version` | Version of conta that was used |
| `conta_call` | Contamination call (tests whether avg_log_lr passes a threshold) |
| `cf` | Contamination fraction; ignore if conta_call = FALSE |
| `sum_log_lr` | Sum of log likelihood ratios across SNPs |
| `avg_log_lr` | Average of log likelihood ratios across SNPs |
| `snps` | Number of SNPs considered |
| `depth` | Mean number of (paired) reads per SNP |
| `pos_lr_all` | Fraction of SNPs with a positive likelihood ratio |
| `pos_lr_x` | Fraction of SNPs with a positive likelihood ratio on the X chromosome |
| `pos_lr_chr_cv` | Coefficient of variation of the fraction of SNPs with a positive likelihood ratio |
| `y_count` | Fraction of positions on the Y chromosome with at least 1 read |
| `pregnancy` | Pregnancy call (currently only for male pregnancy, if Y chromosome counts are available) |
| `excluded_regions` | Number of chromosomal regions excluded due to LOH |
| `error_rate` | Average substitution error rate per base |
| `T>A` | Average T-to-A substitution error rate per base |
| `G>A` | Average G-to-A substitution error rate per base |
| `C>A` | Average C-to-A substitution error rate per base |
| `A>T` | Average A-to-T substitution error rate per base |
| `G>T` | Average G-to-T substitution error rate per base |
| `C>T` | Average C-to-T substitution error rate per base |
| `A>G` | Average A-to-G substitution error rate per base |
| `T>G` | Average T-to-G substitution error rate per base |
| `C>G` | Average C-to-G substitution error rate per base |
| `A>C` | Average A-to-C substitution error rate per base |
| `T>C` | Average T-to-C substitution error rate per base |
| `G>C` | Average G-to-C substitution error rate per base |
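
Because cf is only meaningful when conta_call is TRUE, downstream code may want to mask it otherwise. The following base-R sketch shows one way to do that. It writes a made-up table to a temporary file so it can run on its own; the `sample` column and all values are assumptions for illustration, and a real `.conta.tsv` file carries the full set of columns listed above.

```r
# Write a made-up two-sample table to a temporary file so the sketch is
# self-contained; real runs would point at an existing .conta.tsv file.
toy <- data.frame(
  sample = c("sample_a", "sample_b"),  # hypothetical identifier column
  conta_call = c(TRUE, FALSE),
  cf = c(0.012, 0.0004),
  avg_log_lr = c(0.35, -0.02),
  stringsAsFactors = FALSE
)
path <- tempfile(fileext = ".conta.tsv")
write.table(toy, path, sep = "\t", quote = FALSE, row.names = FALSE)

# check.names = FALSE keeps column names such as "T>A" intact
results <- read.delim(path, check.names = FALSE, stringsAsFactors = FALSE)

# cf should be ignored when conta_call is FALSE, so mask it with NA
results$cf[!results$conta_call] <- NA

# Keep only the samples that were called contaminated
contaminated <- results[results$conta_call, ]
```

Masking with NA rather than 0 keeps an ignored estimate from being mistaken for a measured contamination fraction of zero.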

The blackswan term is a threshold on the minimum probability that a given event (SNP) may contribute to the overall likelihood. Extremely rare events may get very low probabilities, and this threshold prevents one or a few artifactual signals from causing contamination calls. In other words, blackswan controls the depth of signal for each SNP.
A baseline error model (error rate for each locus) may optionally be provided; otherwise, the default is to calculate a generic per-sample substitution error model.
To detect contamination in bisulfite-converted data, one may use A>T and T>A SNPs as input (pre-filter the dbSNP file), which are unaffected by bisulfite conversion in CpG contexts. Strand-specific counts, where each SNP is counted on a specific strand, are also allowed. See the tests for an example.
The current pregnancy metric can only detect male pregnancy (for a female host) by considering the presence of a partial Y chromosome. Y chromosome counts are provided by the biometrics tool. In its absence, this metric will be NA.
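
One way to picture the blackswan term is as a floor applied to each SNP's probability before the log likelihoods are summed. The sketch below illustrates that reading in base R. It is not conta's actual implementation; the function name, the probabilities, and the floor value are all made up.

```r
# Hypothetical helper: sum log probabilities after flooring each one, so a
# single near-zero probability cannot dominate the total
log_lik_with_floor <- function(probs, floor_prob) {
  sum(log(pmax(probs, floor_prob)))
}

# Made-up per-SNP probabilities; the last one is an extreme outlier
probs <- c(0.9, 0.8, 0.95, 1e-12)

unfloored <- sum(log(probs))                             # driven by the outlier
floored <- log_lik_with_floor(probs, floor_prob = 0.05)  # outlier capped at 0.05
```

In this toy example, the floored total is much closer to zero than the unfloored one, which is the sense in which a single artifactual SNP is kept from driving the result.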