A shallow/low-pass whole-genome analysis toolkit, designed to do the following tasks:
* Identify genetic identity using dbSNP estimation
* Report QC metrics on alignment
* Copy-number calling using QDNAseq or IchorCNA
WadingPool uses R for analytic processes and Perl for preprocessing and file manipulation, executed in a UNIX shell.
The following tools and reference datasets must be installed on your system before running WadingPool:
| Tool | Version | Purpose |
|------|---------|---------|
| GATK | v4.0.5.1 | Calls the allelic counts at SNP sites |
| MuTect | v1.1.5 | [Optional] Alternative to GATK to call variants at SNP sites |
| Genome | hg19/hg38/mm38 | Reference genome |
| Tabix | 0.2.6 | Subsets the dbSNP VCF file by chromosome |
```sh
# human_9606_b146_GRCh37p13
# dbSNP b146; GRCh37
# Summary of VCF Files (http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/):
#   All        - all human rs except those listed above, without restriction by clinical
#                significance. This file pairs with the All_papu file below.
#   common_all - the subset of 00-All categorized as common (minor allele frequency >= 0.01
#                in at least one of 26 major populations, with at least two unrelated
#                individuals having the minor allele), as described below
REFPATH='/path/to/referenceDir'
wget -P ${REFPATH} 'ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b146_GRCh37p13/VCF/common_all_20151104.vcf.gz'
wget -P ${REFPATH} 'ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b146_GRCh37p13/VCF/common_all_20151104.vcf.gz.tbi'
```
```R
devtools::install_github("quevedor2/WadingPool", ref='master') # Latest stable build
devtools::install_github("quevedor2/WadingPool", ref='dev')    # Development branch
```
A pre-built pipeline to set up the BED file from a dbSNP VCF file compatible with WadingPool can be viewed in `inst/setup/setup_dbSNP.sh` (currently only designed to work with hg19/hg38, not the mouse genome). The following steps outline the pipeline to allow for user customization:
dbSNP files are downloaded from ftp://ftp.ncbi.nih.gov/snp and are designed to work with the `common_all` category. According to the dbSNP annotation, `common_all` refers to the subset of 00-All categorized as common (minor allele frequency >= 0.01 in at least one of 26 major populations, with at least two unrelated individuals having the minor allele).
```sh
wget 'ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b146_GRCh37p13/VCF/common_all_20151104.vcf.gz'
wget 'ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b146_GRCh37p13/VCF/common_all_20151104.vcf.gz.tbi'
```
Separate the VCF file into individual chromosome subsets:
```sh
module load tabix/0.2.6

vcffile='common_all_20180418.vcf.gz'
mkdir chr_vcf chr_bed

## Use tabix to subset the VCF into chromosomes
zcat ${vcffile} | grep "^#" > header.txt
for i in $(seq 1 1 22) X Y; do
  tabix ${vcffile} ${i} > tmp
  cat header.txt tmp > chr_vcf/chr${i}.vcf
done
rm tmp header.txt
```
Convert each chromosome VCF into a BED file that records the SNP position, rsID, and minor allele frequency (1 - CAF):

```sh
chrcol=0
poscol=1
rsidcol=2
refcol=3
infocol=7

rm -f chr_bed/all.${vcffile/.vcf.gz/}.bed
time for i in $(seq 1 1 22) X Y; do
  grep -v "^#" chr_vcf/chr${i}.vcf |\
  perl -ne 'my @lines = split("\t", $_);
    my $ref=$lines['${refcol}'];
    if(length($ref)==1){
      my $chr=$lines['${chrcol}'];
      my $end=$lines['${poscol}'];
      my $rs=$lines['${rsidcol}'];
      my @caf=$lines['${infocol}'] =~ m/CAF=(.*?),.*?;/;
      print $chr, "\t",
            $end-1, "\t",
            $end, "\t",
            $rs, "\t";
      printf("%.3f\t.\t.\t.\n", 1-$caf[0]);
    }' > chr_bed/chr${i}.bed
  cat chr_bed/chr${i}.bed >> chr_bed/all.${vcffile/.vcf.gz/}.bed
done
```
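As a quick sanity check on the minor-allele-frequency column, here is the same 1 - CAF calculation done in R; the CAF values below are illustrative, not taken from a specific dbSNP record:

```R
# Worked example of column 5 written by the Perl one-liner above.
# CAF lists the reference-allele frequency first; the BED file stores
# 1 minus that value, formatted to three decimals.
caf <- c(0.872, 0.128)
sprintf("%.3f", 1 - caf[1])   # "0.128"
```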
This space highlights the main functions of the WadingPool package, simplified examples of how they are used, and a basic overview of how to interpret the data.
For more examples, please refer to the Documentation
```R
library(WadingPool)

# Steps 2-5: Calculates the expected value for heterozygous SNPs given coverage
exp_hets <- calcExpectedHets(grch38_dbsnp, coverage=seq(0, 3, by=0.1))

# Step 6: Calculates the expected # of SNPs for a given coverage
## Number of Het.SNPs found in a single sample
samp1 <- getExpectedN(mu = exp_hets['mean',],
                      se = exp_hets['se',],
                      n  = exp_hets['n',])
## Number of Het.SNPs expected to be found in common between two samples
samp2 <- getExpectedN(mu = exp_hets['meansq',],
                      se = exp_hets['se',],
                      n  = exp_hets['n',])

# Step 8: Fits an asymptotic regression model to the data
singlesample <- data.frame("cov"=as.numeric(rownames(samp2)),
                           "SNPs"=samp1$n_mean)[-1,]
single_sat <- saturationCurve(singlesample,
                              pred = 1,   # Predicting for 1x coverage
                              S = 0.95,   # Predicting Num. of HET SNPs at 0.95 saturation
                              Xin = 1)    # Predicting saturation for 1x coverage
```
Allelic counts at dbSNP sites are first called with either GATK (fast, ~14 min) or MuTect (slow, ~24 hrs); for MuTect, use the `runMutect_sample.sh` script.
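Below is a minimal sketch of the GATK route, assuming GATK4's `CollectAllelicCounts` is used to produce the per-sample `*.allelicCounts.tsv` files read in the next step; the file names and paths are placeholders, and the repository's own scripts may differ:

```R
# Hypothetical GATK4 CollectAllelicCounts call invoked from R; the BAM,
# reference, and dbSNP site paths are placeholders to adapt to your own data.
bam   <- "sample1.bam"
ref   <- "GRCh37.fasta"
sites <- "common_all_20151104.vcf.gz"       # dbSNP sites prepared above
out   <- "all.sample1.allelicCounts.tsv"    # suffix expected by the code below

system2("gatk", c("CollectAllelicCounts",
                  "-I", bam, "-R", ref, "-L", sites, "-O", out))
```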
Use the `genSnpZygosity()` function to format either the `tsv` or `vcf` output from Step 1 into a single-sample SNP file. This function uses the `categorizeAD_GATK.sh` script found in `inst/bin/`, which uses a custom Perl script to format the SNPs from the input data into 0, 1, 2, or 3 outputs.
```R
# Detect all samples outputted from GATK4
# PDIR is the directory containing the GATK allelic-count files
suffix  <- 'allelicCounts.tsv'
files   <- list.files(PDIR, pattern=paste0(".", suffix, "$"))
chrs    <- 'all'
samples <- gsub(paste0("^", unique(gsub("[0-9XY]", "", chrs)),
                       ".*?\\.(.*)", paste0(".", suffix, "$")),
                "\\1", files)

# Call the SNP Zygosity
snp_results <- lapply(samples, genSnpZygosity, caller='GATK', chrs=chrs)
```
Use the `aggregateAndFilter()` function to combine all the samples output from Step 2 and store them in a single comma-separated matrix flat file. It also uses the `getLinesFromFile.py` script stored in `inst/bin` to identify and remove SNPs where fewer than 2 samples have coverage (0). It does the same with the dbSNP BED file to maintain the same genomic location mapping:

```R
s_paths <- file.path("tmp", gsub("[^a-zA-Z0-9]", "", samples),
                     paste0("[ID].", samples, "_out.tsv"))
agg_results <- aggregateAndFilter(s_paths, num_samples=2, chrs=chrs,
                                  dbsnp_file='all.common_all_20151104.bed')
```
```
tmp                          # Main output folder
└── combined
    ├── all.snps             # Unfiltered SNPs
    ├── filt
    │   ├── all_filt.snps    # Filtered SNPs
    │   └── all_pos.snps     # Filtered dbSNP BED file
    └── idx
        └── all_lines.txt    # Index of SNPs to remove
```
The filtered output in `tmp/combined/filt` contains two main files, `[CHR]_filt.snps`:
```
1,0,1,0
1,0,1,0
1,0,1,0
1,0,0,1
```
and `[CHR]_pos.snps`:
```
1   51478   51479   rs116400033   0.128   .   .   .
1   51746   51747   rs567078550   0.002   .   .   .
1   51750   51751   rs538930828   0.002   .   .   .
1   54638   54639   rs540298864   0.001   .   .   .
```
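For a quick look at these outputs in R, the two files can be read side by side; the paths follow the directory tree above, and the column names assigned to the BED file are assumptions for readability, not names used by the package:

```R
# Read the filtered genotype matrix (one column per sample, one row per SNP)
# and the matching dbSNP BED file; the rows of the two files line up.
filt_dir <- file.path("tmp", "combined", "filt")
snps <- read.csv(file.path(filt_dir, "all_filt.snps"), header=FALSE)
pos  <- read.table(file.path(filt_dir, "all_pos.snps"), header=FALSE,
                   col.names=c("chr", "start", "end", "rsid", "maf", "V6", "V7", "V8"))
head(cbind(pos[, c("rsid", "maf")], snps))
```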
Finally, use `getSampleSimilarity()` to compare SNP zygosity between samples using the filtered output:

```R
getSampleSimilarity(filt_dir, samples)
```
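As an intuition for what this similarity captures, one can compare how often two samples share a SNP call in the filtered matrix read above; this is only an illustration and not necessarily the statistic `getSampleSimilarity()` reports:

```R
# Illustrative pairwise overlap on the 0/1 matrix: for each pair of sample
# columns, the fraction of SNPs called in both samples among SNPs called in
# either one (a Jaccard-style overlap). Related samples share more SNP calls.
pair_overlap <- function(a, b) sum(a == 1 & b == 1) / sum(a == 1 | b == 1)
outer(seq_len(ncol(snps)), seq_len(ncol(snps)),
      Vectorize(function(i, j) pair_overlap(snps[[i]], snps[[j]])))
```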
See the open issues for a list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.
Rene Quevedo - quever88@gmail.com
Project Link: https://github.com/quevedor2/WadingPool