README.md

Contributors Forks Stargazers Issues MIT License LinkedIn

Logo

WadingPool

A shallow/low-pass whole-genome analaysis toolkit Explore the docs » View Demo · Report Bug · Request Feature

Table of Contents

  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact
  8. Acknowledgements

About The Project

A shallow/low-pass whole-genome analaysis toolkit, designed to do the following tasks: Identify genetic identity using dbSNP estimation Report QC metrics on alignment * Copy-number calling using QDNAseq or IchorCNA

Built With

Getting Started

WadingPool requires the use of R for analytic processes and Perl for preprocessing and file manipulation to be executed in a UNIX shell.

Prerequisites

The following are required tools and reference datasets needed to be installed on your system prior to running WadingPool:

^ Tool ^ Version ^ Purpose ^ | GATK | v4.0.5.1 | Calls the allelic counts at SNP sites | | MuTect | v1.1.5 | [Optional] Alternative to GATK to call variants at SNP sites | | Genome | hg19/hg38/mm38 | Reference genome | | Tabix | 0.2.6 | Works with the dbSNP VCF file to subset by chromosomes |

Installation

  1. Clone the repo sh devtools::install_github("quevedor2/WadingPool", ref='master') # Latest stable build devtools::install_github("quevedor2/WadingPool", ref='dev') # Development branch

dbSNP_Reference

A pre-built pipeline to set up the BED file from a dbSNP VCF file that is compatible with WadingPool can viewed in inst/setup/setup_dbSNP.sh (Currently: Only designed to work with hg19/hg38, not the mouse genome). The following steps outlines the pipeline to allow for user customization:

  1. Set up the dbSNP reference BED files
  2. dbSNP files are downloaded from ftp://ftp.ncbi.nih.gov/snp and are designed to work with the common_all category. According to dbSNP annotation, common_all refers to a subset of 00-All categorized as common (minor allele frequency >= 0.01 in at least one of 26 major populations, with at least two unrelated individuals having the minor allele) sh wget 'ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b146_GRCh37p13/VCF/common_all_20151104.vcf.gz' . wget 'ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b146_GRCh37p13/VCF/common_all_20151104.vcf.gz.tbi' .

  3. Separate the VCF file into individual chromosome subsets:

```sh module load tabix/0.2.6

vcffile='common_all_20180418.vcf.gz' mkdir chr_vcf chr_bed

## Use tabix to subset VCF into chromosomes zcat ${vcffile} | grep "^#" > header.txt for i in $(seq 1 1 22) X Y; do tabix ${vcffile} ${i} > tmp cat header.txt tmp > chr_vcf/chr${i}.vcf done rm tmp header.txt ```

  1. Use a custom perl script to create a BED file from the VCF files for SNPs that occupy only a single bp

```sh chrcol=0 poscol=1 rsidcol=2 refcol=3 infocol=7

rm chr_bed/all.${vcffile/.vcf.gz/}.bed time for i in $(seq 1 1 22) X Y; do grep -v "^#" chr_vcf/chr${i}.vcf |\ perl -ne 'my @lines = split("\t", $_); my $ref=$lines['${refcol}']; if(length($ref)==1){ my $chr=$lines['${chrcol}']; my $end=$lines['${poscol}']; my $rs=$lines['${rsidcol}']; my @caf=$lines['${infocol}'] =~ m/CAF=(.?),.?;/;

    print $chr,"\t",
          $end-1, "\t",
          $end, "\t",
          $rs, "\t";
    printf("%.3f\t.\t.\t.\n", 1-$caf[0]);
  }' \
> chr_bed/chr${i}.bed >> chr_bed/all.${vcffile/.vcf.gz/}.bed

done ```

Usage

This space highlights the main functions of the WadingPool package, simplified examples of how they are used, and a basic overview of how to interpret the data.

For more examples, please refer to the Documentation

Simulations

Heterozygous SNP simulations:

  1. Look at all SNPs from dbSNP with a MAF >= 0.01 (~7M SNPs)
  2. Use the MAF as a probability of a given SNP being heterozygous
  3. Use a Poisson distribution to simulate a given coverage for each SNP (i.e. for a coverage of 0.5x, lambda=0.5; rnorm(n=num_snps, lambda=0.5)
  4. Use a geometric expansion series to calculate the probability of obtaining coverage on both the Ref and Alt read for each SNP
  5. Calculate the expected value and 95% standard error for a heterozygous SNP across the entire genome
  6. Multiply the expected value and SE by the number of SNPs to get the Expected # of heterozygous SNPs in a genome for a given coverage
  7. Repeat process 2-6 for for multiple different coverages [seq(0, 3, by=0.1)]
  8. Fit a nonlinear asymptotic regression model to the estimated # of heterozygous SNPs per coverage

```R # Steps 2-5: Calculates expected value for heterozygous SNPs given coverage exp_hets <- calcExpectedHets(grch38_dbsnp, coverage=seq(0,3,by=0.1))

# Step 6: Calculates expected # of SNPs for a given coverage ## Number of Het.SNPs found in a single-sample samp1 <- getExpectedN(mu = exp_hets['mean',], se = exp_hets['se',], n = exp_hets['n',]) ## Number of Het.SNPs expected to found in common between two samples samp2 <- getExpectedN(mu = exp_hets['meansq',], se = exp_hets['se',], n = exp_hets['n',])

# Step 8: Fits an asymptotic regression model to the data ## singlesample=data.frame("cov"=as.numeric(rownames(samp2)), "SNPs"=samp1$n_mean)[-1,] single_sat <- saturationCurve(singlesample, pred = 1, # Predicting for 1x coverage S = 0.95, # Predicting Num. of HET SNPs at 0.95 saturation Xin = 1) # Predicting saturation for 1x coverage ```

Generate_SNP_Matrix

  1. User MUST call the SNP sites from the BAM file. Due to differences in user environments and HPC setups, I've included bash-scripts that I used in a SLURM and TORQUE workspace to call using either GATK (fast ~14min) or Mutect (slow ~24hrs):
  2. GATK4: runGATKCollectAllelicCount.sh
  3. MuTect: runMutect_sample.sh

  4. Use the genSnpZygosity() function to format either the tsv or vcf outputted from Step1 into a single-sample SNP file. This function utilizes the categorizeAD_GATK.sh script found in inst/bin/, that uses a custom Perl script to format the SNPs from the input data into either 0, 1, 2, 3 outputs.

  5. 0: No coverage - The SNP is covered by <2 reads
  6. 1: REF.HOMOZYGOUS - All reads support the reference allele
  7. 2: HETEROZYGOUS - The reads support both the REF and ALT allele
  8. 3: ALT.HOMOZYGOUS - All reads support the alternate allele

```R # Detect all samples outputted from GATK4 suffix <- 'allelicCounts.tsv' files <- list.files(PDIR, pattern=paste0(".", suffix, "$")) chrs <- 'all' samples <- gsub(paste0("^", unique(gsub("[0-9XY]", "", chrs)), ".?\.(.)", paste0(".", suffix, "$")), "\1", files)

# Call the SNP Zygosity snp_results <- lapply(samples, genSnpZygosity, caller='GATK', chrs=chrs) ```

  1. Use the aggregateAndFilter() function to combine all the samples outputted from Step 2 and store them in a single comma-separated matrix flat file. It will also use the getLinesFromFile.py perl script stored in inst/bin to identify and remove SNPs where fewer than 2 samples have enough coverage (0). It will do the same with the bed file from dbSNP to maintain the same genomic location mapping:

R s_paths <- file.path("tmp", gsub("[^a-zA-Z0-9]", "", samples), paste0("[ID].", samples, "_out.tsv")) agg_results <- aggregateAndFilter(s_paths, num_samples=2, chrs=chrs, dbsnp_file='all.common_all_20151104.bed',

  1. Output data should now be stored in: ── tmp # Main output folder ├── combined │   ├── all.snps # Unfiltered SNPs │   ├── filt │   │   ├── all_filt.snps # Filtered SNPs │   │   ├── all_pos.snps # Filtered dbSNP BED file │   └── idx │   ├── all_lines.txt # Index of SNPs to remove

and contains two main files [CHR]_filt.snps: 1,0,1,0 1,0,1,0 1,0,1,0 1,0,0,1 and [CHR]_pos.snps: 1 51478 51479 rs116400033 0.128 . . . 1 51746 51747 rs567078550 0.002 . . . 1 51750 51751 rs538930828 0.002 . . . 1 54638 54639 rs540298864 0.001 . . .

Genotype_Similarity

  1. Work in progress: R getSampleSimilarity(filt_dir, samples)

Roadmap

See the open issues for a list of proposed features (and known issues).

Contributing

Contributions are what make the open source community such an amazing place to be learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Rene Quevedo - quever88@gmail.com

Project Link: https://github.com/quevedor2/WadingPool

Acknowledgements



quevedor2/WadingPool documentation built on Dec. 22, 2021, 10:59 a.m.