low-pass whole genome sequencing analysis

A shallow/low-pass whole-genome analaysis toolkit Explore the docs » View Demo · Report Bug · Request Feature

About The Project
- Built With
Getting Started
Usage
Roadmap
Contributing
License
Contact
Acknowledgements

A shallow/low-pass whole-genome analaysis toolkit, designed to do the following tasks: Identify genetic identity using dbSNP estimation Report QC metrics on alignment * Copy-number calling using QDNAseq or IchorCNA

WadingPool requires the use of R for analytic processes and Perl for preprocessing and file manipulation to be executed in a UNIX shell.

The following are required tools and reference datasets needed to be installed on your system prior to running WadingPool:

dbSNP bed file: The dbSNP file indicates which SNP sites of a given MAF (>0.01) should be used for zygosity/genomic identity analysis. sh # human_9606_b146_GRCh37p13 # dbSNP b146; GRCh37 # Summary of VCF Files (http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/): # All - all human rs in except those listed above, without restriction by clinical significance. This file pairs with All_papu file below. # common_all - The subset of 00-All categorized as common (minor allele frequency >= 0.01 in at least one of 26 major populations, with at least two unrelated individuals having the minor allele) as described below REFPATH='/path/to/referenceDir' wget 'ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b146_GRCh37p13/VCF/common_all_20151104.vcf.gz' ${REFPATH} wget 'ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b146_GRCh37p13/VCF/common_all_20151104.vcf.gz.tbi' ${REFPATH}

Clone the repo sh devtools::install_github("quevedor2/WadingPool", ref='master') # Latest stable build devtools::install_github("quevedor2/WadingPool", ref='dev') # Development branch

A pre-built pipeline to set up the BED file from a dbSNP VCF file that is compatible with WadingPool can viewed in inst/setup/setup_dbSNP.sh (Currently: Only designed to work with hg19/hg38, not the mouse genome). The following steps outlines the pipeline to allow for user customization:

Set up the dbSNP reference BED files
dbSNP files are downloaded from ftp://ftp.ncbi.nih.gov/snp and are designed to work with the common_all category. According to dbSNP annotation, common_all refers to a subset of 00-All categorized as common (minor allele frequency >= 0.01 in at least one of 26 major populations, with at least two unrelated individuals having the minor allele) sh wget 'ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b146_GRCh37p13/VCF/common_all_20151104.vcf.gz' . wget 'ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b146_GRCh37p13/VCF/common_all_20151104.vcf.gz.tbi' .
Separate the VCF file into individual chromosome subsets:

```sh module load tabix/0.2.6

vcffile='common_all_20180418.vcf.gz' mkdir chr_vcf chr_bed

## Use tabix to subset VCF into chromosomes zcat ${vcffile} | grep "^#" > header.txt for i in $(seq 1 1 22) X Y; do tabix ${vcffile} ${i} > tmp cat header.txt tmp > chr_vcf/chr${i}.vcf done rm tmp header.txt ```

Use a custom perl script to create a BED file from the VCF files for SNPs that occupy only a single bp

```sh chrcol=0 poscol=1 rsidcol=2 refcol=3 infocol=7

rm chr_bed/all.${vcffile/.vcf.gz/}.bed time for i in $(seq 1 1 22) X Y; do grep -v "^#" chr_vcf/chr${i}.vcf |\ perl -ne 'my @lines = split("\t", $_); my $ref=$lines['${refcol}']; if(length($ref)==1){ my $chr=$lines['${chrcol}']; my $end=$lines['${poscol}']; my $rs=$lines['${rsidcol}']; my @caf=$lines['${infocol}'] =~ m/CAF=(.?),.?;/;

    print $chr,"\t",
          $end-1, "\t",
          $end, "\t",
          $rs, "\t";
    printf("%.3f\t.\t.\t.\n", 1-$caf[0]);
  }' \
> chr_bed/chr${i}.bed >> chr_bed/all.${vcffile/.vcf.gz/}.bed

done ```

This space highlights the main functions of the WadingPool package, simplified examples of how they are used, and a basic overview of how to interpret the data.

For more examples, please refer to the Documentation

Heterozygous SNP simulations:

Look at all SNPs from dbSNP with a MAF >= 0.01 (~7M SNPs)
Use the MAF as a probability of a given SNP being heterozygous
Use a Poisson distribution to simulate a given coverage for each SNP (i.e. for a coverage of 0.5x, lambda=0.5; rnorm(n=num_snps, lambda=0.5)
Use a geometric expansion series to calculate the probability of obtaining coverage on both the Ref and Alt read for each SNP
Calculate the expected value and 95% standard error for a heterozygous SNP across the entire genome
Multiply the expected value and SE by the number of SNPs to get the Expected # of heterozygous SNPs in a genome for a given coverage
Repeat process 2-6 for for multiple different coverages [seq(0, 3, by=0.1)]
Fit a nonlinear asymptotic regression model to the estimated # of heterozygous SNPs per coverage

```R # Steps 2-5: Calculates expected value for heterozygous SNPs given coverage exp_hets <- calcExpectedHets(grch38_dbsnp, coverage=seq(0,3,by=0.1))

# Step 6: Calculates expected # of SNPs for a given coverage ## Number of Het.SNPs found in a single-sample samp1 <- getExpectedN(mu = exp_hets['mean',], se = exp_hets['se',], n = exp_hets['n',]) ## Number of Het.SNPs expected to found in common between two samples samp2 <- getExpectedN(mu = exp_hets['meansq',], se = exp_hets['se',], n = exp_hets['n',])

# Step 8: Fits an asymptotic regression model to the data ## singlesample=data.frame("cov"=as.numeric(rownames(samp2)), "SNPs"=samp1$n_mean)[-1,] single_sat <- saturationCurve(singlesample, pred = 1, # Predicting for 1x coverage S = 0.95, # Predicting Num. of HET SNPs at 0.95 saturation Xin = 1) # Predicting saturation for 1x coverage ```

User MUST call the SNP sites from the BAM file. Due to differences in user environments and HPC setups, I've included bash-scripts that I used in a SLURM and TORQUE workspace to call using either GATK (fast ~14min) or Mutect (slow ~24hrs):
GATK4: runGATKCollectAllelicCount.sh
MuTect: runMutect_sample.sh
Use the genSnpZygosity() function to format either the tsv or vcf outputted from Step1 into a single-sample SNP file. This function utilizes the categorizeAD_GATK.sh script found in inst/bin/, that uses a custom Perl script to format the SNPs from the input data into either 0, 1, 2, 3 outputs.
0: No coverage - The SNP is covered by <2 reads
1: REF.HOMOZYGOUS - All reads support the reference allele
2: HETEROZYGOUS - The reads support both the REF and ALT allele
3: ALT.HOMOZYGOUS - All reads support the alternate allele

```R # Detect all samples outputted from GATK4 suffix <- 'allelicCounts.tsv' files <- list.files(PDIR, pattern=paste0(".", suffix, "$")) chrs <- 'all' samples <- gsub(paste0("^", unique(gsub("[0-9XY]", "", chrs)), ".?\.(.)", paste0(".", suffix, "$")), "\1", files)

# Call the SNP Zygosity snp_results <- lapply(samples, genSnpZygosity, caller='GATK', chrs=chrs) ```

Use the aggregateAndFilter() function to combine all the samples outputted from Step 2 and store them in a single comma-separated matrix flat file. It will also use the getLinesFromFile.py perl script stored in inst/bin to identify and remove SNPs where fewer than 2 samples have enough coverage (0). It will do the same with the bed file from dbSNP to maintain the same genomic location mapping:

R s_paths <- file.path("tmp", gsub("[^a-zA-Z0-9]", "", samples), paste0("[ID].", samples, "_out.tsv")) agg_results <- aggregateAndFilter(s_paths, num_samples=2, chrs=chrs, dbsnp_file='all.common_all_20151104.bed',

Output data should now be stored in: ── tmp # Main output folder ├── combined │ ├── all.snps # Unfiltered SNPs │ ├── filt │ │ ├── all_filt.snps # Filtered SNPs │ │ ├── all_pos.snps # Filtered dbSNP BED file │ └── idx │ ├── all_lines.txt # Index of SNPs to remove

and contains two main files [CHR]_filt.snps: 1,0,1,0 1,0,1,0 1,0,1,0 1,0,0,1 and [CHR]_pos.snps: 1 51478 51479 rs116400033 0.128 . . . 1 51746 51747 rs567078550 0.002 . . . 1 51750 51751 rs538930828 0.002 . . . 1 54638 54639 rs540298864 0.001 . . .

Work in progress: R getSampleSimilarity(filt_dir, samples)

See the open issues for a list of proposed features (and known issues).

Contributions are what make the open source community such an amazing place to be learn, inspire, and create. Any contributions you make are greatly appreciated.

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request

Distributed under the MIT License. See LICENSE for more information.

Rene Quevedo - quever88@gmail.com

Project Link: https://github.com/quevedor2/WadingPool

quevedor2/WadingPool documentation built on Dec. 22, 2021, 10:59 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

quevedor2/WadingPool
A toolkit for shallow/low-pass whole genome sequencing analysis

README.md
In quevedor2/WadingPool: A toolkit for shallow/low-pass whole genome sequencing analysis

WadingPool

Table of Contents

About The Project

Built With

Getting Started

Prerequisites

Installation

dbSNP_Reference

Usage

Simulations

Heterozygous SNP simulations:

Generate_SNP_Matrix

Genotype_Similarity

Roadmap

Contributing

License

Contact

Acknowledgements

R Package Documentation

Browse R Packages

We want your feedback!

quevedor2/WadingPool A toolkit for shallow/low-pass whole genome sequencing analysis

README.md In quevedor2/WadingPool: A toolkit for shallow/low-pass whole genome sequencing analysis

WadingPool

Table of Contents

About The Project

Built With

Getting Started

Prerequisites

Installation

dbSNP_Reference

Usage

Simulations

Heterozygous SNP simulations:

Generate_SNP_Matrix

Genotype_Similarity

Roadmap

Contributing

License

Contact

Acknowledgements

R Package Documentation

Browse R Packages

We want your feedback!

quevedor2/WadingPool
A toolkit for shallow/low-pass whole genome sequencing analysis

README.md
In quevedor2/WadingPool: A toolkit for shallow/low-pass whole genome sequencing analysis