distIUPAC: calculates IUPAC distances on Biostrings DNAStringSet and BStringSet objects

title: "02. distStats" author: "Kristian K Ullrich" date: "2019-02-13" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{02. distStats} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc}

In this vignette Users learn how to use distIUPAC to calculate IUPAC distances based on IUPAC fasta format (IUPAC code) files.

Next to learn how to calculate a distance matrix for one sequence alignment also within population distances (xStats) and between populations distances (xyStats, xyMultiStats and xyzStats__) based on sliding windows will be shown.

One pre-requistes is a fasta format alignment file (same length for all sequences) either in non IUPAC fasta format ('ACTG-N') or in IUPAC fasta format_ ('ACGTRYSWKMBDHV.-NX').

To get IUPAC fasta format files from reference mapped bam files see the vignette 01. bam2IUPAC vignette.

The user needs to perform five steps:

fastq files, prefernetially QC pre-processed (see trimmomatic for one of many trimming tools available)
reference fasta file
build index for the reference
map reads
sort reads

#FORWARD FASTQ FILE: 1.fq
#REVERSE FASTQ FILE: 2.fq
#REFERENCE: ref.fasta
#USED THREADS: 12
#SAMPLE NAME: IND1
#SAM OUTPUT FILE: ind1.sam
#build index
bwa index ref.fasta
#map reads
bwa mem -t 12 -M -R @RG\tID:IND1\tLB:lib1\tPL:ILLUMINA\tPU:unit1\tSM:IND1 -o ind1.sam ref.fasta 1.fq 2.fq
#sort sam file
java -jar picard.jar SortSam I=ind1.sam O=ind1.sorted.bam SO=coordinate

The usere needs to perform four steps:

fastq files, prefernetially QC pre-processed (see trimmomatic for one of many trimming tools available)
reference fasta file
map reads
sort reads

#FORWARD FASTQ FILE: 1.fq
#REVERSE FASTQ FILE: 2.fq
#REFERENCE: ref.fasta
#USED THREADS: 12
#SAMPLE NAME: IND1
#SAM OUTPUT FILE: ind1.bam
#map reads
ngm -1 1.fq -2 2.fq -r ref.fasta -o ind1.bam --no-unal --sensitive -t 12 --no-progress --rg-id IND1 --rg-sm IND1 --rg-lb lib1 --rg-pl illumina --rg-pu unit1 -b
#sort bam file
java -jar picard.jar SortSam I=ind1.bam O=ind1.sorted.bam SO=coordinate

In case that multiple fastq libraries exists or have been mapped without prior merging of the fastq files, the User can merge reference mapped bam files as follows:

java -jar picard.jar MergeSamFiles I=ind1_1.sorted.bam I=ind1_2.sorted.bam O=ind1.sorted.bam

java -jar picard.jar RemoveDuplicates REMOVE_DUPLICATES=true I=ind1.sorted.bam O=ind1.sorted.nodup.bam M=ind1.sorted.bam.duplicate.metrics

Chromosomes should be processed separately to be able to merge different samples into chromosome alignments, which can be processed with distIUPAC as follows:

#chromosomes should be processed separately to be able to easily merge different samples into one alignment to be processed with 'distIUPAC'
#chromosome 'chr1' will be processed here
samtools index ind1.sorted.nodup.bam
angsd -doFasta 4 -doCounts 1 -minQ 20 -minMapQ 30 -uniqueOnly -setMinDepth 5 -setMaxDepth 100 -iupacRatio 0.2 -i ind1.sorted.nodup.bam -out IND1.minQ20.minMapQ30.uniqueOnly.setMinDepth5.setMaxDepth100.chr1 -r chr1

Assuming all IUPAC fasta files are located in the same folder and same chromosome files have the same ending, individuals can be merged as follows:

#chromosome 'chr1' from different individuals will be processed here
#1. uncompress fasta files
for file in *.chr1.fa.gz;do gunzip $file;done
#2. rename fasta sequences according to file names
for file in *.chr1.fa;do sed -i 's/>chr1/>'"$file"'/g' $file;done
#3. merge fasta files
for file in *.chr1.fa;do cat $file >> chr1.fa;done

The following commands needs to be executed in R:

library(distIUPAC)
dna<-readDNAStringSet("chr1.fa")
chr1.dist<-distIUPAC(as.character(subseq(dna,1,10000)))

Korneliussen TS, Albrechtsen A and Nielsen R. ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics (2014) 15:356 https://doi.org/10.1186/s12859-014-0356-4

Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics (2009) 25:1754 https://doi.org/10.1093/bioinformatics/btp324

Picard tools http://broadinstitute.github.io/picard

Sedlazeck FJ, Rescheneder P, von Haeseler A. NextGenMap: fast and accurate read mapping in highly polymorphic genomes. Bioinformatics (2013) 21:2790 https://doi.org/10.1093/bioinformatics/btt468

kullrich/distIUPAC documentation built on Jan. 9, 2020, 2:50 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

kullrich/distIUPAC
calculates IUPAC distances on Biostrings DNAStringSet and BStringSet objects

vignettes/distStats.md
In kullrich/distIUPAC: calculates IUPAC distances on Biostrings DNAStringSet and BStringSet objects

Pre-requistes for obtaining fasta format files from reference mapped bam files

STEP 1: Loading

Using `BWA` for mapping:

Using `NextGenMap` for mapping

OPTIONAL STEP: Merge `sorted bam` files

STEP 2: Remove duplicates from `sorted bam` file

STEP 3: Create `IUPAC fasta` file for the de-duplicated `referenced mapped bam` file

STEP 4: Repeat STEP 1 to STEP 3 for multiple individuals

STEP 5: Merge samples into chromosome alignments

Calculating `IUPAC distances` for 'chr1' with `distIUPAC`

References

R Package Documentation

Browse R Packages

We want your feedback!

kullrich/distIUPAC calculates IUPAC distances on Biostrings DNAStringSet and BStringSet objects

vignettes/distStats.md In kullrich/distIUPAC: calculates IUPAC distances on Biostrings DNAStringSet and BStringSet objects

Pre-requistes for obtaining fasta format files from reference mapped bam files

STEP 1: Loading

Using BWA for mapping:

Using NextGenMap for mapping

OPTIONAL STEP: Merge sorted bam files

STEP 2: Remove duplicates from sorted bam file

STEP 3: Create IUPAC fasta file for the de-duplicated referenced mapped bam file

STEP 4: Repeat STEP 1 to STEP 3 for multiple individuals

STEP 5: Merge samples into chromosome alignments

Calculating IUPAC distances for 'chr1' with distIUPAC

References

R Package Documentation

Browse R Packages

We want your feedback!

kullrich/distIUPAC
calculates IUPAC distances on Biostrings DNAStringSet and BStringSet objects

vignettes/distStats.md
In kullrich/distIUPAC: calculates IUPAC distances on Biostrings DNAStringSet and BStringSet objects

Using `BWA` for mapping:

Using `NextGenMap` for mapping

OPTIONAL STEP: Merge `sorted bam` files

STEP 2: Remove duplicates from `sorted bam` file

STEP 3: Create `IUPAC fasta` file for the de-duplicated `referenced mapped bam` file

Calculating `IUPAC distances` for 'chr1' with `distIUPAC`