vignettes/distStats.md

title: "02. distStats" author: "Kristian K Ullrich" date: "2019-02-13" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{02. distStats} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc}

In this vignette Users learn how to use distIUPAC to calculate IUPAC distances based on IUPAC fasta format (IUPAC code) files.

Next to learn how to calculate a distance matrix for one sequence alignment also within population distances (xStats) and between populations distances (xyStats, xyMultiStats and xyzStats__) based on sliding windows will be shown.

Pre-requistes for obtaining fasta format files from reference mapped bam files

One pre-requistes is a fasta format alignment file (same length for all sequences) either in non IUPAC fasta format ('ACTG-N') or in IUPAC fasta format_ ('ACGTRYSWKMBDHV.-NX').

To get IUPAC fasta format files from reference mapped bam files see the vignette 01. bam2IUPAC vignette.

STEP 1: Loading

Using BWA for mapping:

The user needs to perform five steps:

  1. fastq files, prefernetially QC pre-processed (see trimmomatic for one of many trimming tools available)
  2. reference fasta file
  3. build index for the reference
  4. map reads
  5. sort reads
#FORWARD FASTQ FILE: 1.fq
#REVERSE FASTQ FILE: 2.fq
#REFERENCE: ref.fasta
#USED THREADS: 12
#SAMPLE NAME: IND1
#SAM OUTPUT FILE: ind1.sam
#build index
bwa index ref.fasta
#map reads
bwa mem -t 12 -M -R @RG\tID:IND1\tLB:lib1\tPL:ILLUMINA\tPU:unit1\tSM:IND1 -o ind1.sam ref.fasta 1.fq 2.fq
#sort sam file
java -jar picard.jar SortSam I=ind1.sam O=ind1.sorted.bam SO=coordinate

Using NextGenMap for mapping

The usere needs to perform four steps:

  1. fastq files, prefernetially QC pre-processed (see trimmomatic for one of many trimming tools available)
  2. reference fasta file
  3. map reads
  4. sort reads
#FORWARD FASTQ FILE: 1.fq
#REVERSE FASTQ FILE: 2.fq
#REFERENCE: ref.fasta
#USED THREADS: 12
#SAMPLE NAME: IND1
#SAM OUTPUT FILE: ind1.bam
#map reads
ngm -1 1.fq -2 2.fq -r ref.fasta -o ind1.bam --no-unal --sensitive -t 12 --no-progress --rg-id IND1 --rg-sm IND1 --rg-lb lib1 --rg-pl illumina --rg-pu unit1 -b
#sort bam file
java -jar picard.jar SortSam I=ind1.bam O=ind1.sorted.bam SO=coordinate

OPTIONAL STEP: Merge sorted bam files

In case that multiple fastq libraries exists or have been mapped without prior merging of the fastq files, the User can merge reference mapped bam files as follows:

java -jar picard.jar MergeSamFiles I=ind1_1.sorted.bam I=ind1_2.sorted.bam O=ind1.sorted.bam

STEP 2: Remove duplicates from sorted bam file

java -jar picard.jar RemoveDuplicates REMOVE_DUPLICATES=true I=ind1.sorted.bam O=ind1.sorted.nodup.bam M=ind1.sorted.bam.duplicate.metrics

STEP 3: Create IUPAC fasta file for the de-duplicated referenced mapped bam file

Chromosomes should be processed separately to be able to merge different samples into chromosome alignments, which can be processed with distIUPAC as follows:

#chromosomes should be processed separately to be able to easily merge different samples into one alignment to be processed with 'distIUPAC'
#chromosome 'chr1' will be processed here
samtools index ind1.sorted.nodup.bam
angsd -doFasta 4 -doCounts 1 -minQ 20 -minMapQ 30 -uniqueOnly -setMinDepth 5 -setMaxDepth 100 -iupacRatio 0.2 -i ind1.sorted.nodup.bam -out IND1.minQ20.minMapQ30.uniqueOnly.setMinDepth5.setMaxDepth100.chr1 -r chr1

STEP 4: Repeat STEP 1 to STEP 3 for multiple individuals

STEP 5: Merge samples into chromosome alignments

Assuming all IUPAC fasta files are located in the same folder and same chromosome files have the same ending, individuals can be merged as follows:

#chromosome 'chr1' from different individuals will be processed here
#1. uncompress fasta files
for file in *.chr1.fa.gz;do gunzip $file;done
#2. rename fasta sequences according to file names
for file in *.chr1.fa;do sed -i 's/>chr1/>'"$file"'/g' $file;done
#3. merge fasta files
for file in *.chr1.fa;do cat $file >> chr1.fa;done

Calculating IUPAC distances for 'chr1' with distIUPAC

The following commands needs to be executed in R:

library(distIUPAC)
dna<-readDNAStringSet("chr1.fa")
chr1.dist<-distIUPAC(as.character(subseq(dna,1,10000)))

References

Korneliussen TS, Albrechtsen A and Nielsen R. ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics (2014) 15:356 https://doi.org/10.1186/s12859-014-0356-4

Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics (2009) 25:1754 https://doi.org/10.1093/bioinformatics/btp324

Picard tools http://broadinstitute.github.io/picard

Sedlazeck FJ, Rescheneder P, von Haeseler A. NextGenMap: fast and accurate read mapping in highly polymorphic genomes. Bioinformatics (2013) 21:2790 https://doi.org/10.1093/bioinformatics/btt468



kullrich/distIUPAC documentation built on Jan. 9, 2020, 2:50 p.m.