title: "02. distStats" author: "Kristian K Ullrich" date: "2019-02-13" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{02. distStats} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc}
In this vignette Users learn how to use distIUPAC to calculate IUPAC distances
based on IUPAC fasta format
(IUPAC code) files.
Next to learn how to calculate a distance matrix for one sequence alignment also within population distances (xStats
) and between populations distances (xyStats
, xyMultiStats
and xyzStats
__) based on sliding windows will be shown.
One pre-requistes is a fasta format alignment file (same length for all sequences) either in non IUPAC fasta format
('ACTG-N') or in IUPAC fasta format
_ ('ACGTRYSWKMBDHV.-NX').
To get IUPAC fasta format
files from reference mapped bam
files see the vignette 01. bam2IUPAC vignette.
BWA
for mapping:The user needs to perform five steps:
fastq
files, prefernetially QC
pre-processed (see trimmomatic for one of many trimming tools available)fasta
filebuild index
for the referencemap reads
sort reads
#FORWARD FASTQ FILE: 1.fq
#REVERSE FASTQ FILE: 2.fq
#REFERENCE: ref.fasta
#USED THREADS: 12
#SAMPLE NAME: IND1
#SAM OUTPUT FILE: ind1.sam
#build index
bwa index ref.fasta
#map reads
bwa mem -t 12 -M -R @RG\tID:IND1\tLB:lib1\tPL:ILLUMINA\tPU:unit1\tSM:IND1 -o ind1.sam ref.fasta 1.fq 2.fq
#sort sam file
java -jar picard.jar SortSam I=ind1.sam O=ind1.sorted.bam SO=coordinate
NextGenMap
for mappingThe usere needs to perform four steps:
fastq
files, prefernetially QC
pre-processed (see trimmomatic for one of many trimming tools available)fasta
filemap reads
sort reads
#FORWARD FASTQ FILE: 1.fq
#REVERSE FASTQ FILE: 2.fq
#REFERENCE: ref.fasta
#USED THREADS: 12
#SAMPLE NAME: IND1
#SAM OUTPUT FILE: ind1.bam
#map reads
ngm -1 1.fq -2 2.fq -r ref.fasta -o ind1.bam --no-unal --sensitive -t 12 --no-progress --rg-id IND1 --rg-sm IND1 --rg-lb lib1 --rg-pl illumina --rg-pu unit1 -b
#sort bam file
java -jar picard.jar SortSam I=ind1.bam O=ind1.sorted.bam SO=coordinate
sorted bam
filesIn case that multiple fastq
libraries exists or have been mapped without prior merging of the fastq
files, the User can merge reference mapped bam
files as follows:
java -jar picard.jar MergeSamFiles I=ind1_1.sorted.bam I=ind1_2.sorted.bam O=ind1.sorted.bam
sorted bam
filejava -jar picard.jar RemoveDuplicates REMOVE_DUPLICATES=true I=ind1.sorted.bam O=ind1.sorted.nodup.bam M=ind1.sorted.bam.duplicate.metrics
IUPAC fasta
file for the de-duplicated referenced mapped bam
fileChromosomes should be processed separately to be able to merge different samples into chromosome alignments
, which can be processed with distIUPAC
as follows:
#chromosomes should be processed separately to be able to easily merge different samples into one alignment to be processed with 'distIUPAC'
#chromosome 'chr1' will be processed here
samtools index ind1.sorted.nodup.bam
angsd -doFasta 4 -doCounts 1 -minQ 20 -minMapQ 30 -uniqueOnly -setMinDepth 5 -setMaxDepth 100 -iupacRatio 0.2 -i ind1.sorted.nodup.bam -out IND1.minQ20.minMapQ30.uniqueOnly.setMinDepth5.setMaxDepth100.chr1 -r chr1
Assuming all IUPAC fasta
files are located in the same folder and same chromosome files have the same ending, individuals can be merged as follows:
#chromosome 'chr1' from different individuals will be processed here
#1. uncompress fasta files
for file in *.chr1.fa.gz;do gunzip $file;done
#2. rename fasta sequences according to file names
for file in *.chr1.fa;do sed -i 's/>chr1/>'"$file"'/g' $file;done
#3. merge fasta files
for file in *.chr1.fa;do cat $file >> chr1.fa;done
IUPAC distances
for 'chr1' with distIUPAC
The following commands needs to be executed in R
:
library(distIUPAC)
dna<-readDNAStringSet("chr1.fa")
chr1.dist<-distIUPAC(as.character(subseq(dna,1,10000)))
Korneliussen TS, Albrechtsen A and Nielsen R. ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics (2014) 15:356 https://doi.org/10.1186/s12859-014-0356-4
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics (2009) 25:1754 https://doi.org/10.1093/bioinformatics/btp324
Picard tools http://broadinstitute.github.io/picard
Sedlazeck FJ, Rescheneder P, von Haeseler A. NextGenMap: fast and accurate read mapping in highly polymorphic genomes. Bioinformatics (2013) 21:2790 https://doi.org/10.1093/bioinformatics/btt468
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.