vignettes/bam2IUPAC.md

title: "01. bam2IUPAC" author: "Kristian K Ullrich" date: "2019-02-13" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{01. bam2IUPAC} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc}

In this vignette Users learn how to create IUPAC fasta format (IUPAC code) files from reference mapped bam files which can be used with distIUPAC to calculate IUPAC distances.

Pre-requistes for obtaining fasta format files from reference mapped bam files

The following external tools needs to be installed to be able to obtain IUPAC fasta format files:

  1. Genome mapper of your choice (e.g. bwa mem or NextGenMap)
  2. Picard tools for bam file sorting and de-duplication (picard tools)
  3. angsd for IUPAC fasta format retrieval (angsd)

A manual how the pre-requisites needs to be installed is given at the end of this vignette.

STEP 1: Mapping fastq files to a reference

Using BWA for mapping:

The user needs to perform five steps:

  1. fastq files, preferentially QC pre-processed (see trimmomatic for one of many trimming tools available)
  2. reference fasta file
  3. build index for the reference
  4. map reads
  5. sort reads
#FORWARD FASTQ FILE: 1.fq
#REVERSE FASTQ FILE: 2.fq
#REFERENCE: ref.fasta
#USED THREADS: 12
#SAMPLE NAME: IND1
#SAM OUTPUT FILE: ind1.sam
#build index
bwa index ref.fasta
#map reads
bwa mem -t 12 -M -R @RG\tID:IND1\tLB:lib1\tPL:ILLUMINA\tPU:unit1\tSM:IND1 -o ind1.sam ref.fasta 1.fq 2.fq
#sort sam file
java -jar picard.jar SortSam I=ind1.sam O=ind1.sorted.bam SO=coordinate

Using NextGenMap for mapping

The usere needs to perform four steps:

  1. fastq files, preferentially QC pre-processed (see trimmomatic for one of many trimming tools available)
  2. reference fasta file
  3. map reads
  4. sort reads
#FORWARD FASTQ FILE: 1.fq
#REVERSE FASTQ FILE: 2.fq
#REFERENCE: ref.fasta
#USED THREADS: 12
#SAMPLE NAME: IND1
#SAM OUTPUT FILE: ind1.bam
#map reads
ngm -1 1.fq -2 2.fq -r ref.fasta -o ind1.bam --no-unal --sensitive -t 12 --no-progress --rg-id IND1 --rg-sm IND1 --rg-lb lib1 --rg-pl illumina --rg-pu unit1 -b
#sort bam file
java -jar picard.jar SortSam I=ind1.bam O=ind1.sorted.bam SO=coordinate

OPTIONAL STEP: Merge sorted bam files

In case that multiple fastq libraries exists or have been mapped without prior merging of the fastq files, the User can merge reference mapped bam files as follows:

java -jar picard.jar MergeSamFiles I=ind1_1.sorted.bam I=ind1_2.sorted.bam O=ind1.sorted.bam

STEP 2: Remove duplicates from sorted bam file

java -jar picard.jar RemoveDuplicates REMOVE_DUPLICATES=true I=ind1.sorted.bam O=ind1.sorted.nodup.bam M=ind1.sorted.bam.duplicate.metrics

STEP 3: Create IUPAC fasta file for the de-duplicated referenced mapped bam file

Chromosomes should be processed separately to be able to merge different samples into chromosome alignments, which can be processed with distIUPAC as follows:

#chromosomes should be processed separately to be able to easily merge different samples into one alignment to be processed with 'distIUPAC'
#chromosome 'chr1' will be processed here
samtools index ind1.sorted.nodup.bam
angsd -doFasta 4 -doCounts 1 -minQ 20 -minMapQ 30 -uniqueOnly -setMinDepth 5 -setMaxDepth 100 -iupacRatio 0.2 -i ind1.sorted.nodup.bam -out IND1.minQ20.minMapQ30.uniqueOnly.setMinDepth5.setMaxDepth100.chr1 -r chr1

STEP 4: Repeat STEP 1 to STEP 3 for multiple individuals

STEP 5: Merge samples into chromosome alignments

Assuming all IUPAC fasta files are located in the same folder and same chromosome files have the same ending, individuals can be merged as follows:

#chromosome 'chr1' from different individuals will be processed here
#1. uncompress fasta files
for file in *.chr1.fa.gz;do gunzip $file;done
#2. rename fasta sequences according to file names
for file in *.chr1.fa;do sed -i 's/>chr1/>'"$file"'/g' $file;done
#3. merge fasta files
for file in *.chr1.fa;do cat $file >> chr1.fa;done

Calculating IUPAC distances for 'chr1' with distIUPAC

The following commands needs to be executed in R:

library(distIUPAC)
dna<-readDNAStringSet("chr1.fa")
chr1.dist<-distIUPAC(as.character(subseq(dna,1,10000)))

Pre-requisites installation - unix based systems (no MAC OS X)

Short description of how to compile the external tools for a unix based system are given. See the next section for MAC OS X installation.

NextGenMap installation

#download latest release from NextGenMap
wget https://codeload.github.com/Cibiv/NextGenMap/tar.gz/NextGenMap-0.5.5.tar.gz
tar -xvf v0.5.5.tar.gz
rm v0.5.5.tar.gz
cd NextGenMap-0.5.5/
mkdir -p build/
cd build/
cmake ..
make

Picard tools installation

#download latest release 'picard.jar' from broadinstitute
wget https://github.com/broadinstitute/picard/releases/download/2.17.4/picard.jar

ANGSD installation

#download latest release ANGSD from https://github.com/ANGSD/angsd
git clone https://github.com/samtools/htslib.git;
git clone https://github.com/angsd/angsd.git;
cd htslib;make;
cd ../angsd;
make HTSSRC=../htslib

Pre-requisites installation - MAC OS X systems

For a MAC OS X system there are additional pre-requisites that needs to be installed to be able to compile all necessary software.

Xcode Command Line Tools

'Xcode' needs to be installed from App-Store

Homebrew

'Homebrew' needs to be installed https://brew.sh/index_de.html

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)

Git

'Git' needs to be installed https://git-scm.com/download/mac

  1. download latest Git DMG Image
  2. Open DMG Iamge and Install PKG file

Autoconf via Homebrew

'Autoconf' needs to be installed http://www.gnu.org/software/autoconf/autoconf.html

brew install autoconf

Cmake

'Cmake' needs to be installed https://cmake.org/download/

  1. download latest cmake DMG Image release https://cmake.org/files/v3.10/cmake-3.10.2-Darwin-x86_64.dmg
  2. Open DMG Image and copy to Applications
  3. in a Terminal
sudo "/Applications/CMake.app/Contents/bin/cmake-gui" --install

NextGenMap installation

#download latest release from NextGenMap
curl -L https://github.com/Cibiv/NextGenMap/archive/v0.5.5.tar.gz > v0.5.5.tar.gz
tar -xvf v0.5.5.tar.gz
rm v0.5.5.tar.gz
cd NextGenMap-0.5.5/
mkdir -p build/
cd build/
cmake ..
make

Picard tools installation

#download latest release 'picard.jar' from broadinstitute
curl -L https://github.com/broadinstitute/picard/releases/download/2.17.4/picard.jar > picard.jar

ANGSD installation

#download latest release ANGSD from https://github.com/ANGSD/angsd
git clone https://github.com/samtools/htslib.git;
git clone https://github.com/angsd/angsd.git;
cd htslib;/usr/local/Cellar/autoconf/2.69/bin/autoconf;/usr/local/Cellar/autoconf/2.69/bin/autoheader;./configure --disable-lzma;make;
cd ../angsd;
make HTSSRC=../htslib

References

Korneliussen TS, Albrechtsen A and Nielsen R. ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics (2014) 15:356 https://doi.org/10.1186/s12859-014-0356-4

Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics (2009) 25:1754 https://doi.org/10.1093/bioinformatics/btp324

Picard tools http://broadinstitute.github.io/picard

Sedlazeck FJ, Rescheneder P, von Haeseler A. NextGenMap: fast and accurate read mapping in highly polymorphic genomes. Bioinformatics (2013) 21:2790 https://doi.org/10.1093/bioinformatics/btt468



kullrich/distIUPAC documentation built on Jan. 9, 2020, 2:50 p.m.