distIUPAC: calculates IUPAC distances on Biostrings DNAStringSet and BStringSet objects

title: "01. bam2IUPAC" author: "Kristian K Ullrich" date: "2019-02-13" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{01. bam2IUPAC} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc}

In this vignette Users learn how to create IUPAC fasta format (IUPAC code) files from reference mapped bam files which can be used with distIUPAC to calculate IUPAC distances.

The following external tools needs to be installed to be able to obtain IUPAC fasta format files:

Genome mapper of your choice (e.g. bwa mem or NextGenMap)
Picard tools for bam file sorting and de-duplication (picard tools)
angsd for IUPAC fasta format retrieval (angsd)

A manual how the pre-requisites needs to be installed is given at the end of this vignette.

The user needs to perform five steps:

fastq files, preferentially QC pre-processed (see trimmomatic for one of many trimming tools available)
reference fasta file
build index for the reference
map reads
sort reads

#FORWARD FASTQ FILE: 1.fq
#REVERSE FASTQ FILE: 2.fq
#REFERENCE: ref.fasta
#USED THREADS: 12
#SAMPLE NAME: IND1
#SAM OUTPUT FILE: ind1.sam
#build index
bwa index ref.fasta
#map reads
bwa mem -t 12 -M -R @RG\tID:IND1\tLB:lib1\tPL:ILLUMINA\tPU:unit1\tSM:IND1 -o ind1.sam ref.fasta 1.fq 2.fq
#sort sam file
java -jar picard.jar SortSam I=ind1.sam O=ind1.sorted.bam SO=coordinate

The usere needs to perform four steps:

fastq files, preferentially QC pre-processed (see trimmomatic for one of many trimming tools available)
reference fasta file
map reads
sort reads

#FORWARD FASTQ FILE: 1.fq
#REVERSE FASTQ FILE: 2.fq
#REFERENCE: ref.fasta
#USED THREADS: 12
#SAMPLE NAME: IND1
#SAM OUTPUT FILE: ind1.bam
#map reads
ngm -1 1.fq -2 2.fq -r ref.fasta -o ind1.bam --no-unal --sensitive -t 12 --no-progress --rg-id IND1 --rg-sm IND1 --rg-lb lib1 --rg-pl illumina --rg-pu unit1 -b
#sort bam file
java -jar picard.jar SortSam I=ind1.bam O=ind1.sorted.bam SO=coordinate

In case that multiple fastq libraries exists or have been mapped without prior merging of the fastq files, the User can merge reference mapped bam files as follows:

java -jar picard.jar MergeSamFiles I=ind1_1.sorted.bam I=ind1_2.sorted.bam O=ind1.sorted.bam

java -jar picard.jar RemoveDuplicates REMOVE_DUPLICATES=true I=ind1.sorted.bam O=ind1.sorted.nodup.bam M=ind1.sorted.bam.duplicate.metrics

Chromosomes should be processed separately to be able to merge different samples into chromosome alignments, which can be processed with distIUPAC as follows:

#chromosomes should be processed separately to be able to easily merge different samples into one alignment to be processed with 'distIUPAC'
#chromosome 'chr1' will be processed here
samtools index ind1.sorted.nodup.bam
angsd -doFasta 4 -doCounts 1 -minQ 20 -minMapQ 30 -uniqueOnly -setMinDepth 5 -setMaxDepth 100 -iupacRatio 0.2 -i ind1.sorted.nodup.bam -out IND1.minQ20.minMapQ30.uniqueOnly.setMinDepth5.setMaxDepth100.chr1 -r chr1

Assuming all IUPAC fasta files are located in the same folder and same chromosome files have the same ending, individuals can be merged as follows:

#chromosome 'chr1' from different individuals will be processed here
#1. uncompress fasta files
for file in *.chr1.fa.gz;do gunzip $file;done
#2. rename fasta sequences according to file names
for file in *.chr1.fa;do sed -i 's/>chr1/>'"$file"'/g' $file;done
#3. merge fasta files
for file in *.chr1.fa;do cat $file >> chr1.fa;done

The following commands needs to be executed in R:

library(distIUPAC)
dna<-readDNAStringSet("chr1.fa")
chr1.dist<-distIUPAC(as.character(subseq(dna,1,10000)))

Short description of how to compile the external tools for a unix based system are given. See the next section for MAC OS X installation.

#download latest release from NextGenMap
wget https://codeload.github.com/Cibiv/NextGenMap/tar.gz/NextGenMap-0.5.5.tar.gz
tar -xvf v0.5.5.tar.gz
rm v0.5.5.tar.gz
cd NextGenMap-0.5.5/
mkdir -p build/
cd build/
cmake ..
make

#download latest release 'picard.jar' from broadinstitute
wget https://github.com/broadinstitute/picard/releases/download/2.17.4/picard.jar

#download latest release ANGSD from https://github.com/ANGSD/angsd
git clone https://github.com/samtools/htslib.git;
git clone https://github.com/angsd/angsd.git;
cd htslib;make;
cd ../angsd;
make HTSSRC=../htslib

For a MAC OS X system there are additional pre-requisites that needs to be installed to be able to compile all necessary software.

'Xcode' needs to be installed from App-Store

'Homebrew' needs to be installed https://brew.sh/index_de.html

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)

'Git' needs to be installed https://git-scm.com/download/mac

download latest Git DMG Image
Open DMG Iamge and Install PKG file

'Autoconf' needs to be installed http://www.gnu.org/software/autoconf/autoconf.html

brew install autoconf

'Cmake' needs to be installed https://cmake.org/download/

download latest cmake DMG Image release https://cmake.org/files/v3.10/cmake-3.10.2-Darwin-x86_64.dmg
Open DMG Image and copy to Applications
in a Terminal

sudo "/Applications/CMake.app/Contents/bin/cmake-gui" --install

#download latest release from NextGenMap
curl -L https://github.com/Cibiv/NextGenMap/archive/v0.5.5.tar.gz > v0.5.5.tar.gz
tar -xvf v0.5.5.tar.gz
rm v0.5.5.tar.gz
cd NextGenMap-0.5.5/
mkdir -p build/
cd build/
cmake ..
make

#download latest release 'picard.jar' from broadinstitute
curl -L https://github.com/broadinstitute/picard/releases/download/2.17.4/picard.jar > picard.jar

#download latest release ANGSD from https://github.com/ANGSD/angsd
git clone https://github.com/samtools/htslib.git;
git clone https://github.com/angsd/angsd.git;
cd htslib;/usr/local/Cellar/autoconf/2.69/bin/autoconf;/usr/local/Cellar/autoconf/2.69/bin/autoheader;./configure --disable-lzma;make;
cd ../angsd;
make HTSSRC=../htslib

Korneliussen TS, Albrechtsen A and Nielsen R. ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics (2014) 15:356 https://doi.org/10.1186/s12859-014-0356-4

Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics (2009) 25:1754 https://doi.org/10.1093/bioinformatics/btp324

Picard tools http://broadinstitute.github.io/picard

Sedlazeck FJ, Rescheneder P, von Haeseler A. NextGenMap: fast and accurate read mapping in highly polymorphic genomes. Bioinformatics (2013) 21:2790 https://doi.org/10.1093/bioinformatics/btt468

kullrich/distIUPAC documentation built on Jan. 9, 2020, 2:50 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

kullrich/distIUPAC
calculates IUPAC distances on Biostrings DNAStringSet and BStringSet objects

vignettes/bam2IUPAC.md
In kullrich/distIUPAC: calculates IUPAC distances on Biostrings DNAStringSet and BStringSet objects

Pre-requistes for obtaining fasta format files from reference mapped bam files

STEP 1: Mapping `fastq` files to a `reference`

Using `BWA` for mapping:

Using `NextGenMap` for mapping

OPTIONAL STEP: Merge `sorted bam` files

STEP 2: Remove duplicates from `sorted bam` file

STEP 3: Create `IUPAC fasta` file for the de-duplicated `referenced mapped bam` file

STEP 4: Repeat STEP 1 to STEP 3 for multiple individuals

STEP 5: Merge samples into chromosome alignments

Calculating `IUPAC distances` for 'chr1' with `distIUPAC`

Pre-requisites installation - unix based systems (no MAC OS X)

NextGenMap installation

Picard tools installation

ANGSD installation

Pre-requisites installation - MAC OS X systems

Xcode Command Line Tools

Homebrew

Git

Autoconf via Homebrew

Cmake

NextGenMap installation

Picard tools installation

ANGSD installation

References

R Package Documentation

Browse R Packages

We want your feedback!

kullrich/distIUPAC calculates IUPAC distances on Biostrings DNAStringSet and BStringSet objects

vignettes/bam2IUPAC.md In kullrich/distIUPAC: calculates IUPAC distances on Biostrings DNAStringSet and BStringSet objects

Pre-requistes for obtaining fasta format files from reference mapped bam files

STEP 1: Mapping fastq files to a reference

Using BWA for mapping:

Using NextGenMap for mapping

OPTIONAL STEP: Merge sorted bam files

STEP 2: Remove duplicates from sorted bam file

STEP 3: Create IUPAC fasta file for the de-duplicated referenced mapped bam file

STEP 4: Repeat STEP 1 to STEP 3 for multiple individuals

STEP 5: Merge samples into chromosome alignments

Calculating IUPAC distances for 'chr1' with distIUPAC

Pre-requisites installation - unix based systems (no MAC OS X)

NextGenMap installation

Picard tools installation

ANGSD installation

Pre-requisites installation - MAC OS X systems

Xcode Command Line Tools

Homebrew

Git

Autoconf via Homebrew

Cmake

NextGenMap installation

Picard tools installation

ANGSD installation

References

R Package Documentation

Browse R Packages

We want your feedback!

kullrich/distIUPAC
calculates IUPAC distances on Biostrings DNAStringSet and BStringSet objects

vignettes/bam2IUPAC.md
In kullrich/distIUPAC: calculates IUPAC distances on Biostrings DNAStringSet and BStringSet objects

STEP 1: Mapping `fastq` files to a `reference`

Using `BWA` for mapping:

Using `NextGenMap` for mapping

OPTIONAL STEP: Merge `sorted bam` files

STEP 2: Remove duplicates from `sorted bam` file

STEP 3: Create `IUPAC fasta` file for the de-duplicated `referenced mapped bam` file

Calculating `IUPAC distances` for 'chr1' with `distIUPAC`