# An Introduction to microhaplot data preparation

This vignette gives an overview of microhaplot and describes how to prepare data for input into microhaplot. Once your data have been transferred to the appropriate microhaplot app folder, see the microhaplot walkthrough vignette to learn more about microhaplot's graphical features.

## How does microhaplot work?

microhaplot is a tool that makes it easy to extract, visualize, and interpret short-read haplotypes---or "microhaplotypes" in the parlance of some---from aligned sequences. Before we go further, it will be good to step back and define exactly what we mean by a short-read haplotype, or "microhaplotype," in the context of microhaplot.

## What is a microhaplotype?

To define a microhaplotype you first need a set of reference positions that reads might align to. microhaplot then analyzes the reads aligning at those positions and extracts the variants that each read carries there. For example, let's say that we know SNPs occur at positions 19, 27, 68, and 101 of a particular reference sequence to which we have aligned our reads. If a read aligns to the reference and shows an A at position 19, a G at position 27, a C at position 68, and another C at position 101, then we say the microhaplotype on that read is AGCC.
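
To make that concrete, here is a toy R sketch (not microhaplot code; the read and SNP positions are invented for illustration) of pulling a microhaplotype off a single read that aligns starting at position 1 of the reference:

```r
# Toy example: extract a microhaplotype from one aligned read.
# The read sequence and SNP positions are made up for illustration.
set.seed(1)
read <- paste(sample(c("A", "C", "G", "T"), 120, replace = TRUE), collapse = "")
snp.positions <- c(19, 27, 68, 101)  # known SNP positions on the reference

# the microhaplotype is just the concatenation of the read's bases
# at the SNP positions
microhap <- paste(substring(read, snp.positions, snp.positions), collapse = "")
microhap
```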

microhaplot is a tool that

  1. automates the extraction of these haplotypes from aligned data (typically amplicon sequencing data), and
  2. provides useful filtering and visualization capabilities for understanding your microhaplotype data and for calling genotypes at microhaplotypes.

To understand what is involved in getting your data into microhaplot, it will be useful to have a brief introduction to what it does and how it operates. We will start by listing what it is not.

## What kind of data can I use with microhaplot?

'microhaplot' is designed to work with short-read sequencing data in which there is a relatively small number of sequencing start sites. It is emphatically not geared toward whole-genome shotgun sequencing data, in which you may have overlapping reads with different starting places. We use 'microhaplot' primarily with amplicon sequencing data in which a few hundred fragments (of about 100-150 bp each) are amplified by PCR in a few hundred individuals. The DNA from different individuals is bar-coded so that it may be demultiplexed later, and then these amplified fragments are sequenced (usually to fairly high read depth) on an Illumina platform.

Though we use 'microhaplot' for amplicon sequencing, it might also be useful for analyzing haplotype data from experiments that sequence captured or baited fragments (e.g., baited ddRAD, RAPTURE, myBaits, ExonCapture, etc.), and might even be applicable to standard ddRAD or RAD-PE experiments. However, as the number of loci increases and the average read depth per individual-locus combination decreases, the unevenness of coverage can become a bigger problem. (We also have more reservations about allelic dropout/null alleles when using digestion-based protocols (ddRAD, RAPTURE, etc.) than we do with amplicon sequencing.)

## Why did we write 'microhaplot'?

We tried to use existing tools to give us what we needed for our amplicon-based microhaplotypes.

After a few months of this, we realized that there was no software tailored to the relatively straightforward task of extracting and analyzing microhaplotypes. The other software programs were all designed for different purposes, and, as a consequence, while they work great for identifying variants or performing de novo assemblies, one should not expect them to handle the specialized task of extracting and analyzing microhaplotypes.

## How, specifically, should I prepare my data?

As its starting point, microhaplot requires:

  1. An indexed SAM file of reads aligned to the reference sequences for each amplicon.
  2. A VCF file whose contents define the positions in each reference sequence from which you want to extract variants into microhaplotypes. Note that the individuals that appear in the VCF file need not be the same ones whose reads appear in the alignments; the VCF is merely used as a way of defining the positions to extract (see the sketch just below this list). Still, since the content of each microhaplotype is based on the variant positions listed in the VCF file, it is crucial that the VCF file contain only highly reliable variant sites.
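
To underscore that point, here is a minimal base-R sketch (the file name `my_variants.vcf` is hypothetical) of the only information microhaplot really takes from the VCF: the CHROM and POS columns.

```r
# Minimal sketch: the only information microhaplot takes from the VCF
# is which positions on which reference sequences are variant.
# "my_variants.vcf" is a hypothetical file name.
vcf.lines <- readLines("my_variants.vcf")
body <- vcf.lines[!grepl("^#", vcf.lines)]  # drop the header lines
fields <- strsplit(body, "\t")
variant.sites <- data.frame(
  CHROM = vapply(fields, `[`, character(1), 1),            # amplicon/reference name
  POS   = as.integer(vapply(fields, `[`, character(1), 2))  # 1-based position
)
head(variant.sites)
```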

We describe here the workflow that we use to get these two necessary files from the short-read amplicon sequencing data that come off our Illumina MiSeq. This workflow can be tweaked and tailored to your own data. Each step is described in one of the following subsections.

### Starting state

Our description starts at the point where we have copied all of our `fastq.gz` files from the sequencer into a directory called `rawdata`. Next to that directory we have made two more directories, `flash` and `map`, in which we will be creating files and doing work. We have named all the Illumina files so that a directory listing of `rawdata` looks like this:
```{sh dir-list, eval = FALSE}
/rawdata/--% ll | head
total 3517864
drwxrwxr-x 2 biopipe biopipe  122880 Aug 18 11:57 ./
drwxrwxr-x 5 biopipe biopipe    4096 Aug 18 11:31 ../
-rw-r--r-- 1 biopipe biopipe  672293 Aug 18 11:53 satro_S100_L001_I1_001.fastq.gz
-rw-r--r-- 1 biopipe biopipe  725385 Aug 18 11:53 satro_S100_L001_I2_001.fastq.gz
-rw-r--r-- 1 biopipe biopipe 4656145 Aug 18 11:53 satro_S100_L001_R1_001.fastq.gz
-rw-r--r-- 1 biopipe biopipe 4749637 Aug 18 11:53 satro_S100_L001_R2_001.fastq.gz
-rw-r--r-- 1 biopipe biopipe  635863 Aug 18 11:53 satro_S101_L001_I1_001.fastq.gz
-rw-r--r-- 1 biopipe biopipe  711534 Aug 18 11:53 satro_S101_L001_I2_001.fastq.gz
-rw-r--r-- 1 biopipe biopipe 4410826 Aug 18 11:53 satro_S101_L001_R1_001.fastq.gz
```

In this case, there are 384 such sets of files.  The number after the capital 
S in the filename gives the ID number of each individual fish.
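
Before going further, it can be worth confirming that every individual has a complete set of read files. Here is a quick optional check in R, assuming the naming scheme above (the expected count of 384 is specific to this run):

```r
# Optional sanity check: confirm all 384 individuals have R1 and R2 files,
# given the satro_S<i>_L001_R[12]_001.fastq.gz naming scheme above.
r1 <- list.files("rawdata", pattern = "^satro_S[0-9]+_L001_R1_001\\.fastq\\.gz$")
r2 <- list.files("rawdata", pattern = "^satro_S[0-9]+_L001_R2_001\\.fastq\\.gz$")
length(r1)  # expect 384
length(r2)  # expect 384
```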

### Flashing the reads together
Our data are paired-end reads of fragments that are short enough that the forward reads
and the reverse reads overlap.  Rather than treat them as two separate reads,
we combine them into a single read using [flash](https://ccb.jhu.edu/software/FLASH/).
Here is the command we use to cycle over all the files and flash them, run from the `flash` directory
which is at the same level as the `rawdata` directory.  (Note that the `/flash/--%` in the following code
fragment is just the command prompt---it tells you which directory we are in...)
```{sh flash, eval = FALSE}
/flash/--% for i in {1..384}; do flash -m 10 -M 100 -z --output-prefix=S${i} ../rawdata/satro_S${i}_L001_R1_001.fastq.gz ../rawdata/satro_S${i}_L001_R2_001.fastq.gz; done
```

The output that flash spits out to stdout looks like this:
```{sh flash-output, eval = FALSE}
[FLASH] Starting FLASH v1.2.11
[FLASH] Fast Length Adjustment of SHort reads
[FLASH]
[FLASH] Input files:
[FLASH]     ../rawdata/satro_S1_L001_R1_001.fastq.gz
[FLASH]     ../rawdata/satro_S1_L001_R2_001.fastq.gz
[FLASH]
[FLASH] Output files:
[FLASH]     ./S1.extendedFrags.fastq.gz
[FLASH]     ./S1.notCombined_1.fastq.gz
[FLASH]     ./S1.notCombined_2.fastq.gz
[FLASH]     ./S1.hist
[FLASH]     ./S1.histogram
[FLASH]
[FLASH] Parameters:
[FLASH]     Min overlap:           10
[FLASH]     Max overlap:           100
[FLASH]     Max mismatch density:  0.250000
[FLASH]     Allow "outie" pairs:   false
[FLASH]     Cap mismatch quals:    false
[FLASH]     Combiner threads:      12
[FLASH]     Input format:          FASTQ, phred_offset=33
[FLASH]     Output format:         FASTQ, phred_offset=33, gzip
[FLASH]
```

This has created a series of files named like `S${i}.extendedFrags.fastq.gz` in which the `${i}` is replaced by 
a number between 1 and 384.  These files are gzipped fastq files that hold the flashed reads.
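
If you want a quick look at how the merging went, the `.hist` files that flash writes are handy. Here is a small R sketch, assuming flash's two-column histogram format (merged-read length, then count):

```r
# Optional: summarize merged-read lengths for individual 1 from flash's
# two-column .hist output (merged read length, count of reads).
h <- read.table("flash/S1.hist", col.names = c("length", "count"))
sum(h$count)                  # total merged reads for this individual
h$length[which.max(h$count)]  # modal merged-read length
```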

### Mapping/Aligning the Reads

The reference sequences we will use to align to are in a fasta file called `gtseq15_loci.fasta`.  We put that in the `map` directory
and then commence mapping the flashed reads as follows:
```{sh mapping, eval=FALSE}
# index the reference. Output will be prefixed with "gtseq15_loci"
/map/--% bwa index -p gtseq15_loci -a is gtseq15_loci.fasta

# here is what comes by on stdout 
[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.00 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.5a-r405
[main] CMD: bwa index -p gtseq15_loci -a is gtseq15_loci.fasta
[main] Real time: 0.061 sec; CPU: 0.007 sec


# map reads. for each input file, the output is named satro_flashed_s${i}_aln.sam
# Note the -R to insert appropriate read group header lines
/map/--% for i in {1..384}; do bwa mem -aM -v 3 -t 10 -R "@RG\tID:s${i}\tLB:amplicon\tPL:ILLUMINA\tSM:satro${i}" ./gtseq15_loci ../flash/S${i}.extendedFrags.fastq.gz  > ./satro_flashed_s${i}_aln.sam; done


# convert the SAMs to BAMs 
/map/--% for i in {1..384}; do samtools view -bhS satro_flashed_s${i}_aln.sam > satro_flashed_s${i}.bam; done


# Sort the BAMs
/map/--% for i in {1..384}; do samtools sort satro_flashed_s${i}.bam -o satro_flashed_s${i}_sorted.bam; done


# Index the BAMs
/map/--% for i in {1..384}; do samtools index satro_flashed_s${i}_sorted.bam; done


# Make a text-file list of the BAMs
/map/--% ls -1 *sorted.bam > bamlist.txt


# generate idx stats and put them in a directory called gtseq17_idx
/map/--% mkdir gtseq17_idx
/map/--% for i in {1..384}; do samtools idxstats satro_flashed_s${i}_sorted.bam > ./gtseq17_idx/s${i}_idxstats.txt; done
```
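
At this point it can be worth a quick look at per-locus coverage. Here is a hedged R sketch using the idxstats files just written (samtools idxstats emits four tab-separated columns: reference name, length, mapped reads, unmapped reads):

```r
# Optional: per-locus mapped-read counts for one individual, from the
# four-column samtools idxstats output (name, length, mapped, unmapped).
idx <- read.table("map/gtseq17_idx/s1_idxstats.txt",
                  col.names = c("locus", "length", "mapped", "unmapped"))
idx <- idx[idx$locus != "*", ]   # drop the unmapped-reads summary row
head(idx[order(-idx$mapped), ])  # loci with the most reads
```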

### Variant detection

We are now set up to detect variants using freebayes. We show how we do that here, but emphasize that if you have already chosen a set of variants that form your microhaplotypes, you could just specify those with a pre-existing VCF file. You don't need to do new variant detection if you don't want to. (Of course, if it doesn't take too long, you may as well do variant detection with every individual you have sequenced at the amplicons, so as to get as many variants as possible.)

```{sh freebayes, eval=FALSE}
# index the fasta reference file
/map/--% samtools faidx gtseq15_loci.fasta

# call variants, instructing freebayes to NOT return MNPs or other complex variants
/map/--% nohup freebayes-parallel <(fasta_generate_regions.py gtseq15_loci.fasta.fai 150) 10 -f gtseq15_loci.fasta -L bamlist.txt --haplotype-length 0 -kwVa --no-mnps --no-complex > satro384_noMNP_noComplex_noPriors.vcf
```

### runHaplot

After the above is done, the file `satro384_noMNP_noComplex_noPriors.vcf` can be filtered, if desired, to make 
sure that everything in it is a solid SNP.  Then that file is used with the function
`prepHaplotFiles` to extract haplotypes from the aligned reads in the `map` directory.  
An example VCF file and SAM files extracted from actual GT-seq rockfish 
data are available in `inst/extdata`.
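
As one hedged illustration of that filtering step (base R, no extra packages; the QUAL threshold and output file name are arbitrary choices for the example, not recommendations):

```r
# Illustrative filtering: keep only biallelic SNPs with QUAL > 20.
# The threshold and output file name here are just examples.
vcf <- readLines("map/satro384_noMNP_noComplex_noPriors.vcf")
hdr <- vcf[grepl("^#", vcf)]
body <- vcf[!grepl("^#", vcf)]
f <- strsplit(body, "\t")
qual <- as.numeric(vapply(f, `[`, character(1), 6))
ref <- vapply(f, `[`, character(1), 4)
alt <- vapply(f, `[`, character(1), 5)  # multiallelic sites have a comma here
keep <- !is.na(qual) & qual > 20 & nchar(ref) == 1 & nchar(alt) == 1
writeLines(c(hdr, body[keep]), "map/satro384_solid_snps.vcf")
```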


```r
library(microhaplot)

run.label <- "sebastes"

# unpack the example SAM files that ship with the package into a temp dir
sam.path <- tempdir()
untar(system.file("extdata",
                  "sebastes_sam.tar.gz",
                  package = "microhaplot"),
      exdir = sam.path)

# the label file ties each SAM file to an individual ID and group
label.path <- file.path(sam.path, "label.txt")
vcf.path <- file.path(sam.path, "sebastes.vcf")

# copy the Shiny app into place, then extract the haplotypes
mvShinyHaplot(tempdir())
app.path <- file.path(tempdir(), "microhaplot")

haplo.read.tbl <- prepHaplotFiles(run.label = run.label,
                                  sam.path = sam.path,
                                  out.path = tempdir(),
                                  label.path = label.path,
                                  vcf.path = vcf.path,
                                  app.path = app.path)
```
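
Once `prepHaplotFiles` has finished, the extracted haplotypes are stored alongside the Shiny app, and in current versions of the package the visualization app can be launched with `runShinyHaplot()`; the snippet below assumes the `app.path` created above:

```r
# launch the microhaplot Shiny app on the run we just prepared
runShinyHaplot(app.path)
```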

