In SLeviyang/RegressHaplo: Haplotype Reconstruction Using Penalized Regression

The folder data/ contains a BAM and associated index file for a paired-end NGS dataset. The reads are synthetic, constructed using the ART simulation package.

Running the Pipeline

Make sure you load RegressHaplo

library(RegressHaplo)

then run the RegressHaplo pipeline on the BAM file.

bam_file <- "data/example.bam"
out_dir <- "output/"
dir.create(out_dir)
full_pipeline(bam_file, out_dir, start_pos=500, end_pos=1500, num_trials=700)

We have chosen to restrict reconstruction to reference positions $500$ through $1500$ and to optimize the penalized regression $700$ times. The haplotypes returned will cover the full consensus, composed of $2500$ positions, but positions outside the $500-1500$ region will be set to consensus values. Here we restrict to such a short region for the sake of an example with relatively short run time. On a i7-gen5 machine, it took about 3 minutes to run the code directly above.

Parsing RegressHaplo Results

In this case, we directed RegressHaplo to place output files in the output/ directory. The final output file is output/final_haplo.fasta. The fasta file can be accessed directly; here we use Biostrings to read in and see the sequences.

haps <- readDNAStringSet("output/final_haplo.fasta")
haps

But RegressHaplo also provides several functions to view and analyze the reconstrution.

# determine the positions on the reference that were considered variable 
get_variable_positions.pipeline(out_dir)
# get haplotype information
info <- get_fasta.pipeline(out_dir)
# info is a list containing the elements haplotypes and freq
# freq gives the frequency of the reconstructed haplotypes
info$freq
# info$haplotype is a character vector containing the haplotypes.
class(info$haplotypes)
length(info$haplotypes)
# we can use Biostrings to see the haplotypes since the sequences are rather long
DNAStringSet(info$haplotypes)

Evaluating Regression Performance

After running RegressHaplo, check if the solutions have captured a range of haplotype reconstructions. The solutions summary data.frame describes the results of the $700$ trials. Each row in the data.frame corresponds to a single trial and provides K (the number of reconstructed haplotypes), fit (a fit measure, lower is better), and rho (the penalty parameter). Typically many trials generate the same result - meaning they find the same local minimum.

df <- get_solutions_summary.pipeline(out_dir)
# let's look at every 100th solution
df[seq(from=100,to=700,by=100),]

Check that the different rho have produced a range of K values

plot(df$rho, df$K, xlab="rho", ylab="K")

We have captured solutions with $K$ values (the number of haplotypes reconstructed) ranging from $3$ to $14$, although $K=6,10,11$ are missing. This is a good spread of $K$ values; often a particular $K$ value will be missed.

We can check the $rho$ values used

unique(df$rho)

If further $K$ values are desired, additional rho values can be specified through the full_pipeline rho parameter, see help for specifics. Increasing rho will lower $K$.

We can also check whether the final solution chosen, $K=4$, reflects a good tradeoff between over and under fitting.

#  K values vs fit
plot(df$K, df$fit, xlab="K", ylab="fit")

Here $K=5$ would have produced a better fit, but $K=4$ seems reasonable.

Finally, we can visualize the accuracy of the haplotype reconstruction in terms of the predicted vs sampled frequencies at each position. To do this, we pick a solution. For example, for $K=4$, we can pick the solution with the lowest fit.

df4 <- dplyr::filter(df, K==4) %>%
       dplyr::arrange(fit)
# the best solution is now the in the first row of df4
solution <- df4$solution_number[1]
solution

And given a particular solution, we can visualize the frequency of particular nucleotides at each variable position. Here we show A and C, but any combination of A,C,G,T,d,i are possible. The variable positions shown can also be varied, see help.

out <- solution_accuracy.pipeline(out_dir, 
                                  solution, 
                                  nucs=c("A", "C"),
                                  plot=T)