In Benji-Wagner/SequenceMapper: Map Reads to Reference Barcodes

knitr::opts_chunk$set(collapse = TRUE, comment = "#>")

This package aims to educate users on how sample DNA reads from high-throughput sequencing processes can be mapped to reference barcodes. This package comes with some sample data that can be loaded and ran with the included functions. The reads dataset comes with the read identifier, the called read, and the read's Phred Score. The barcodes dataset comes with the barcode identifier and the reference sequence. The motivating problem is we want to find the number of reads that map to a particular barcode, but we want to allow for mismatches. Thus, by using the Hamming Distance, we can map reads to barcodes that might not be an exact match.

Using the Phred Scores, we can also calculate the probability that a given read maps to a particular barcode by using a probabilistic error model. The joint probability distribution of the originating reads conditional on the corresponding observed reads, $P(\bf{\tilde{r}}^k | \bf{r})$, under the assumption of conditional independence:

$$ \begin{align} P(\tilde{\bf{r}}^k | \textbf{r}) &= P(\tilde{R} = \tilde{\bf{r}}^k | R = \textbf{r}) \ &= P(\tilde{R} = [\tilde{r_1}, \dots, \tilde{r_L}] | R = [r_1, \dots, r_L]) \ &= \prod^L_{i=1} P(\tilde{R_j} = \tilde{r_j}| R = [r_1, \dots, r_L]) \ &= \prod^L_{i=1} P(\tilde{R_j} = \tilde{r_j}| R_j = r_j). \end{align} $$

The marginal conditional probability is modeled as:

$$ P(\tilde{R_j} = \tilde{r_j}| R_j = r_j) = \begin{cases} 1 - \epsilon_j & \tilde{r_j} = r_j \ \frac{\epsilon_j}{3} & \tilde{r_j} \neq r_j \end{cases} $$

This model is adapted from Heng et al [1].

Installation

library(devtools)
# if you don't have the package, run install.packages("devtools")
devtools::install_github("Benji-Wagner/SequenceMapper", build_opts = c("--no-resave-data", "--no-manual"))
# must specify the build_opts to load the vignette

Use Cases

The SequenceMapper package comes loaded with randomly generated FASTQ and FASTA files.

These files can be loaded like so:

library(SequenceMapper)
reads_table <- get_reads_table()
barcodes_table <- get_barcode_table()

By calling the map_reads function, we can map all the reads in reads_table to a barcode in barcodes_table. This is done by computing the Hamming Distance between a given read and every barcode, then filtering out the barcodes that are not within a certain Hamming Distance, which we can specify with the argument distance_threshold. Out of these filtered barcodes, we compute the joint probability of the read mapping to each barcode by using the read's Phred Score. Finally, we take the first barcode with the greatest joint probability of mapping and add that barcode to the corresponding read in the reads_table.

This process has been largely automated, and can be completed like so:

library(dplyr)
# if you don't have this package, run install.packages("dplyr")

mapped_reads <- map_reads(reads_table = reads_table, 
  barcodes_table = barcodes_table, distance_threshold = 2)

Now let's see the number of reads that mapped to a particular barcode and the proportion of mapped reads that mapped to it:

mapped_reads %>% 
  group_by(mapped_barcodes) %>%
  summarize(Number_Mapped = n(), Proportion_Mapped = n()/nrow(.))

Here's a quick look at the distribution of reads mapped to a barcode:

library(ggplot2)
# if you don't have this package, run install.packages("ggplot2")

mapped_reads %>% 
  group_by(mapped_barcodes) %>%
  summarize(Number_Mapped = n(), Proportion_Mapped = n()/nrow(.)) %>% 
  ggplot(mapping = aes(x = Number_Mapped)) + 
  geom_histogram(bins = 40) +
  xlab("Number of Reads Mapped") +
  ylab("Count") +
  ggtitle("Reads Mapped Per Barcode")