In usnistgov/peprr: R pepr utilities

library(peprDnaStability)
library(peprr)
library(Hmisc)
library(magrittr)
library(ggplot2)
library(dplyr)
library(tidyr)
library(png)
library(jpeg)
library(grid)
library(knitr)
require(kfigr)
opts_chunk$set(message=FALSE, warning=FALSE, echo = FALSE)

````r source("rm_metadata.R") source("calc_results_values.R")

```r 
peprDB <- dplyr::src_sqlite(db_path)

Introduction

General background

Increasingly, high stakes decisions impacting public health and safety are being made using microbial genomic sequencing data [@Tang2014]. For example, whole genome sequencing was used as part of an investigation of the European 2011 Escherichia coli O1O4:H4 sprout associated outbreak [@Grad2012]. As the stakes increase so does the required level of confidence in the measurement. This reference material as well as three other microbial genomic DNA RM8375-8378 were developed to help advance the measurement assurance of microbial genomic sequencing and DNA sequencing in general. The development of these reference material was supported through an Interagency agreement with the Food and Drug Administration's Center for Devices, for use in validating DNA sequencing platforms for clinical applications.

Material Description

Loftstrand Labs Limited[^lofstrand] grew a large batch of r rm_strain provided by r strain_source to produce ~ r vial_count * rm_mass mg of total extracted DNA, divided equally into r vial_count vials labeled with RM number and strain name (r figr(label = "rmlabel", prefix = TRUE, link = TRUE, type = "Figure")). Each unit of r rm_number has approximately r rm_mass ug of extracted genomic DNA in TE buffer (10 mM Tris-HCl and 0.1 mM EDTA). The material is stored in 0.5 mL screw cap microcentrifuge tubes (Item number 1405-9710 USA Scientific [^usa]). r box_count cardboard boxes with r box_size wells were shipped to NIST. There were r box_size tubes in all of the boxes excluding box r exclusion_box. A total of r vial_count vials were received with r rm_mass ug of DNA in r rm_volume ul.

grid.raster(readPNG("RMLabel.png"))

This RM is isolated DNA rather than live cells because cells can mutate with each cell division, and the genome sequence may not be stable over time for live cells. Extracting DNA from a large batch of cells helps ensure that all vials contain essentially the same DNA sequences. Even though the DNA in the cells is likely to mutate at low frequency during the growth, the resulting mutations will be in extremely small proportions of the overall cells unless they confer a selective advantage and occur early in the culture process. Potential high frequency mutations will be identified as part of the material characterization process. Even if any mutations confer a selective advantage, the DNA from the large batch of cells was mixed before aliquoting to minimize the variability in the proportions of mutations among vials. Details specific to this RM regarding strain selection and material production included in the report of analysis cover letter.

Intended Use

The intended use of r rm_number is to help assess performance of high-throughput DNA sequencing method validation and quality control. The genomic DNA is intended to be analyzed in the same way as any other sample a laboratory would analyze extracted DNA, such as through the use of a genome assembly or variant calling bioinformatic pipelines (r figr(label = "process", prefix = TRUE, link = TRUE, type = "Figure")). Because the RM is extracted DNA, it does not assess pre-analytical steps such as DNA extraction. It does however, challenge sequencing library preparation, sequencing machines, base calling algorithms, and the subsequent bioinformatic analyses. This RM is not intended to assess bioinformatics steps such as strain identificaiton, phylogenetic analysis, or genome annotation. It is important to recognize the genome sequence is provided as an “Information Value” rather than “Certified Value” because we were not yet fully confident that we accounted for all known biases. However, we evaluated the material using a number of orthogonal methods to minimize the impact of systematic errors associated with individual platforms.

grid.raster(readPNG("measurement_process.png"))

Genome Sequence as a reference value

Characterizing a microbial genome RM at each of the $r rm_genome_size/1000000\times 10^6$ positions poses a significant metrological challenge. The consensus base call at each genomic position is a “nominal property”, because it is not a numerical value. A previous sequence-based SRM 2374 was certified primarily based on capillary (or “Sanger”) sequencing, a relatively mature sequencing technology that is not easily scalable to whole genomes. The maturity of capillary sequencing as well as the small size of SRM 2374 (< 100,000 bases) allowed for manual curation of capillary sequencing traces across the entire DNA sequences, and the assignment of qualitative (non-probabilistic) statements about the uncertainty of each base call.

To characterize r rm_number, we relied on a variety of less mature “Next Generation Sequencing” (NGS) technologies. Unlike capillary sequencing, NGS technologies can sequence a whole genome at a reasonable cost. NGS technologies measure many reads at each position, from which variant callers and consensus base callers calculate probabilistic estimates of uncertainty based on Bayesian statistics and assuming binomial sampling. Unfortunately, these probabilistic estimates of uncertainty are rarely accurate due to various sources of bias and error in the sequencing and bioinformatics (r figr(label = "causeEffect", prefix = TRUE, link = TRUE, type = "Figure"), see [@Olson2015] for a detailed discussion of these sources of error). Therefore, to develop a characterized reference genome sequence we first generated a de-novo assembly of the genome then performed a base level analysis of the genome assembly.

grid.raster(readPNG("MicroRMCauseEffect.png"))

In addition to the sources of error and bias associated with the measurement there is an unknown degree of biological variability within the batch. The genomic DNA was homogeneized prior to aliquoting into the individual vials but the biologically variability within the population of cells from which the DNA was extracted is unknown. The mutations that make up this variability include single nucleotide polymorphisms (SNPs), insertions and deletions (indels), as well as structural variants. Low frequency SNPs and indels can be mistaken as sequencing errors and low frequency structural variants can lead to ambiguities in the genome assembly. Our characterization methods did not attempt to identify low frequency biologically variability within the RM batch, however, when interpreting the reference material characterization procedure results the potential for these low frequency mutations should be considered.

Genomic purity is provided as an information value and has not been evaluated for sources of bias and error. The genomic purity of the reference material was assessed using a taxonomic sequence classification algorithm, the accuracy of the method was evaluated using simulated sequence data (Olson et al. in prep). However, sources of bias associated with the measurement, namely accuracy of the database and assignment algorithm, have not been investigated.

Characterization Process

Sequence Characterization

Sequencing Measurements

Eight vials from the r rm_number lot were randomly selected as representatives of the lot for sequencing at NIST. In addition to NIST measurements, the material was sequenced using the Pacific Biosciences [^pacbio] RSII, and optical mapping was performed by OpGen through their MapIt service.

The r rm_strain candidate reference materials were sequenced using three orthogonal sequencing platforms; Pacific Biosciences RSII (PacBio)[^pacbio], Ion Torrent PGM [^iontorrent], and Illumina MiSeq [^illumina]. For PacBio sequencing, the sequencing library was prepared using DNA Template Prep Kit 3.0 with pooled DNA from three randomly sampled vials of RM8376. The resulting library which was sequenced with the P6-C4 chemistry. For the Ion Torrent PGM [^iontorrent], and Illumina MiSeq [^illumina] seqeuncing eight replicate vials were randomly sampled from the lot of r vial_count vials. For MiSeq technical replicate libraries were prepared for each of the eight vials using the Nextera DNA Sample Prep Kit[^illumina], samples were barcoded using the Nextera Index Kit and sequenced using the MiSeq 600 cycle Reagent kit v3 for 2 X 300 bp reads. The 16 libraries were pooled and sequencing in a single run. Single barcoded 400 bp Ion Torrent PGM libraries were prepared for each of the eight vials using the Ion Xpress Plus kit [^iontorrent], with the Ion Xpress plus fragment plus library kit. The vials were barcoded using the IonXpress Kit[^iontorrent] and sequencing template was prepared using the Ion PGM Template OT2 400 kit and the IonPGM400 kit was used for sequencing on a 318C chip.

Sequence Analysis Methods

For the genome assembly we started with a de novo genome assembly and validated the assembly using optical mapping and sequencing data generated using orthogonal measurement methods. Pacific Biosciences [^pacbio] RSII sequencing data has median length of around 8kb, which is long enough to spand the ribosomal operon repeat regions, often resulting in a complete genome assembly [@Koren2013]. While the long reads allow for the assembly of a complete genome the relatively high error rate ~11\%[^pb-error] (primarily insertions and deletions) for individual reads can lead to small variant errors.

The assembly was validated using orthogonal methods; optical mapping and short read sequencing data. Optical mapping provides information regarding the overall structure of the genome based on the location of restriction sites within long fragments (average size > 200Mb). Comparison of the fragment patterns between the optical mapping data and in-silico generated fragment patterns from the genome assembly indicates potential misassemblies in the genome assembly [@Mendelowitz2014]. The Optical mapping data was generated by a commercial vendor OpGen[^opgen] and sources of bias and error have not been fully investigated. The due to the resolution limit of the optical mapping method it is unable to detect misassemblies smaller than 3 kb.

Pilon [@Walker2014], software developed to identify and correct misassemblies in microbial genomes, was used to identify and correct potential misassemblies smaller than 3 kb. Short read data generated using orthogonal sequencing technology Illumina[^illumina] MiSeq was used as Pilon input. The sources of bias associated with each of the orthogonal methods were not fully investigated and agreement between the methods supports the hypothesis that the assembly is correct.

Disagreements between the genome assembly and the optical mapping or short read results were investigated for sources of bias to account for the discrepancy. When appropriate the genome assembly was corrected. If we were unable to identify the source of the bias that caused the misassembly the genome sequence is annotated appropriately to indicate region where the assembly is ambiguous. It is important to note that agreement between orthogonal methods does not guarantee a correct result as they may all be susceptible to the same source(s) of bias. We intend to update the genome assembly as new data is collected and errors in the assembly are revealed.

Base level analysis was performed by evaluating the proportion of sequencing reads with bases supporting the genome assembly base call at individual genome positions, within the the lot (base purity) and among individual vials (base homogeneity). Short read sequencing data generated using the MiSeq and PGM were used for base level analysis. Base purity was defined for the two platform individually, using samtools mpileup [@Li2009]. Base homogeneity, was assessed using varscan somatic variant caller [@Koboldt2009], using the methods described in RM8398 report of analysis (REF).

The purity of the reference material in terms of presence of DNA from sources other than r paste0("_",rm_strain,"_") was assessed using the metagenomic taxonomic read classification algorithm PathoScope [@Hong2014]. This method uses an expectation maximization algorithm where the sequence data are first mapped to a database comprised on all sequence data in the Genbank nt database, then through an iterative process re-assigns ambiguously mapped reads to based on the proportion of reads mapped unambigously to individual taxa in the database.

The purity of the genomic material, presence of DNA from sources other than r paste0("_",rm_strain,"_") was assessed using the metagenomic taxonomic read classification algorithm PathoScope 2.0 [@Hong2014]. This method uses an expectation maximization algorithm where the sequence data are first mapped to a database comprised on all sequence data in the Genbank nt database, then through an iterative process re-assigns ambiguously mapped reads to based on the proportion of reads mapped unambiguously to individual taxa in the database. The PathoScope 2.0 taxonomic read classification pipeline includes an initial read filtering step (PathoQC), followed by mapping reads to a reference database (PathoMap - a wrapper for bowtie2 [@Langmead2012]), then an expectation-maximization classification algorithm (PathoID). The annotated Genbank nt database provided by the PathoScope developers was used as the reference database (ftp://pathoscope.bumc.bu.edu/data/nt_ti.fa.gz).

Method Reproducibility

To promote reproducible and transparent of material characterization software was developed Pipeline for Evaluating Prokaryotic References "PEPR". PEPR consists of three bioinformatic pipelines written in python (r figr(label = "workflowFig", prefix = TRUE, link = TRUE, type = "Figure")). The three bioinformatic pipelines are Genome Evaluation, Genome Characterization, and Genomic Purity. A YAML file (www.yaml.org) is used to define pipeline inputs. The pipeline coordinates the execution of a number of commandline tools, logging the standard output and standard error for each executed command in time stamped files for reference and debugging. Pipeline code avaiable at www.github.com/usnistgov/pepr. To reduce the barrier for reuse two Docker (www.docker.com) containers are available with pre-installed pipeline dependencies. Docker is a lightweight virtual environment that facilitates the sharing and distribution of computing environments that can be run on any desktop, cloud, and high performance computing environment, regardless of the operating system. The pepr container (www.registry.hub.docker.com/u/natedolson/pepr/) includes dependencies for the genome evaluation and characterization pipelines, excluding the Genome Analysis Toolkit (due to licensing restrictions), a Dockerfile for building the container is included in the pepr repository. The docker-pathoscope (www.registry.hub.docker.com/u/natedolson/docker-pathoscope/) has dependencies for the genomic purity pipeline installed.

grid.raster(readPNG("pepr-workflow.png"))

A software package peprr (github.com/usnistgov/peprr ) was developed for the statistical computing language R to compile the output from the genome evaluation, characterization, and genomic purity pipelines [@R2015]. The compiled data was formated into a series of data tables within a sqlite database to facilitate downstream analysis [@wickham2014tidy]. The package included functions to generate a number of summary tables and figures, including those in this report of analysis.

Sequence Analysis Results

Summary of Sequencing Datasets

The number of reads, read, length, insert size for paired-end datasets as well as coverage for r rm_number datasets, presented in (r figr(label = "seq_table", prefix = TRUE, link = TRUE, type = "Table")). Sequence data summary metrics were calculated as part of the Genomic Characterization Pipeline and loaded into peprDB ,an sqlite database, the summary table was generated from the peprDB. The MiSeq sequencing run with an average of r round(mean_miseq_library_read_count/2000000, digits = 1) million pair-end reads per library with a median read length of r paste0(as.character(miseq_med_read_length),"bp"). Whereas the PGM sequencing run produced r round(mean_pgm_library_read_count/2000000, digits = 1) million reads per library on average with a median read length of r paste0(as.character(pgm_med_read_length),"bp"). The higher throughput and paired-end reads resulted in a higher per library coverage for MiSeq compared to PGM (r paste0(as.character(mean_miseq_library_coverage),"X") vs. r paste0(as.character(mean_pgm_library_coverage),"X")). The r pb_replicate_number PacBio datasets are technical sequencing replicates (SMRT cells) from the same sequencing library, with a median read length r paste0(as.character(pacbio_med_read_length),"bp") and r paste0(as.character(pacbio_total_coverage),"X") total coverage. Between the three platforms a total of r paste0(as.character(total_coverage),"X") was obtained.

kable(seq_summary_table(peprDB), 
      round = 0, row.names = FALSE, caption = "Summary of sequencing datasets")

Genome Assembly

The de novo PacBio genome assembly was validated using Optical mappign and MiSeq short read sequencing data. Agarose plugs generated from the same culture stock as that used to generate the DNA reference material was used for optical mapping measurement (r figr(label = "opgenAssembly", prefix = TRUE, type = "Figure")). Restriction enzyme r restriction_enzyme was used for the restriction digest. The resulting genome map was aligned to the de novo assembly to identify potential errors in the PacBio assembly (r figr(label = "opgenAssembly", prefix = TRUE, type = "Figure")).

grid.raster(readPNG("opgen_assembly.png"))

grid.raster(readPNG("opgen_assembly_comparison.png"))

Base Level Analysis

Base Purity

.get_chrom_names <- function(db_con){
    dplyr::tbl(db_con, "pur_join") %>%
        dplyr::select(CHROM) %>%
        dplyr::collect() %>%
        .$CHROM %>% unique() %>%
        grep("novel", ., invert = TRUE, value = TRUE)
}
chrom_names <- .get_chrom_names(peprDB)


pur_dat <- dplyr::tbl(peprDB, "pur_join") %>%
    dplyr::filter(CHROM %in% chrom_names) %>%
    dplyr::collect() 
pur_dat_id <- pur_dat %>% 
    mutate(plat1_group = ifelse(plat1 < 0.99, "MiSeq-Low", "MiSeq-High")) %>% 
    mutate(plat2_group = ifelse(plat2 < 0.99, "PGM-Low", "PGM-High")) %>% 
    mutate(pur_group = paste(plat1_group, plat2_group, sep = " "))

pur_dat_id_filt <- pur_dat_id %>%
    dplyr::filter((plat1 < 0.99 | plat2 < 0.99))

min_val <- min(c(pur_dat_id_filt$plat1, pur_dat_id_filt$plat2)) 

## values for text
pos_pur_gt99_both <- pur_dat_id %>% filter(pur_group == "MiSeq-High PGM-High") %>% nrow()
pur_dat_max <- pur_dat_id %>% rowwise() %>% mutate(max_pur = max(plat1, plat2))
pos_pur_gt99_one <- pur_dat_max %>% filter(max_pur > 0.99) %>% nrow()
pos_pur_gt97_one <- pur_dat_max %>% filter(max_pur > 0.97) %>% nrow()
min_max_pur_position  <- paste0(as.character(round(min(pur_dat_max$max_pur), 2)*100),"%")

The base purity metric is the number bases in reads aligned to a genome position that are in agreement with the reference base divided by the total number of reads aligned to the position in agreement with either the reference base or alternate, excluding low quality bases. Comparison of the purity values between for all positions in the genome for two orthogonal sequencing methods was used to differentiate between positions with low purity values due to platform specific biases and those due to true biological diversity (r figr(label = "basePurityScatter", prefix = TRUE, link = TRUE, type = "Figure")). Out of $r rm_genome_size/1000000\times 10^6$ positions in the genome r pos_pur_gt99_both positions had purity values greater than 99\% for both short read sequencing platforms (r figr(label = "purityContingencyTable", prefix = TRUE, link = TRUE, type = "Table")). r pos_pur_gt99_one and r pos_pur_gt97_one positions with purity values greater than 99% and 97% for one of the two platforms. All positions had a purity value greater than r min_max_pur_position for at least one of the two platforms. Genome position distribution of the positions with low purity values for one of or both platforms in r figr(label = "basePurityPosDistribution", prefix = TRUE, link = TRUE, type = "Figure"). Purity values and number of bases in agreement with the reference and alternate base call for the 20 genome positions with the lowest average purity values for the two platforms (r figr(label = "purityLowPosTable", prefix = TRUE, link = TRUE, type = "Table")). Positions with the low purity values for one platform and not another indicate a potential platform specific bias. The reference base is identified using a third orthogonal sequencing method (PacBio RSII) and thus a low value for one of the two short reads sequencing platforms represents agreement between two of the three sequencing methods.

pur_dat_id %>% group_by(pur_group) %>% summarise(count = n()) %>% 
    kable(row.names = NA, caption = "Number of genome positions with purity values higher and lower than 0.99 for MiSeq and PGM sequencing platforms.")

ggplot2::ggplot(pur_dat_id_filt) +
    ggplot2::geom_point(ggplot2::aes(x = plat1, y = plat2, color = pur_group),
                        alpha = 0.5) +
    ggplot2::labs(x = "MiSeq", y = "PGM") + ggplot2::theme_bw() + 
    ggplot2::xlim(min_val, 1) + ggplot2::ylim(min_val, 1)

ggplot(pur_dat_id_filt %>% filter(CHROM == chrom_names[1])) + 
    geom_bar(aes(x = POS, fill = pur_group)) + 
    facet_wrap(~pur_group, ncol = 1, scales = "free_y") + theme_bw()

kable(low_purity_table(peprDB, rename_chroms = rename_chroms), 
      row.names = FALSE,
      caption = "Genome positions with lowest average purity values for MiSeq and PGM")

Sequence homogeneity

The genomic material homogeneity was assessed through a pairwise statistical analysis of the replicate MiSeq datasets using the varsan somatic variant caller (r figr(label = "homogeneityTable", prefix = TRUE, link = TRUE, type = "Table")). Comparison of individial dataset purity values for genome positions with purity values < 0.99 for MiSeq and PGM (r figr(label = "homogeneityPointFig", prefix = TRUE, link = TRUE, type = "Figure")). Inhomogeneity is indicated by purity values for replicate vials (indicated by point color) consistently outside the distribution of purity values observed for other vials. Differences in purity value distributions between the two platforms indicate platform specific biases.

kable(homogeneity_sig_table(peprDB, rename_cols = TRUE), 
      row.names = FALSE,
      caption = "Pairwise variant analysis results")

homogeneity_point_line_figure(peprDB)

Genomic Purity

Short read sequencing data was used to identify the proportion of DNA in the material from an organism other than the material strain. The PGM and MiSeq data the reference material has minimal genomic contaminants (r figr(label = "contamCountsFig", prefix = TRUE, link = TRUE, type = "Figure")), with a maximum of r max_contam reads in any dataset classified as not belonging to the genus r rm_genus. Abundance of contaminants for different taxonomic groups is show below in r figr(label = "contamPointLineFig", prefix = TRUE, link = TRUE, type = "Figure").

contam_counts_figure(peprDB, rm_genus)

contam_point_line_figure(peprDB, rm_genus)

DNA Fragment Size Stability

r rm_number is stable under the recommended storage conditions, although there is evidence that DNA fragment size may change over time, as indicated by the accelerated aging study. Samples were exposed to aging outside those recommended for normal use in a controlled experimental setting. The results of this experiments indicate the degradation, or fragmentation, of the DNA may come after advanced aging.

Description of Stability Study

Vials of RM material were selected at random to assess stability of the DNA, in terms of the distribution of the molecular weight of the DNA. For the stability study, a random number generator was used to determine the vials to measure.

An isochronous stability study was performed to ensure integrity of DNA fragment size for candidate r rm_numberover a variety of storage. Two time courses and temperature conditions (4°C, and 37°C) were used to mimic accelerated aging. The eight-week time course was started first, followed by a two-week time course. The study was conducted such that all time courses ended on the same day. A -20°C temperature control was used since this is the recommended storage condition of candidate r rm_number.

All temperature-based conditions were monitored with a data logger to track temperature, humidity and dew point over the entirety of the time-course. All samples were stored in opaque boxes to minimize exposure the light. At the conclusion of the time-course samples were placed in a -20°C freezer until measurements were made.

The stability study samples were analyze using gel electrophoresis. Six gels were run with one replicate for each treatment per gel along with two controls and two ladders(r figr(label = "stabilityExpFigure", prefix = TRUE, link = TRUE, type = "Figure")). Samples were randomly assigned to gels and lanes.

gel_design <- read.csv(gel_metadata) %>% 
                mutate(gel = sub(pattern = ".*_Gel_0",replacement =  "Gel-", x = gel),
                       gel = sub(pattern = ".jpg",replacement =  "", x = gel))

ggplot(gel_design) + geom_raster(aes(x = gel, y = lane, fill = condition), alpha = 0.5) +
    geom_text(aes(x = gel, y = lane, label = vial)) + facet_wrap(~gel,nrow = 1, scales = "free_x") + theme_bw() + theme(axis.text.x = element_blank(), axis.title.x = element_blank(), axis.ticks.x = element_blank(), legend.position = "bottom", legend.direction = "horizontal") + labs(y = "Lane", fill = "Conditions")

Analysis of Stability Study

To assess the stability of the DNA fragment size distribtion the proportion of DNA in a PFGE gel lane for each sample was determined for two regions between ladder markers. The high region is the proportion of DNA between the 48.5kb and 197kb markers and the low region between 9.24kb and 48.5kb markers. Treatment proportions were compared to controls using a T-test. The gels were processed using the R package peprrDnaStability (www.github/nate-d-olson/peprDnaStability). Prior to processing the gels were first manually cropped to remove regions of the gel to the left and right of the outer lanes. The runfindMarkers starts up an interactive application to define the parameters used in processing the gels. Cropped and original gel images as well as the parameter set used to analyze the gels are available at %%TODO%%.

Stability Study Results

Degradation of DNA leads to an incresed proportion of lower molecular weight DNA fragments of DNA than high molecular weight DNA fragments. The proportion of DNA in the high weight regions was lower for all treatments, with 8 week 37°C and 2 weeks at 4°C was statistically significant (p < 0.0125), when correcting for multiple comparisons.The difference between control and treatment the proportion of high molecular weight DNA (the ratio of DNA between the 194 and 48.5kb markers to the total DNA between the 194 and 9.42 kb markers) is show in r figr(label = "stabilityAnalysis", prefix = TRUE, link = TRUE, type = "Figure"). A one sided T-Test was used to determine if the difference in proportions of high molecular weight was significantly lower for the treatments compared to the controls (r figr(label = "stabilityTable", prefix = TRUE, link = TRUE, type = "Table")).

batch_data <- batch_process_gels(gel_data_dir)

## combine data frames
intensity_df <- dplyr::data_frame()
markers_df <- dplyr::data_frame()
binned_df <- dplyr::data_frame()
for(x in batch_data){
    gel_name <- x$gel
    intensity_df <- x$intensity_dat %>%
        dplyr::mutate(gel = gel_name) %>%
        dplyr::bind_rows(intensity_df)
    markers_df <- x$marker_dat %>%
        dplyr::mutate(gel = gel_name) %>%
        dplyr::bind_rows(markers_df)
    binned_df <- x$bin_dat %>%
        dplyr::mutate(gel = gel_name) %>%
        dplyr::bind_rows(binned_df)
}

## adding metadata
meta <- read.csv(gel_metadata,stringsAsFactors = F) %>% filter(condition != "Blank")
intensity_df$lane <- paste0("L", intensity_df$lane)
intensity <- left_join(intensity_df, meta) %>% filter(condition != "Blank")
markers_df$lane <- paste0("L", markers_df$lane)
markers <- left_join(markers_df, meta) %>% filter(condition != "Blank")
binned_df$lane <-paste0("L", binned_df$lane)
binned <- left_join(binned_df, meta) %>% filter(condition != "Blank")

highlow_bin <- binned %>%
    filter(!is.na(vial), bins %in% c("23.1-9.42", "48.5-23.1",
                                     "97-48.5","194-97")) %>%
    mutate(bins = ifelse(bins %in% c("23.1-9.42", "48.5-23.1"), 
                         "low","high")) %>% 
    group_by(lane, bins, gel, vial, condition) %>% 
    summarise(bin_intensity = sum(bin_intensity))

highlow_ratio <- highlow_bin %>% 
    spread(bins, bin_intensity) %>% 
    mutate(bin_ratio = high/(low + high))

highlow_ratio_control<- highlow_ratio %>% filter(condition == "Control")

control_ratio <- highlow_ratio_control %>% 
                    group_by(gel) %>% 
                    summarise(control_ratio = mean(bin_ratio))
highlow_diff <- highlow_ratio %>% 
                    left_join(control_ratio) %>% 
                    mutate(ratio_diff = bin_ratio - control_ratio) %>% 
                    filter(condition != "Control" & condition != "Ladder") %>% 
                    mutate(gel = sub(pattern = ".*Gel_0",replacement = "", x = gel),
                           gel = sub(".jpg","", gel))
ggplot(highlow_diff) + geom_point(aes(x = condition, y = ratio_diff, color = gel)) + 
    labs(x = "Treatment", y = "Ratio Difference from Control") + 
    theme_bw()

make_one_sided_ttest_df <- function (dat, value, gelAsFactor = TRUE) {
    tt_df <- data_frame()
    ## Pairwise comparison to control for each bin and each condition
    for(bin in unique(dat$bins)){
        df <- dat %>% filter(bins == bin)
        y <- df  %>% filter(condition == "Control")  %>% .[[value]]
        for(treatment in unique(df$condition[df$condition != "Control"])){
            x <-df  %>% filter(condition == treatment)  %>% .[[value]]
            tt <- t.test(x,y,alternative = "less")
            tt_df <- data_frame(size_bins = bin,
                                condition = treatment,
                                tstatistic = tt$statistic,
                                pvalue = tt$p.value) %>% 
                bind_rows(tt_df)
        }
    }
    tt_df %>% mutate(sig = ifelse(pvalue < 0.05/n(),TRUE,FALSE))
}

tt_df <- highlow_ratio %>% mutate(bins = "H") %>% 
            filter(condition != "Ladder") %>% 
            make_one_sided_ttest_df(value = "bin_ratio", gelAsFactor = FALSE)
tt_table <- tt_df %>% mutate(pvalue = round(pvalue, 4) %>% as.character(),
                             pvalue = ifelse(sig == TRUE, paste0("__",pvalue,"__"),pvalue),
                             pvalue = ifelse(pvalue == "__0__", "__<0.0000__", pvalue),
                             tstatistic = round(tstatistic, 2)) %>% 
            select(-sig) %>% 
            spread(size_bins, pvalue)
colnames(tt_table) <- c("Treatment", "T-Statistic", "p-value") 

kable(tt_table, row.names = FALSE, caption = "P-value of one sided t-tests comparing proprotion of DNA in high and low molecular weight bins between treatment and control. Significant p-values (alpha = 0.0125) are indicated in bold.")

\clearpage

References

[^opgen]: OpGen Inc. opgen.com 708 Quince Orchard Road Gaithersburg, MD 20878 USA [^usa]: USA Scientific, Inc. usascientific.com PO Box 3565 Ocala, FL 34478 USA [^illumina]: Illumina Inc., illumina.com/ 5200 Illumina Way San Diego, CA 92122 USA [^iontorrent]: Life Technologies Corp., iontorrent.com/ 7000 Shoreline Court #201, South San Francisco, CA 94080 USA [^pacbio]: Pacific Biosciences of California Inc. pacificbiosciences.com/ 1380 Willow Rd. Menlo Park, CA 94025 USA [^rast]: Rapid Annotaiton Using Subsystem Technology (RAST) server rast.nmpdr.org [^lofstrand]: Lofstrand Labs Limited 7961 Cessna Avenue, Gaithersburg, MD 20879 lofstrand.com/ [^pb-error]: Manufacturer stated error rate, in "Understanding Accuracy in SMRT Sequencing" http://www.pacificbiosciences.com

usnistgov/peprr documentation built on May 3, 2019, 2:38 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

usnistgov/peprr
R pepr utilities

In usnistgov/peprr: R pepr utilities

Introduction

General background

Material Description

Intended Use

Genome Sequence as a reference value

Characterization Process

Sequence Characterization

Sequencing Measurements

Sequence Analysis Methods

Method Reproducibility

Sequence Analysis Results

Summary of Sequencing Datasets

Genome Assembly

Base Level Analysis

Base Purity

Sequence homogeneity

Genomic Purity

DNA Fragment Size Stability

Description of Stability Study

Analysis of Stability Study

Stability Study Results

References

R Package Documentation

Browse R Packages

We want your feedback!

usnistgov/peprr R pepr utilities

In usnistgov/peprr: R pepr utilities

Introduction

General background

Material Description

Intended Use

Genome Sequence as a reference value

Characterization Process

Sequence Characterization

Sequencing Measurements

Sequence Analysis Methods

Method Reproducibility

Sequence Analysis Results

Summary of Sequencing Datasets

Genome Assembly

Base Level Analysis

Base Purity

Sequence homogeneity

Genomic Purity

DNA Fragment Size Stability

Description of Stability Study

Analysis of Stability Study

Stability Study Results

References

R Package Documentation

Browse R Packages

We want your feedback!

usnistgov/peprr
R pepr utilities