Run: r params$run_name

knitr::opts_chunk$set(echo = FALSE, 
                      warning = FALSE, message = FALSE,
                      fig.align = "center", fig.retina = 2,
                      fig.height = 4, fig.width = 7,
                      out.width = "100%")

#Load libraries 
library(dplyr)
library(ggplot2)
library(scales)
library(refsamp)
quast_report <- extract_quast(params$input_dir)

checkm_report <- extract_checkm(params$input_dir)

bbtools_report <- extract_bbtools(params$input_dir)

run_date <- get_run_date(params$run_name)
refsamp_run_metrics <- merging_by_sample(c("bbtools_report", "quast_report", "checkm_report"), 
                                         run_date = run_date)

if ( file.exists(params$history_data) ) {

  history_data <- readr::read_csv(params$history_data, col_types = "cnnnnnnnnnnncc")

  names(history_data) <- names(refsamp_run_metrics)

  history_data <- history_data %>%
    mutate(Run_date = as.Date(.$Run_date, format = "%Y-%m-%d")) %>%
    bind_rows(refsamp_run_metrics) %>%
    distinct(across(c(Sample, Run_date)), .keep_all = TRUE) %>%
    filter(difftime(Run_date, run_date, units = "days") %>% 
             as.numeric() %>% `>`(-360))

} else {

  history_data <- refsamp_run_metrics

}

A set of reference samples are continuously sequenced in our in house sequencing facility and the assembly metrics are monitored to ensure quality of the data produced. All samples were pre-processed and their quality was analyzed using the Juno pipeline. Some of the quality metrics are analyzed and followed over time.

List of reference samples:

refsamp:::genera_criteria

Quality assessment statistics with QUAST

Number of contigs

Number of contigs : total number of contigs in the assembly REF.

In the Juno pipeline, scaffolds smaller than 500 nt are filtered out. This plot shows the number of contigs that were kept in the pipeline.

Interpretation help: The number of contigs per sample should be lower than the acceptance criteria (dashed line). If no dashed line is shown, it means that the acceptance criteria for that genus has not been determined or uploaded to the {refsampr} package.

plot_time_metrics(history_data, "# contigs", scales = 'free_y') +
  labs(y = "Number of contigs ( >500nt )", x = "") +
   scale_y_continuous( label = comma_format(accuracy = 0.1) )

Total length

Total length : Total number of bases in the assembly REF.

Interpretation help: The total length of the genome per sample should be in between the upper and lower limits of the acceptance criteria (dashed lines). If no dashed lines are shown, it means that the acceptance criteria for that genus has not been determined or uploaded to the {refsampr} package.

plot_time_metrics(history_data, "Total length", scales = 'free_y') + 
   labs(y = "Total length", x = "") +
   scale_y_continuous( label = comma_format(accuracy = 1) )

N50

N50 : The largest contig length, L, such that using contigs of length forumla accounts for at least 50% of the bases of the assembly REF.

The N50 relates to contiguity of the assembly. The N50 value is a threshold contig size. If contigs with size N50 or larger are kept, half of the reference genome can be covered REF.

Interpretation help: The N50 should be higher than the acceptance criteria (dashed line). If no dashed line is shown, it means that the acceptance criteria for that genus has not been determined or uploaded to the {refsampr} package.

plot_time_metrics(history_data, "N50", scales = 'free_y') + 
  labs(y = "N50 (bp)", x = "") +
   scale_y_continuous( label = comma_format(accuracy = 1) )

GC content

GC content : percentage of nucleotides that are either G or C.

A deviation from the actual (literature reported) GC content in the reference genome can reflect problems in the sequencing of that particular genus/species, contamination or even swapping of samples.

Interpretation help: The total length of the genome per sample should be in between the upper and lower limits of the acceptance criteria (dashed lines). If no dashed lines are shown, it means that the acceptance criteria for that genus has not been determined or uploaded to the {refsampr} package.

plot_time_metrics(history_data, "GC (%)", scales = 'free_y') + 
  labs(x = NULL, y = "GC content") +
  scale_y_continuous( label = percent_format(accuracy = 0.1, scale = 1) )

Quality assessment statistics with Bbtools

Percent mapped

Percent mapped : percentage of reads that could be mapped back to the assembly.

Interpretation help: In the case of isolates, the higher the percentage of reads that could be mapped, the better. This metric is normally very high so do pay attention to a drop or a trend to dropping because it could be indicating contamination.

plot_time_metrics(history_data, "Percent mapped") + 
  labs(x = NULL, y = "Mapped Reads")+
   scale_y_continuous( label = percent_format(accuracy = 0.1, scale = 1) )

Average depth of coverage

Coverage, depth of coverage or mapping depth : average number of times a base of a genome is sequenced. Calculated as the number of bases of all short reads that match a genome divided by the length of this genome. It is often expressed as 1X, 2X, 3X,... (1, 2, or, 3 times coverage) REF.

Interpretation help: The average depth of coverage should be higher than the acceptance criteria (dashed line). If no dashed line is shown, it means that the acceptance criteria for that genus has not been determined or uploaded to the {refsampr} package.

plot_time_metrics(history_data, "Average coverage") + 
  labs(x = NULL, y = "Average coverage") +
   scale_y_continuous( label = comma_format(accuracy = 1, suffix = " X") )

Percent of reference bases covered

Percent of reference bases covered : percentage of bases of a reference genome that are covered by at least 1 read.

Interpretation help: In the case of isolates, the higher the percentage of bases covered, the better. This metric is normally very high so do pay attention to a drop or a trend to dropping because it could be indicating more errors than expected or even a sample swap.

plot_time_metrics(history_data, "Percent of reference bases covered") + 
  labs(x = NULL, y = "Reference Bases Covered")+
  scale_y_continuous( label = percent_format(accuracy = 0.1, scale = 1) )

Quality assessment statistics with CheckM

CheckM makes use of a set of taxon-specific marker genes. It calculates statistics based on that.

Contamination

Contamination : estimated from the number of multicopy marker genes identified in each marker set REF.

Interpretation help: The average depth of coverage should be lower than the acceptance criteria (dashed line). If no dashed line is shown, it means that the acceptance criteria for that genus has not been determined or uploaded to the {refsampr} package.

plot_time_metrics(history_data, "contamination") + 
  labs(x = NULL, y = "Contamination")+
   scale_y_continuous( label = percent_format(accuracy = 0.01, scale = 1) )

Genome completeness

Genome completeness : estimated as the number of marker sets present in a genome taking into account that only a portion of a marker set may be identified REF.

Interpretation help: The genome completeness should be higher than the acceptance criteria (dashed line). If no dashed line is shown, it means that the acceptance criteria for that genus has not been determined or uploaded to the {refsampr} package.

plot_time_metrics(history_data, "completeness") + 
  labs(x = NULL, y = "Completeness") +
   scale_y_continuous( label = percent_format(accuracy = 0.1, scale = 1) )
write.csv(history_data, paste0(params$history_data), row.names = FALSE)


AleSR13/refsampr documentation built on May 30, 2022, 5:42 a.m.