r params$run_name
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, fig.align = "center", fig.retina = 2, fig.height = 4, fig.width = 7, out.width = "100%") #Load libraries library(dplyr) library(ggplot2) library(scales) library(refsamp)
quast_report <- extract_quast(params$input_dir) checkm_report <- extract_checkm(params$input_dir) bbtools_report <- extract_bbtools(params$input_dir) run_date <- get_run_date(params$run_name)
refsamp_run_metrics <- merging_by_sample(c("bbtools_report", "quast_report", "checkm_report"), run_date = run_date) if ( file.exists(params$history_data) ) { history_data <- readr::read_csv(params$history_data, col_types = "cnnnnnnnnnnncc") names(history_data) <- names(refsamp_run_metrics) history_data <- history_data %>% mutate(Run_date = as.Date(.$Run_date, format = "%Y-%m-%d")) %>% bind_rows(refsamp_run_metrics) %>% distinct(across(c(Sample, Run_date)), .keep_all = TRUE) %>% filter(difftime(Run_date, run_date, units = "days") %>% as.numeric() %>% `>`(-360)) } else { history_data <- refsamp_run_metrics }
A set of reference samples are continuously sequenced in our in house sequencing facility and the assembly metrics are monitored to ensure quality of the data produced. All samples were pre-processed and their quality was analyzed using the Juno pipeline. Some of the quality metrics are analyzed and followed over time.
List of reference samples:
refsamp:::genera_criteria
Number of contigs : total number of contigs in the assembly REF.
In the Juno pipeline, scaffolds smaller than 500 nt are filtered out. This plot shows the number of contigs that were kept in the pipeline.
Interpretation help: The number of contigs per sample should be lower than the acceptance criteria (dashed line). If no dashed line is shown, it means that the acceptance criteria for that genus has not been determined or uploaded to the {refsampr} package.
plot_time_metrics(history_data, "# contigs", scales = 'free_y') + labs(y = "Number of contigs ( >500nt )", x = "") + scale_y_continuous( label = comma_format(accuracy = 0.1) )
Total length : Total number of bases in the assembly REF.
Interpretation help: The total length of the genome per sample should be in between the upper and lower limits of the acceptance criteria (dashed lines). If no dashed lines are shown, it means that the acceptance criteria for that genus has not been determined or uploaded to the {refsampr} package.
plot_time_metrics(history_data, "Total length", scales = 'free_y') + labs(y = "Total length", x = "") + scale_y_continuous( label = comma_format(accuracy = 1) )
N50 : The largest contig length, L, such that using contigs of length forumla accounts for at least 50% of the bases of the assembly REF.
The N50 relates to contiguity of the assembly. The N50 value is a threshold contig size. If contigs with size N50 or larger are kept, half of the reference genome can be covered REF.
Interpretation help: The N50 should be higher than the acceptance criteria (dashed line). If no dashed line is shown, it means that the acceptance criteria for that genus has not been determined or uploaded to the {refsampr} package.
plot_time_metrics(history_data, "N50", scales = 'free_y') + labs(y = "N50 (bp)", x = "") + scale_y_continuous( label = comma_format(accuracy = 1) )
GC content : percentage of nucleotides that are either G or C.
A deviation from the actual (literature reported) GC content in the reference genome can reflect problems in the sequencing of that particular genus/species, contamination or even swapping of samples.
Interpretation help: The total length of the genome per sample should be in between the upper and lower limits of the acceptance criteria (dashed lines). If no dashed lines are shown, it means that the acceptance criteria for that genus has not been determined or uploaded to the {refsampr} package.
plot_time_metrics(history_data, "GC (%)", scales = 'free_y') + labs(x = NULL, y = "GC content") + scale_y_continuous( label = percent_format(accuracy = 0.1, scale = 1) )
Percent mapped : percentage of reads that could be mapped back to the assembly.
Interpretation help: In the case of isolates, the higher the percentage of reads that could be mapped, the better. This metric is normally very high so do pay attention to a drop or a trend to dropping because it could be indicating contamination.
plot_time_metrics(history_data, "Percent mapped") + labs(x = NULL, y = "Mapped Reads")+ scale_y_continuous( label = percent_format(accuracy = 0.1, scale = 1) )
Coverage, depth of coverage or mapping depth : average number of times a base of a genome is sequenced. Calculated as the number of bases of all short reads that match a genome divided by the length of this genome. It is often expressed as 1X, 2X, 3X,... (1, 2, or, 3 times coverage) REF.
Interpretation help: The average depth of coverage should be higher than the acceptance criteria (dashed line). If no dashed line is shown, it means that the acceptance criteria for that genus has not been determined or uploaded to the {refsampr} package.
plot_time_metrics(history_data, "Average coverage") + labs(x = NULL, y = "Average coverage") + scale_y_continuous( label = comma_format(accuracy = 1, suffix = " X") )
Percent of reference bases covered : percentage of bases of a reference genome that are covered by at least 1 read.
Interpretation help: In the case of isolates, the higher the percentage of bases covered, the better. This metric is normally very high so do pay attention to a drop or a trend to dropping because it could be indicating more errors than expected or even a sample swap.
plot_time_metrics(history_data, "Percent of reference bases covered") + labs(x = NULL, y = "Reference Bases Covered")+ scale_y_continuous( label = percent_format(accuracy = 0.1, scale = 1) )
CheckM makes use of a set of taxon-specific marker genes. It calculates statistics based on that.
Contamination : estimated from the number of multicopy marker genes identified in each marker set REF.
Interpretation help: The average depth of coverage should be lower than the acceptance criteria (dashed line). If no dashed line is shown, it means that the acceptance criteria for that genus has not been determined or uploaded to the {refsampr} package.
plot_time_metrics(history_data, "contamination") + labs(x = NULL, y = "Contamination")+ scale_y_continuous( label = percent_format(accuracy = 0.01, scale = 1) )
Genome completeness : estimated as the number of marker sets present in a genome taking into account that only a portion of a marker set may be identified REF.
Interpretation help: The genome completeness should be higher than the acceptance criteria (dashed line). If no dashed line is shown, it means that the acceptance criteria for that genus has not been determined or uploaded to the {refsampr} package.
plot_time_metrics(history_data, "completeness") + labs(x = NULL, y = "Completeness") + scale_y_continuous( label = percent_format(accuracy = 0.1, scale = 1) )
write.csv(history_data, paste0(params$history_data), row.names = FALSE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.