knitr::opts_chunk$set(collapse = TRUE, comment = ">", dev = "svg") #knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file()) #options(tibble.print_min = 4L, tibble.print_max = 4L)
From a more biological point of view, this package simplifies many downstream analyses of high-throughput data that otherwise would require many hours of code typing and case-to-case adaptation. Moreover, most of the functions aimed to visualize these kind of data are thought to provide a high level of possible customization with a large number of graphical parameters compared to the commonly used already available tools. Another advantage of this package is that it offers multiple methods, with a corresponding visualization, to quantify the difference of signal between samples, in a qualitatively and/or quantitatively way, without any additional coding.
The guide is divided in four parts: 1) the first one will explore some analyses and visualization of RNA-seq data; 2) the second one the representation and quantification of targeted sequencing data (ChIP-seq, ATAC-seq, etc.); 3) the third part is focused on the analyses of RNA expression by RT-qPCR; 4) while the last part is focused on some of the “general” tools available in this package.
If you use this package, please cite:
**Citation**
citation("Rseb")
For this analysis it is common to use the R-package DESeq2^[Love, M.I., Huber, W., Anders, S., "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2", Genome Biology 15(12):550 (2014)] which returns a table as the following one:
data("RNAseq", package = "Rseb") RNAseq
data("RNAseq", package = "Rseb") set.seed(floor(runif(1, 1, 1000))) n = floor(runif(n = 10, min = 1, max = nrow(RNAseq))) knitr::kable(RNAseq[n,], row.names = F, caption = "**DESeq2 results table example**")
DEstatus
helps in this definition. It takes as input the fold change expression and the p-value adjusted, then a threshold for these two parameters can be set to define four status:
All the labels and thresholds can be custom. We can proceed to add a column with the differential expression status to the original table.
require(dplyr) RNAseq <- RNAseq %>% mutate(DE.status = Rseb::DE.status(log2FC = RNAseq$log2FC, p.value.adjusted = RNAseq$padj, FC_threshold = 2, # Linear value FC_NoResp_left = 0.9, # Automatically 0.9 <= FC <= 1/0.9) p.value_threshold = 0.05, low.FC.status.label = "DOWN", high.FC.status.label = "UP", unresponsive.label = "UNRESPONSIVE", null.label = "NULL"))
knitr::kable(RNAseq[n,], row.names = F, caption = "**RNA-seq table with differential expression status**")
It is possible now to use the DE.status
column to count the number of genes per each group:
RNAseq.summary.table <- RNAseq %>% group_by(DE.status) %>% summarise(N = n()) %>% rbind(c("Total", nrow(RNAseq)))
knitr::kable(RNAseq.summary.table, row.names = F, caption = "<b>RNA-seq differential expression summary</b>", format = "html", table.attr = "style='width:30%';")
require(ggplot2) MA.plot <- ggplot(data = RNAseq, aes(x = log2(baseMean), y = log2FC, col = DE.status)) + geom_point(size = 2) + scale_color_manual(values = c("#F8766D", "gray30", "#00A5CF", "#00BA38")) + ggtitle("MA-plot") + theme_classic() MA.plot
volcano
.
volcano.plot <- Rseb::volcano(log2FC_data = RNAseq$log2FC, padj_data = RNAseq$padj, FC_t = 2, p_t = 0.05, FC_unresponsive_left = 0.9, left_label = "DOWN", unresponsive_label = "UNRESPONSIVE", right_label = "UP", null_label = "NULL", title = "Volcano", point_size = 2) volcano.plot
This function allows to label the points corresponding to differentially expressed genes with the corresponding gene name if required.
volcano.plot.with.names <- Rseb::volcano(log2FC_data = RNAseq$log2FC, padj_data = RNAseq$padj, names = RNAseq$geneName, FC_t = 2, p_t = 0.05, FC_unresponsive_left = 0.9, left_label = "DOWN", unresponsive_label = "UNRESPONSIVE", right_label = "UP", null_label = "NULL", title = "Volcano", point_size = 2, right_names = T, names_size = 3) volcano.plot.with.names
The general workflow to generate this type of graphs includes the generation of a score matrix for each position of each region and each signal (A). This matrix than can be reshaped in order to calculate the mean/median/sum of the signal among regions in a set (B, density profile) or for each single region (C, density summary. Furthermore, it is possible to compute and visualize the difference of signal between to samples (D, density difference area and/or their correlation). This process is schematized in the following figure and will be discussed in detail in the further paragraphs.
#magick::image_read(rsvg::rsvg_svg("url_image")) knitr::include_graphics("https://sebastian-gregoricchio.github.io/Rseb/vignettes/images/Rseb_workflow.svg")
computeMatrix
from deepTools^[Ramírez, Fidel, Devon P. Ryan, Björn Grüning, Vivek Bhardwaj, Fabian Kilpert, Andreas S. Richter, Steffen Heyne, Friederike Dündar, and Thomas Manke. deepTools2: A next Generation Web Server for Deep-Sequencing Data Analysis. Nucleic Acids Research (2016). doi:10.1093/nar/gkw257.]. This package provides a function called computeMatrix.deeptools
that allows to automatically build the command line required to run the tool on the command line.
Alternatively, but based on a more time consuming algorithm, a fully R-based function, density.matrix
, is available with this package (see box below). This function generates a list that can be directly used as input for others functions as described further. The parameters are more or less the same found in the computeMatrix.deeptools
function and for this reason only the latter will be described in this guide.
**The `density.matrix` function**
plot.density.profile
/plot.density.profile.smooth
function which accepts as input a score matrix as described hereafter. Noteworthy, it is possible to narrow the final plot to a specific range by the option "subset.range" without the need to re-compute the matrix.
The function returns a list with the scores table, a list of each single plot and a single multiplot with the combination of all the single plots. The latter can be directly exported by passing the full output path to the option "multiplot.export.file", while the option "print.multiplot" allows to show the multiplot output without call/print the variable (useful to save time during the parameters test phase).
Here an example of the output list structure:
data("deeptools.matrix", package = "Rseb") density.profile.by.group <- Rseb::plot.density.profile( matrix.file = deeptools.matrix, signal.type = "mean", error.type = "sem", plot.by.group = T, missing.data.as.zero = T, y.identical.auto = T, text.size = 10, axis.line.width = 0.25, line.width = 0.5, plot.error = T, write.reference.points = T, plot.vertical.lines = F, colors = c("steelblue", "mediumseagreen"), n.row.multiplot = 2, by.row = T, y.lab = "Mean ChIP-seq signal \u00b1 SEM") str(density.profile.by.group, max.level = 1, give.attr = F)
# Load the example matrix list data("deeptools.matrix", package = "Rseb") # Generate the density profile by group density.profile.by.group <- Rseb::plot.density.profile( matrix.file = deeptools.matrix, signal.type = "mean", error.type = "sem", plot.by.group = T, missing.data.as.zero = T, y.identical.auto = T, text.size = 10, axis.line.width = 0.25, line.width = 0.5, plot.error = T, write.reference.points = T, plot.vertical.lines = F, colors = c("steelblue", "mediumseagreen"), n.row.multiplot = 2, by.row = T, y.lab = "Mean ChIP-seq signal \u00b1 SEM") # Generate the density profile by sample density.profile.by.sample <- Rseb::plot.density.profile( matrix.file = deeptools.matrix, signal.type = "mean", error.type = "sem", plot.by.group = FALSE, missing.data.as.zero = TRUE, y.identical.auto = TRUE, text.size = 10, axis.line.width = 0.25, line.width = 0.5, plot.error = TRUE, write.reference.points = TRUE, plot.vertical.lines = FALSE, colors = rep(c("indianred", "darkgoldenrod2"), 2), line.type = rep(c("solid", "dotted"), each = 2), n.row.multiplot = 1, legend.position = c(0.2, 0.75), y.lab = "Mean ChIP-seq signal \u00b1 SEM") cowplot::plot_grid(density.profile.by.group$multiplot, density.profile.by.sample$multiplot, nrow = 2, labels = "AUTO", rel_heights = c(2, 1))
Therefore, below in this section, three different methods to infer and visualize the difference between two samples will be explored: the density summary, the area difference and the correlation plot.
All these methods import a score matrix generated, for instance, by deepTools or by the density.matrix
function.
Noteworthy, in all cases it is possible to carryout the analyses narrowing down the range on which compute the scores by specifying the range to use in the "subset.range" parameter without the need to re-compute the matrix.
For both this two methods a t-test (parametric) or Wilcoxon-test (non parametric) could be automatically performed, whereas a test on the distributions cold be manually performed by Kruskal-Wallis test using the values stored in the data.table in the output.
plot.density.summary
which shows, through a violin plot, the distribution of the scores over each region in a given group/sample. An option allows to add a mean symbol with the corresponding SD/SEM. Furthermore, the statistical comparison could be added directly above each violin plot either as "stars" or numeric format. As well as for a boxplot, on each violin plot three lines indicate respectively the 25^th^, 50^th^ and 75^th^ percentile of the data distribution, respectively.
Since the original size of the regions could not be the same it, is preferable to use as score on which compute the analyses the sum of the score over each region (equivalent to the area under the curve of a density profile).
The output of this function is composed of a list of single plots, as well as a multiple plot (multiplot) with the distribution of the scores of the different samples per each group or alternatively the scores on different regions per each sample or group. The output list provides a summary plot with all the region groups for each sample, or alternatively all the samples for each region group. Finally, also a table with all the comparisons with the respective significance is available.
Here an example of the output list structure:
data("deeptools.matrix", package = "Rseb") summary.plot.by.group <- Rseb::plot.density.summary( matrix.file = deeptools.matrix, plot.by.group = TRUE, missing.data.as.zero = TRUE, signal.type = "sum", stat.paired = FALSE, stat.hide.ns = FALSE, axis.line.width = 0.5, mean.symbol.size = 0.2, y.identical.auto = TRUE, text.size = 10, border.width = 0.25, subset.range = c(-1000, 1000), #bp from 0 point colors = c("steelblue", "mediumseagreen"), legend.position = "right", n.row.multiplot = 2, by.row = TRUE) str(summary.plot.by.group, max.level = 1, give.attr = F)
# Load the example matrix list data("deeptools.matrix", package = "Rseb") # Generate the density profile by group summary.plot.by.group <- Rseb::plot.density.summary( matrix.file = deeptools.matrix, plot.by.group = TRUE, missing.data.as.zero = TRUE, signal.type = "sum", stat.paired = FALSE, stat.hide.ns = FALSE, axis.line.width = 0.5, mean.symbol.size = 0.2, y.identical.auto = TRUE, text.size = 10, border.width = 0.25, subset.range = c(-1000, 1000), #bp from 0 point colors = c("steelblue", "mediumseagreen"), legend.position = "right", n.row.multiplot = 2, by.row = TRUE) # Generate the density profile by sample summary.plot.by.sample <- Rseb::plot.density.summary( matrix.file = deeptools.matrix, plot.by.group = FALSE, missing.data.as.zero = TRUE, signal.type = "sum", stat.paired = FALSE, stat.hide.ns = FALSE, axis.line.width = 0.5, mean.symbol.size = 0.2, y.identical.auto = TRUE, text.size = 10, border.width = 0.25, subset.range = c(-1000, 1000), #bp from 0 point colors = c("steelblue", "mediumseagreen"), legend.position = "right", n.row.multiplot = 2, by.row = TRUE) cowplot::plot_grid(summary.plot.by.group$multiplot, summary.plot.by.sample$multiplot, nrow = 1, labels = "AUTO", rel_widths = c(3, 2))
cowplot::plot_grid(summary.plot.by.group$summary.plot.samples, summary.plot.by.group$summary.plot.regions, nrow = 2, labels = "AUTO")
summary.plot.by.sample$means.comparisons
knitr::kable(summary.plot.by.sample$means.comparisons, row.names = F, caption = "**Example of comparisons table for the plots by sample**")
plot.density.differences
. Also in this case the input of the function is a score matrix generated by computeMatrix.deeptools
or density.matrix
and it is possible to scale-down to a specific range the final plot by the option "subset.range" without the need to re-compute the matrix..
# Load the example matrix list data("deeptools.matrix", package = "Rseb") # Generate the correlation.plot correlation.plot <- Rseb::plot.density.differences(matrix.file = deeptools.matrix, inverted.comparisons = T, missing.data.as.zero = T, signal.type = "sum", stat.paired = T, points.size = 0.25, axis.line.width = 0.25, area.y.identical.auto = T, text.size = 10, correlation.show.equation = T, correlation.correlation.line.width = 0.5, correlation.correlation.line.color = "#ED8141", subset.range = c(-1000, 1000), #bp from 0 point colors = c("steelblue", "mediumseagreen", "purple"), legend.position = c(0.8, 0.25), error.type = "sem") cowplot::plot_grid(plotlist = Rseb::uniform.y.axis(Rseb::uniform.x.axis(correlation.plot$correlation.plot.byGroup.list)), nrow = 2, byrow = T)
plot.density.differences
, but in this case instead of the population shift the regions are ranked by the value of the area/signal subtraction. Then the closest to 0 ("no difference") rank position is calculated and its value subtracted to all the rank positions. In this way on the plot it is possible to know the number of regions with a negative (lower signal) or positive (higher signal) score. This representation allows firstly to know how many regions have an enriched signal (Sample 1) compared to those that have an enrichment of the other compared signal (Sample 2), secondly setting the axis to the same ranges it is possible to observe whether de differences between two samples for a region set compared to another are wide or rather narrowed around the 0 (no difference between the samples signal), beside eventual outliers that could distort the mean in the density profile.
# Load the example matrix list data("deeptools.matrix", package = "Rseb") # Generate the area.plot areaDifference.plot <- Rseb::plot.density.differences(matrix.file = deeptools.matrix, inverted.comparisons = T, missing.data.as.zero = T, signal.type = "sum", stat.paired = T, points.size = 0.25, axis.line.width = 0.25, area.y.identical.auto = T, text.size = 10, subset.range = c(-1000, 1000), #bp from 0 point colors = c("steelblue", "mediumseagreen", "purple"), legend.position = c(0.2, 0.80), error.type = "sem") cowplot::plot_grid(plotlist = Rseb::uniform.y.axis(areaDifference.plot$area.plot.byGroup.list), nrow = 2, byrow = T)
The function IGVsnap
accepts as input both a list of gene symbols (the genomic localization will be retrieved using biomaRt
) or a list of loci in any format accepted by IGV. As output will be generated a text file with a script that can be run on IGV: IGV > Tools > Run Batch Script... > chose your batch script file generated by IGVsnap
.
To chose the tracks that will be used there are two possibilities:
# Define the gene list gene_list <- c("Spi1", "Idh1", "Bcl2l11", "Mcl1", "Polr2a", "Hdac1") # Generate the script Rseb::IGVsnap(loci_vector = gene_list, input_type = "genes", biomart = "ensembl", dataset = "mmusculus_gene_ensembl", reference_genome = "mm10", fivePrime = 1000, threePrime = 1000, snap_names = gene_list, IGV_batch_file = "path/to/output/script.txt", snap_directory = "path/to/directory/in/wich/the/images/will/be/generated", snap_image_format = "png", maxPanelHeight = 1500)
qPCR.rna.exp
directly on results in excel file format generated by the qPCR devices (it could be required to indicate the position number of the excel sheet to use, using the parameter results.sheet.position
). However, it is also possible to provide a data.frame with at least two columns: Target Name
(e.g., primers), Sample Name
(e.g. samples), CT
[all must include space and capital letters as indicated].
A mandatory parameter is the list of "housekeeping" genes that will be used for the computation of the normalized expression. On the other hand, also fold changes (FC) over a reference sample are computed. The reference sample if it is not provided by the user will be automatically defined as the first one appearing in the list of all the samples. Notably, specific samples (e.g., blank the negative controls) could be excluded using the parameter exclude.samples
.
In the list resulting form the analyses there is a plot showing the technical replicates detected in the data. Technical replicates that as a difference greater than a certain threshold could be excluded setting the parameter max.delta.reps
(default value 0.5)
# Load the data data_to_analyse <- Rseb::qPCR.results.rep1 print(data_to_analyse)
data_to_analyse = data.frame(Rseb::qPCR.results.rep1, check.names = F) knitr::kable(head(data_to_analyse, n = 10), row.names = F, caption = '**Example of RT-qPCR data.frame for RNA expression analyses**')
# Run the analyses analyses <- Rseb::qPCR.rna.exp(results.file = data_to_analyse, housekeeping.genes = c("geneB", "housekeeping"), reference.sample = "Ctrl", max.delta.reps = 0.5)
cat(paste0( " analyses\n", " ¦--original.table\n", " ¦--reshaped.table\n", " ¦--reshaped.table.cleaned\n", " ¦--reps.validation.plot\n", " ¦--analyzed.data\n", " ¦ ¦--geneB\n", " ¦ ¦--housekeeping\n", " ¦ °--mean_FC_housekeeping\n", " ¦--expression.plots\n", " ¦ ¦--geneB\n", " ¦ °--housekeeping\n", " °--foldChange.plots\n", " ¦--geneB\n", " ¦--housekeeping\n", " °--mean_FC_housekeeping"))
knitr::kable(head(analyses$reshaped.table.cleaned, n = 15), row.names = F, caption = '**RT-qPCR analyses results: `reshaped.table.cleaned`**')
print(analyses$reps.validation.plot)
knitr::kable(head(analyses$analyzed.data$housekeeping, n = 15), row.names = F, caption = '**RT-qPCR analyses results: `analyzed.data$housekeeping`**')
cowplot::plot_grid(analyses$expression.plots$housekeeping, cowplot::plot_grid(analyses$foldChange.plots$housekeeping, analyses$foldChange.plots$mean.FC.housekeeping, labels = c("B", "C")), labels = c("A",""), nrow = 2)
qPCR.ran.exp
objects can be submitted to the function qPCR.rna.mean.reps
.
The function will automatically define which are the housekeeping genes and the reference sample using the list names or the column names of the tables in the qPCR.ran.exp
objects. However, if the reference sample used are multiple and/or not shared among replicates, the first sample in the order is used as reference. The reference sample can be also redefined manually during the averaging: all the Fold Changes will be recomputed. In all cases a message indicating which reference sample has been used is printed at each time (it will be always the first sample in the fold change plots).
Also in this case, specific samples can be excluded from the analyses (e.g., blank negative control).
In the following example two different analyses are run starting from the same data (rep1 and rep2). Then the two replicates are merged by qPCR.rna.mean.reps
.
# Run the analyses for the two replicates rep1 <- Rseb::qPCR.rna.exp(results.file = Rseb::qPCR.results.rep1, housekeeping.genes = c("geneB", "housekeeping"), reference.sample = "Ctrl", max.delta.reps = 0.5) rep2 <- Rseb::qPCR.rna.exp(results.file = Rseb::qPCR.results.rep2, housekeeping.genes = "housekeeping", reference.sample = "sample_1", max.delta.reps = 0.3) # Compute the average of the two replicates mean_reps <- Rseb::qPCR.rna.mean.reps(reps.list = list(rep1, rep2))
cat(paste0( " mean_reps\n", " ¦--mean.reps.data.table\n", " ¦ ¦--geneB\n", " ¦ ¦--housekeeping\n", " ¦ °--mean_FC.means\n", " ¦--mean.exp.plots\n", " ¦ ¦--geneB\n", " ¦ °--housekeeping\n", " °--mean.reps.FC.plots\n", " ¦--geneB\n", " ¦--housekeeping\n", " °--mean_FC.means"))
knitr::kable(head(mean_reps$mean.reps.data.table$housekeeping, n = 15), row.names = F, caption = '**RT-qPCR merged replicates: `mean.reps.data.table$housekeeping`**')
cowplot::plot_grid(mean_reps$mean.reps.exp.plots$housekeeping, mean_reps$mean.reps.FC.plots$mean_FC.means, labels = c("AUTO"), nrow = 1)
uniform.x.axis
and uniform.y.axis
have been used to re-format and uniform the axis among a list of plots. These functions allow to re-set the maximum, minimum, ticks interval and number formatting of the plot axes.
uniformed.list.of.plots <- Rseb::uniform.x.axis(plot.list = list.of.plots, x.min = 0, x.max = NA, ticks.each = 0.5, digits = 1)
For this function could be useful to combine multiple lists of plots in a unique one using the function combine.lists
of this package. The latter, from an input list of list, generates an output list result of the combination of all the lists contained in the provided input.
grep
in the shell allows to filter all the rows/lines of a table/file containing a specific pattern in any position. The function grepl.data.frame
from this package mimics the behavior of the shell function in R and applies it to a data.frame. It returns a logic vector of length equal to the rows number of the data.frame in which TRUE
indicates that the pattern have been found in the corresponding line. This vector than could be used to filter the data.frame rows such as using dplyr::filter()
function.
In the following example is shown a dummy table of copy number variation (CNV) individuated in a cohort of patients for a list of genes. For a given gene in a certain patient if a CNV has been found it will be indicated the type of the CNV ("gain", "loss", "gain/loss") otherwise an NA
is used.
# Loading the table data("CNV.data", package = "Rseb") print(CNV.data)
data("CNV.data", package = "Rseb") knitr::kable(CNV.data[round(runif(n = 10, min = 1, max = nrow(CNV.data))),], row.names = F, caption = "**Example of CNV annotation table**")
Now it is possible to filter only the lines containing, in any position, the pattern "gain/loss" using the function grepl.data.frame
.
CNV.gain.loss <- dplyr::filter(.data = CNV.data, Rseb::grepl.data.frame(data.frame = CNV.data, pattern = "gain/loss")) print(CNV.gain.loss)
CNV.gain.loss = dplyr::filter(.data = CNV.data, Rseb::grepl.data.frame(data.frame = CNV.data, pattern = "gain/loss")) knitr::kable(CNV.gain.loss, row.names = F, caption = '**CNV annotation table filtered for the pattern "gain/loss"**')
The R-package have been published on GitHub and a git-pages website is available as well. At both these sites it is possible to find the installation procedure, required dependencies, and the links for changeLog, manual and vignette.
Sebastian Gregoricchio sebastian.gregoricchio@gmail.com
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.