R/plot.density.summary.R

Defines functions plot.density.summary

Documented in plot.density.summary

#' @title Plot the distribution of overall NGS density at specific regions from deepTools matrices.
#'
#' @description Computes the score of each element in a list of regions and generates violins plots with percentiles and the mean (optional) for each sample/region. It uses as input a score matrix computed by deeptools's computeMatrix function or by \link{computeMatrix.deeptools} and \link{density.matrix} functions from this package.
#'
#' @param matrix.file A single string indicating a full path to a matrix.gz file generated by \code{deepTools/computeMatrix} or by \link{computeMatrix.deeptools}, or a list generated by the function \link{read.computeMatrix.file} or \link{density.matrix}.
#' @param plot.by.group Logical value to define whether plot by group of regions or by sample. By default \code{TRUE}.
#' @param missing.data.as.zero Logical value to define whether treat missing data as 0. If set as \code{FALSE} missing data will be converted to \code{NA} and will be excluded from the computations of the signal. By default \code{TRUE}.
#' @param sample.names Samples names could be defined by a string vector. If set as \code{NULL} sample names will be get automatically by the matrix file. By default \code{NULL}. \cr Example: \code{c("sample1", "sample2", "sample3")}
#' @param region.names Region names could be defined by a string vector. If set as \code{NULL} sample names will be get automatically by the matrix file. By default \code{NULL}. \cr Example: \code{c("regionA", "regionB")}
#' @param signal.type String indicating the signal to be computed and plotted. Available parameters are "mean", "median" and "sum". By default "mean".
#' @param linear Logical value to define whether the plots should show the score in linear scale. By default \code{FALSE}.
#' @param error.type String indicating the type of error to be computed and that will be available in the output data.table. Available parameters are "sem" and "sd", standard error mean and standard deviation respectively. By default "sem". Parameter considered only when \code{show.mean = TRUE)}.
#' @param show.mean Logical value to define whether the mean value should be shown as a symbol on the plots. By default \code{TRUE}.
#' @param mean.error.type String indicating the type of error for the mean to be computed. Available parameters are "se", "sd" and, "none". Respectively standard error, standard deviation, and no error plotted. By default "se". Parameter considered only when \code{show.mean = TRUE)}.
#' @param mean.color A single string expressing an R-supported color for the mean symbol. By default \code{"blue"}.
#' @param mean.symbol.shape A numeric value or string defining the shape for the mean symbol. By default \code{20}.
#' @param mean.symbol.size A numeric value defining the size of the mean symbol. By default \code{1}.
#' @param show.stat.multiplot Logical value to define if to add to the plot the statistical comparisons of the means for the groups present in the multiplot. By default \code{TRUE}. All possibile comparisons will be performed.
#' @param stat.method A single string defining the method to use for the statistical comparisons. By default \code{"wilcox.test"}. Available options: "t.test" "wilcox.test".
#' @param stat.paired Logical value to define if the statistical comparisons should be performed paired. By default \code{"FALSE"}. Notice that to allow a paired comparison the number of data should be the same in the two groups compared, so in the most of the cases non applicable to the comparisons between two regions. Used only in \code{"t.test"} and \code{"wilcox.test"} methods.
#' @param stat.labels.format A single string indicating the format of the p-value to show for the statistical comparisons. By default \code{"p.signif"}. Available options: "p.format" (normal p-value), "p.signif" (significance stars), "p.adj" (p-value adjusted).
#' @param stat.hide.ns Logical value indicating if the NS ("Not Significant") comparisons should be shown or not. By default \code{TRUE}.
#' @param stat.p.levels A list containing the p-values levels/thresholds in the following format (default): \code{list(cutpoints = c(0, 0.0001, 0.001, 0.01, 0.05, 1), symbols = c("****", "***", "**", "*", "ns"))}. In other words, we use the following convention for symbols indicating statistical significance:
#' \itemize{
#'   \item \code{ns}: p > 0.05
#'   \item \code{*} p <= 0.05
#'   \item \code{**} p <= 0.01
#'   \item \code{***} p <= 0.001
#'   \item \code{****} p <= 0.0001
#'  }
#' @param title Title of each plot could be defined by a string vector. If set as \code{NULL} titles will be generated automatically. By default \code{NULL}. \cr Example: \code{c("Title1", "Title2")}
#' @param x.lab Single string or string vector to define the X-axis label for all the plots. By default \code{NULL}, the label will be defined automatically.
#' @param y.lab Single string or string vector to define the Y-axis label for all the plots. By default \code{NULL}, the label will be defined automatically.
#' @param x.labs.angle A single numeric value indicating the degrees of rotation of the category labels in the X-axis. By default \code{0}, horizontal without rotation.
#' @param dodge.width Numeric value defining the width of each single violin plot. By default \code{1}.
#' @param border.width Numeric value to define the border width for all the violin plots. By default \code{0.5}.
#' @param border.color A single string indicating the color to use for the border of the violin plots. By default \code{"#000000"} (full black).
#' @param transparency A numeric value to define the fraction of transparency of the plots fill (0 = transparent, 1 = full). By default \code{0.5}.
#' @param subset.range A numeric vector indicating the range to which restrict the analyses (eg. c(-150, 250)). In the case of "scale-region" mode, the range is represented by (-upstream | 0 | body_length | body_length+downstream).By default \code{NULL}: the whole region is considered.
#' @param y.lim List of numeric vectors with two elements each to define the range of the Y-axis. To set only one side use NA for the side to leave automatic. If only one range is given this one will be applied to all the plots. By default \code{NULL}, the range will be defined automatically. \cr Example \code{list(c(0, 20), c(NA, 30), c(0, NA), c(NA, NA))}.,
#' @param y.identical.auto Logical value to define whether use the same Y-axis range for all the plots automatically depending on the values. Not used when \code{y.lim} is not \code{NULL}. By default \code{TRUE}.
#' @param y.ticks.interval A number indicating the interval/bin spacing two ticks on the Y-axis. By default \code{NULL}: ticks are assigned automatically. Active only when \code{y.identical.auto = TRUE} and \code{y.lim != NULL}.
#' @param y.digits A numeric value to define the number of digits to use for the y.axis values. By default \code{1} (eg. 1.5).
#' @param axis.line.width Numeric value to define the axes and ticks line width for all plots. By default \code{0.5}.
#' @param text.size Numeric value to define the size of the text for the labels of all the plots. By default 12.
#' @param legend.position Any ggplot supported value for the legend position (eg. "none, "top", "bottom", "left", "right", c(fraction.x, fraction.y)). By default \code{c(0.2, 0.85)}.
#' @param colors Vector to define the line and error area colors. If only one value is provided it will applied to all the samples/groups. If the number of values is lower than the the required one, a random set of colors will be generated. All standard R.colors values are accepted. By default \code{c("#00A5CF", "#F8766D", "#AC88FF", "#E08B00", "#00BA38", "#BB9D00", "#FF61C9", "gray30")}.
#' @param n.row.multiplot Numeric value to define the number of rows in the final multiplot.
#' @param multiplot.export.file If a string with the name of a PDF file is provided the multiplot will be exported. By default \code{NULL}.
#' @param real.width.single.violinplot Numeric value, in inches, to define the real width (not precise) of each single violin plot in the multiplot exported, if required. By default 1 inch.
#' @param real.height.single.violinplot Numeric value, in inches, to define the real height (not precise) of each single violin plot in the multiplot exported, if required. By default 3.5 inches.
#' @param by.row Logical value to define whether the plots should be arranged by row. By default \code{TRUE}.
#' @param print.multiplot Logical value to define whether to print the multiplot once generated. By default \code{FALSE}.
#'
#' @return The function returns a list containing:
#' \itemize{
#'   \item \code{data.table} with the computed values used for the plot;
#'   \item \code{metadata} table with the information obtained from the matrix_file.gz;
#'   \item \code{plot.list} with a plot for each list element;
#'   \item \code{density.profile} with the density profile of the mean signal generated by \link{plot.density.profile} corresponding to the regions/samples for which the summary multiplot have been generated;
#'   \item \code{multiplot} with the image of all the plots together;
#'   \item \code{summary.plot.samples} with a plot showing the scores of all regions per each sample;
#'   \item \code{summary.plot.regions} with a plot showing the scores of all samples per each region;
#'   \item \code{means.comparisons} table with the statistical means comparisons (when \code{show.stat.multiplot = TRUE}, otherwise a string is returned).
#'  }
#'
#'
#' @details To know more about the deepTools's function \code{computeMatrix} see the package manual at the following link: \cr \url{https://deeptools.readthedocs.io/en/develop/content/tools/computeMatrix.html}.
#'
#' @export plot.density.summary
#'
# @import dplyr
# @import ggplot2
# @importFrom data.table fread
# @importFrom ggpubr compare_means stat_compare_means
# @importFrom stringr str_split
# @importFrom robustbase colMedians
# @importFrom matrixStats colSds
# @importFrom purrr reduce
# @importFrom cowplot plot_grid
# @importFrom tools toTitleCase

plot.density.summary = function(
  matrix.file,
  plot.by.group = T,
  missing.data.as.zero = NULL,
  sample.names = NULL,
  region.names = NULL,
  signal.type = "mean", # type of signal computation c("mean", "median", "sum")
  linear = F,
  error.type = "sem",
  show.mean = T,
  mean.error.type = "se",
  mean.color = "blue",
  mean.symbol.shape = 20,
  mean.symbol.size = 1,
  show.stat.multiplot = T,
  stat.method = "wilcox.test",
  stat.paired = F,
  stat.labels.format = "p.signif",
  stat.hide.ns = T,
  stat.p.levels = list(cutpoints = c(0, 0.0001, 0.001, 0.01, 0.05, 1),
                       symbols = c("****", "***", "**", "*", "ns")),
  title = NULL, # string vector
  x.lab = NULL, # single string vector
  y.lab = NULL, # string vector
  x.labs.angle = 0,
  dodge.width = 1,
  border.width = 0.5, # can be a vector for different lines
  border.color = "#000000",
  transparency = 0.5,
  subset.range = NULL,
  y.lim = NULL, # list of vectors c(min, max) or single vector
  y.identical.auto = T, # subordinated to y.lim which has the priority
  y.ticks.interval = NULL, # only for 'y.identical.auto'
  y.digits = 1,
  axis.line.width = 0.5,
  text.size = 12,
  legend.position = c(0.2, 0.85),
  colors = c("#00A5CF", "#F8766D", "#AC88FF", "#E08B00", "#00BA38", "#BB9D00", "#FF61C9", "gray30"),
  n.row.multiplot = 1,
  multiplot.export.file = NULL,
  real.width.single.violinplot = 1, #inch
  real.height.single.violinplot = 3.5, #inch
  by.row = TRUE,
  print.multiplot = F) {

  ##############################################################################
  ## ------------------------- BEGIN OF THE FUNCTION ------------------------ ##
  ##############################################################################

  # Load all libraries
  require(dplyr)
  require(ggplot2)
  # require(ggpubr)
  # require(data.table)
  # require(stringr)
  # require(robustbase)
  # require(matrixStats)
  # require(purrr)
  # require(cowplot)
  # require(tools)

  # Check if Rseb is up-to-date #
  Rseb::actualize(update = F, verbose = F)

  ##############################################################################
  ## Convert and check y.lim and title
  ### Check that the parameters are vectors or lists
  if (!is.null(y.lim) & !(class(y.lim) %in% c("numeric", "list"))) {
    return(warning("The 'y.lim' parameter must be a vector or list of numeric vectors c(min, max) or NULL. \nTo set just one boundary replace a number with NA."))
  }


  ### Check whether parameters are single vectors and convert them to 1-element lists eventually
  if (!is.null(y.lim) & class(y.lim) == "numeric") {y.lim = list(y.lim)}


  ### Check x.lab and y.lab class
  if (!is.null(title) & class(title) != "character") {return(warning("The 'title' parameter must be a single string or a string vector."))}
  if (!is.null(x.lab) & class(x.lab) != "character") {return(warning("The 'x.lab' parameter must be a single string or a string vector."))}
  if (!is.null(y.lab) & class(y.lab) != "character") {return(warning("The 'y.lab' parameter must be a single string or a string vector."))}


  ### Check the size and type of each element in the lists
  if (!is.null(y.lim)) {
    for (i in 1:length(y.lim)) {
      if (length(y.lim[[i]]) != 2 | class(y.lim[[i]]) != "numeric") {
        return(warning("One ore more vectors of the 'y.lim' list do not have exactly 2 elements and/or are not numeric vectors."))}
    }
  }

  ### Check title vector
  if (!is.null(title) & class(title) != "character") {return(warning("The 'title' parameter must be a string vector or NULL."))}


  ### Check that the signal method is allowed
  if (class(signal.type) != "character" | !(signal.type %in% c("mean", "median", "sum")) | length(signal.type) != 1) {return(warning("The 'siganl.type' parameter must be one among [mean, median, sum]."))}

  ### Check colors
  if (!Rseb::is.color(mean.color) | length(mean.color) != 1) {return(warning("The 'mean.color' parameter must be a single R-supported color string."))}
  if (F %in% Rseb::is.color(colors)) {return(warning("The 'colors' parameter must be a single string or a vector of strings including R-supported colors."))}

  ### Check errors parameters
  if (length(error.type == 1)) {
    if (!(error.type %in% c("sem", "sd", "none"))) {
      return(warning("The 'error.type' must be one among 'sem' or 'sd'."))
    }
  } else {return(warning("The 'error.type' must be a single string among 'sem' or 'sd'."))}

  if (length(mean.error.type == 1)) {
    if (!(mean.error.type %in% c("se", "sd", "none"))) {
      return(warning("The 'mean.error.type' must be one among 'se', 'sd' or 'none'."))
    }
  } else {return(warning("The 'mean.error.type' must be a single string among 'se', 'sd' or 'none'."))}

  # Check stat.method
  if (!(stat.method %in% c("t.test", "wilcox.test")) | length(stat.method) != 1 | class(stat.method) != "character") {
    return(warning("The 'stat.method' parameter must be one among 't.test' and 'wilcox.test'")) }

  ##############################################################################
  # Import/read the matrix.gz file
  if (class(matrix.file) == "character" & length(matrix.file) == 1) {
    m = Rseb::read.computeMatrix.file(matrix.file)
    metadata = m$metadata
    matrix.data = m$matrix.data
  } else if (class(matrix.file) == "list" & length(matrix.file) == 3 & unique(names(matrix.file) == c("metadata", "matrix.data", "original.file.path"))) {
    metadata = matrix.file$metadata
    matrix.data = matrix.file$matrix.data
  } else {
    return(warning("The 'matrix.file' must be a single string indicating a full path to a matrix.gz file generated by deepTools/computeMatrix or by Rseb::computeMatrix.deeptools(), or a list generated by the function Rseb::read.computeMatrix.file."))
  }

  # Generate metadata variables
  sample_start_column = as.numeric(as.vector(stringr::str_split(string = dplyr::filter(.data = metadata, parameters == "sample_boundaries")$values, pattern = ",")[[1]]))
  sample_names = as.vector(stringr::str_split(string = dplyr::filter(.data = metadata, parameters == "sample_labels")$values, pattern = ",")[[1]])

  group_start_row = as.numeric(as.vector(stringr::str_split(string = dplyr::filter(.data = metadata, parameters == "group_boundaries")$values, pattern = ",")[[1]]))
  group_names = as.vector(stringr::str_split(string = dplyr::filter(.data = metadata, parameters == "group_labels")$values, pattern = ",")[[1]])

  ##############################################################################
  # Get the number at which the matrix data starts (after 'chr' 'start' 'end' 'strand' etc.)
  number.info.columns = ncol(matrix.data) - sample_start_column[length(sample_start_column)]

  # Remove NaN values and substitute them with 0 if missing.data.as.zero = TRUE or NA if FALSE
  missing.data.as.zero.automatic = as.logical(toupper(dplyr::filter(.data = metadata, parameters == "missing_data_as_zero")$values))

  if (is.null(missing.data.as.zero)) {
    missing.data.as.zero = missing.data.as.zero.automatic
  } else if (is.logical(missing.data.as.zero)) {
    if (missing.data.as.zero != missing.data.as.zero.automatic) {
      message(paste("The parameter 'missing.data.as.zero' set in this function as ", as.character(missing.data.as.zero),
                    ", differs from the automatic obtained from the matrix file which is set as ", as.character(missing.data.as.zero.automatic),
                    ".\nMissing data will be treated as set in this function: 'missing.data.as.zero = ", as.character(missing.data.as.zero), "'.", sep = ""))}
  } else {return(warning("The parameter 'missing.data.as.zero' must be a logical value or NULL."))}

  is.nan.data.frame = function(data.frame) do.call(cbind, lapply(data.frame, is.nan))

  matrix.data[is.nan(matrix.data)] = ifelse(test = missing.data.as.zero,
                                            yes = 0,
                                            no = NA)

  message(ifelse(test = missing.data.as.zero,
                 yes = "Missing data have been converted to 0 since the parameter 'missing.data.as.zero' is TRUE.",
                 no = "Missing data have been converted to NA since the parameter 'missing.data.as.zero' is FALSE. \nNA values will be excluded from the statistical computations."))

  ##############################################################################
  # Generate a list with a table per sample
  samples.table.list = list()

  # Change sample and group names by custom ones
  if (!is.null(sample.names)) {
    if (length(unique(sample.names)) == length(sample_names)) {
      sample_names = sample.names
    } else {message("Default sample.names have been used because custom valus are not unique and/or incomplete")}
  }

  if (!is.null(region.names)) {
    if (length(unique(region.names)) == length(group_names)) {
      group_names = region.names
    } else {message("Default region.names have been used because custom valus are not unique and/or incomplete")}
  }

  for (i in 1:length(sample_names)) {
    # define the limits of the sample keeping the first lines
    start.col = sample_start_column[i] + number.info.columns + 1
    end.col = sample_start_column[i+1] + number.info.columns

    sample.table =
      matrix.data %>%
      dplyr::select(c(1:number.info.columns, start.col:end.col)) %>%
      # adding a column with the groups names
      mutate(group = rep(group_names,
                         times = c(group_start_row[-1] - group_start_row[-length(group_start_row)])))

    # Add the single table to a list
    samples.table.list[[i]] = sample.table
  }

  names(samples.table.list) = sample_names

  ##############################################################################
  # Generate the statistical tables
  sample.stat.table.list = list()

  for (s in 1:length(sample_names)) {
    group.stats.table.list = list()

    # x-axes values, distance from center
    upstream = as.numeric(as.vector(stringr::str_split(string = dplyr::filter(.data = metadata, parameters == "upstream")$values, pattern = ",")[[1]]))[s]
    downstream = as.numeric(as.vector(stringr::str_split(string = dplyr::filter(.data = metadata, parameters == "downstream")$values, pattern = ",")[[1]]))[s]
    body = as.numeric(as.vector(stringr::str_split(string = dplyr::filter(.data = metadata, parameters == "body")$values, pattern = ",")[[1]]))[s]
    bin = as.numeric(as.vector(stringr::str_split(string = dplyr::filter(.data = metadata, parameters == "bin_size")$values, pattern = ",")[[1]]))[s]
    distance.range = seq(from = -upstream, to = (body + downstream), by = bin)
    distance.range = distance.range[distance.range != 0]

    # Computations of the stats per group in order to then merge the different group tables
    for (g in 1:length(group_names)) {
      group.table =
        samples.table.list[[s]] %>%
        filter(group == group_names[g]) %>%
        dplyr::select(-c(1:number.info.columns), -group)

      # Subset the group table
      colnames(group.table) = distance.range

      if (!is.null(subset.range)) {
        subset.distance = distance.range[distance.range >= range(subset.range)[1]]
        subset.distance = subset.distance[subset.distance <= range(subset.range)[2]]
        group.table = group.table[, (colnames(group.table) %in% as.character(subset.distance))]
      }

      group.mean = rowMeans(group.table, na.rm = T)
      group.median = robustbase::rowMedians(as.matrix(group.table), na.rm = T)
      group.sum = rowSums(group.table, na.rm = T)
      group.sd = matrixStats::rowSds(as.matrix(group.table), na.rm = T)
      group.sem = group.sd / sqrt(ncol(group.table))

      group.stats.table.list[[g]] =
        data.frame(group = group_names[g],
                   mean.by.region = group.mean,
                   median.by.region = group.median,
                   sum.by.region = group.sum,
                   sd.by.region = group.sd,
                   sem.by.region = group.sem)
    } # 'g' For end


    sample.stat.table.list[[s]] =
      purrr::reduce(.x = group.stats.table.list, .f = rbind) %>%
      mutate(sample = sample_names[s])
  } # 's' For end


  # Merge of all the tables in a single one
  full.stat.table = purrr::reduce(.x = sample.stat.table.list, .f = rbind)
  full.stat.table =
    Rseb::move.df.col(data.frame = full.stat.table, move.command = "sample first") %>%
    dplyr::mutate(sample = factor(sample, levels = sample_names),
                  group = factor(group, levels = group_names))


  #############################################################################
  # Generation of the plots
  plot.list = list()
  comparisons.table = list()

  # # Remove error plotting option when using the 'sum' mode, since it means nothing for a sum
  # if (signal.type == "sum") {
  #   plot.error = F
  #   message("The error will not be plotted since mode 'sum' is used.")
  # }

  # Function to check the length of y.lim, title, etc.
  check.parameter.length =
    function(parameter, parameter.name, plots.number) {
      if (!is.null(parameter) & length(parameter) != plots.number & length(parameter) > 1) {
        message(paste("The parameter", parameter.name, "has been set as NULL since the number of elements was different from the number of plots required."))
        return(NULL)}
      else {return(parameter)}
    }

  # Function to generate the comparisons list for the statistics test
  comparisons = function(table, score_column, groups_column, hide.ns, stat.p.levels, stat.method, stat.paired) {
    # Reshape the table
    tb = table[,c(score_column, groups_column)]
    names(tb) = c("score", "groupsToCompare")

    # Generate the comparisons
    comp = ggpubr::compare_means(formula = score ~ groupsToCompare,
                                 data = tb,
                                 symnum.args = stat.p.levels,
                                 method = stat.method,
                                 paired = stat.paired)
    comparisons_data = data.frame(comp)

    if (hide.ns == T) {comp = dplyr::filter(.data = comp, tolower(p.signif) != "ns")}

    # Generate the comparisons
    if (nrow(comp) != 0) {
      comp_list = list()
      for (i in 1:nrow(comp)) {
        comp_list[[i]] = c(comp$group1[i], comp$group2[i])
      }
    } else {comp_list = list()}


    return(list(comparisons_data = comparisons_data,
                comparisons_to_plot = comp_list))
  }


  ###########  SINGLE PLOTS FOR EACH CONDITION  ##############

  # Define the dodge to align the mean to the violins
  dodge = position_dodge(width = dodge.width)

  # Define the range used
  range_used = ifelse(test = is.null(subset.range),
                      yes = paste("range: [", range(distance.range)[1], ";", range(distance.range)[2], "] bp", sep = ""),
                      no = paste("range: [", range(subset.range)[1], ";", range(subset.range)[2], "] bp", sep = ""))

  ### Generate a plot by group (each plot is a group, each signal/line is a sample)
  if (plot.by.group == T) {

    # Define the colors for the samples
    if ((length(colors) > 1 & length(colors) < length(sample_names)) | length(colors) == 0) {colors = rainbow(length(sample_names))}


    for (i in 1:length(group_names)) {
      # Repeat the value of y.lim, x.lab, y.lab, title, parameters if their length is 1
      if (!is.null(y.lim) & length(y.lim) == 1) {y.lim = lapply(1:length(group_names), function(j) y.lim[[1]])}
      if (!is.null(title) & length(title) == 1) {title = rep(title, length(group_names))}
      if (!is.null(x.lab) & length(x.lab) == 1) {x.lab = rep(x.lab, length(group_names))}
      if (!is.null(y.lab) & length(y.lab) == 1) {y.lab = rep(y.lab, length(group_names))}

      # Check the length of the y.lim, title, x/y.lab parameters
      y.lim = check.parameter.length(parameter = y.lim, parameter.name = "'y.lim'", plots.number = length(group_names))
      title = check.parameter.length(parameter = title, parameter.name = "'title'", plots.number = length(group_names))
      x.lab = check.parameter.length(parameter = x.lab, parameter.name = "'x.lab'", plots.number = length(group_names))
      y.lab = check.parameter.length(parameter = y.lab, parameter.name = "'y.lab'", plots.number = length(group_names))

      # Select only current group subtable
      current.table = full.stat.table %>% filter(group == group_names[i])

      # Define which value to use for the scores
      if (signal.type == "mean") {
        score = current.table$mean.by.region} else if (signal.type == "median") {
          score = current.table$median.by.region} else if (signal.type == "sum") {
            score = current.table$sum.by.region} else {
              return(warning("The signal.type must be one among 'mean', 'median', 'sum'."))
            }

      # Define which value to use for error
      if (error.type != "none") {
        if (error.type == "sem") {
          error = current.table$sem.by.region} else if (error.type == "sd") {
            error = current.table$sd.by.region}
      } else if (error.type == "none") {
        error = current.table$sem.by.region
      } else {return(warning("The error.type must be one between 'sem', 'sd'."))}

      # Add a column with the values to be used as generic variable name
      current.table =
        current.table %>%
        mutate(score = score,
               error = error)


      # Convert score to log2(score+1) if required, assign y_label (used when y.lab == NULL)
      if (linear == F) {
        current.table$score = log2(current.table$score + 1)
        y_label = bquote(atop(paste('log'['2']*'(', .(signal.type), ' density signal + 1)'),
                              .(range_used)))
      } else {y_label = bquote(atop(paste(.(stringr::str_to_title(signal.type)), ' density signal'),
                                    .(range_used)))}


      y_label = if (is.null(y.lab)) {y_label = y_label} else {y.lab[i]}


      # Plot saving in the list
      plot.list[[i]] =
        ggplot(data = current.table,
               aes(x = sample,
                   y = score,
                   fill = sample)) +
        geom_violin(alpha = transparency,
                    linetype = "dashed",
                    draw_quantiles = c(0.25, 0.75),
                    position = dodge,
                    size = border.width,
                    color = border.color) +
        geom_violin(alpha = 0,
                    draw_quantiles = 0.5,
                    position = dodge,
                    size = border.width,
                    color = border.color) +
        xlab(ifelse(test = is.null(x.lab),
                    yes = "Samples",
                    no = x.lab[i])) +
        ylab(y_label) +
        scale_fill_manual(values = colors) +
        ggtitle(ifelse(test = is.null(title),
                       yes = group_names[i],
                       no = title[i])) +
        theme_classic() +
        theme(text = element_text(size = text.size),
              axis.text = element_text(color = "#000000"),
              axis.text.x = element_text(angle = x.labs.angle,
                                         hjust = ifelse(test = x.labs.angle == 0,
                                                        yes = 0.5,
                                                        no = 1),
                                         vjust = ifelse(test = x.labs.angle == 0,
                                                        yes = 0.5,
                                                        no = 1)),
              axis.title = element_text(color = "#000000"),
              axis.line = element_line(color = "#000000", size = axis.line.width),
              axis.ticks = element_line(color = "#000000", size = axis.line.width),
              title = element_text(color = "#000000"),
              legend.title = element_text(color = "#000000"),
              legend.text = element_text(color = "#000000"),
              legend.position = legend.position,
              legend.background = element_blank())


      # Change y.lim if required
      if (!is.null(y.lim[[i]])) {
        plot.list[[i]] = plot.list[[i]] + ylim(y.lim[[i]])
      }

      # if show.mean == T add the mean spot
      if (show.mean == T) {
        if (mean.error.type != "none") {
          plot.list[[i]] =
            plot.list[[i]] +
            stat_summary(fun.data = ifelse(test = mean.error.type == "se",
                                           yes = "mean_se",
                                           no = "mean_sdl"),
                         col = mean.color,
                         size = mean.symbol.size,
                         show.legend = F,
                         shape = mean.symbol.shape,
                         geom = "pointrange")
        } else {
          plot.list[[i]] =
            plot.list[[i]] +
            stat_summary(fun = "mean",
                         col = mean.color,
                         size = mean.symbol.size,
                         show.legend = F,
                         shape = mean.symbol.shape,
                         geom = "pointrange")
        }

        # Compute the comparisons, and if show.stat.multiplot == T add the p-values bars
        if (length(unique(current.table$sample)) > 1) {
          my_comparisons = comparisons(current.table, "score", "sample", hide.ns = stat.hide.ns, stat.p.levels = stat.p.levels, stat.method = stat.method, stat.paired = stat.paired)
          comparisons.table[[i]] =
            my_comparisons$comparisons_data %>%
            mutate(.y. = signal.type,
                   region = unique(current.table$group))
          comparisons.table[[i]] = Rseb::move.df.col(data.frame = comparisons.table[[i]], move.command = "region first")
          colnames(comparisons.table[[i]]) = c("region", "signal.type", "sample.1", "sample.2", "p", "p.adj", "p.format", "p.signif", "method")

        } else if (length(unique(current.table$sample)) == 1) {
          comparisons.table[[i]] = data.frame(region = unique(current.table$group),
                                              signal.type = signal.type,
                                              sample.1 = unique(current.table$sample),
                                              sample.2 = NA,
                                              p = NA, p.adj = NA, p.format = NA, p.signif = NA, method = NA)
        }

        ## Add significant bars/p-stars
        if (show.stat.multiplot == T & length(unique(current.table$sample)) > 1) {
          if (length(my_comparisons$comparisons_to_plot) >= 1) {
            plot.list[[i]] =
              plot.list[[i]] +
              ggpubr::stat_compare_means(comparisons = my_comparisons$comparisons_to_plot,
                                         paired = stat.paired,
                                         method = stat.method,
                                         symnum.args = stat.p.levels,
                                         label = stat.labels.format,
                                         hide.ns = stat.hide.ns,
                                         size = text.size * 0.6,
                                         color = "#000000")
          }
        } # END all stat comparisons

      }

      # Add a name to the plot, element in the list
      names(plot.list)[i] = group_names[i]

    } # for loop end

    ########
    ### Generate a plot by sample (each plot is a sample, each signal/line is a group)
  } else { # if end

    # Define the colors for each group
    if ((length(colors) > 1 & length(colors) < length(group_names)) | length(colors) == 0) {colors = rainbow(length(group_names))}

    for (i in 1:length(sample_names)) {
      # Repeat the value of y.lim, x.lab, y.lab, title, parameters if their length is 1
      if (!is.null(y.lim) & length(y.lim) == 1) {y.lim = lapply(1:length(sample_names), function(j) y.lim[[1]])}
      if (!is.null(title) & length(title) == 1) {title = rep(title, length(sample_names))}
      if (!is.null(x.lab) & length(x.lab) == 1) {x.lab = rep(x.lab, length(sample_names))}
      if (!is.null(y.lab) & length(y.lab) == 1) {y.lab = rep(y.lab, length(sample_names))}

      # Check the length of the y.lim, title, x/y.lab parameters
      y.lim = check.parameter.length(parameter = y.lim, parameter.name = "'y.lim'", plots.number = length(sample_names))
      title = check.parameter.length(parameter = title, parameter.name = "'title'", plots.number = length(sample_names))
      x.lab = check.parameter.length(parameter = x.lab, parameter.name = "'x.lab'", plots.number = length(sample_names))
      y.lab = check.parameter.length(parameter = y.lab, parameter.name = "'y.lab'", plots.number = length(sample_names))

      # Select only current sample subtable
      current.table = full.stat.table %>% filter(sample == sample_names[i])

      # Define which value to use for the scores
      if (signal.type == "mean") {
        score = current.table$mean.by.region} else if (signal.type == "median") {
          score = current.table$median.by.region} else if (signal.type == "sum") {
            score = current.table$sum.by.region} else {
              return(warning("The signal.type must be one among 'mean', 'median', 'sum'"))
            }

      # Define which value to use for error
      if (error.type != "none") {
        if (error.type == "sem") {
          error = current.table$sem.by.region} else if (error.type == "sd") {
            error = current.table$sd.by.region}
      } else if (error.type == "none") {
        error = current.table$sem.by.region
      } else {return(warning("The error.type must be one between 'sem' or 'sd'."))}

      # Add a column with the values to be used as generic variable name
      current.table =
        current.table %>%
        mutate(score = score,
               error = error)

      # Convert score to log2(score+1) if required, assign y_label (used when y.lab == NULL)
      if (linear == F) {
        current.table$score = log2(current.table$score + 1)
        y_label = bquote(atop(paste('log'['2']*'(', .(signal.type), ' density signal + 1)'),
                              .(range_used)))
      } else {y_label = bquote(atop(paste(.(stringr::str_to_title(signal.type)), ' density signal'),
                                    .(range_used)))}

      y_label = if (is.null(y.lab)) {y_label = y_label} else {y.lab[i]}

      # Plot saving in the list
      plot.list[[i]] =
        ggplot(data = current.table,
               aes(x = group,
                   y = score,
                   fill = group)) +
        geom_violin(alpha = transparency,
                    linetype = "dashed",
                    draw_quantiles = c(0.25, 0.75),
                    position = dodge,
                    size = border.width,
                    color = border.color) +
        geom_violin(alpha = 0,
                    draw_quantiles = 0.5,
                    position = dodge,
                    size = border.width,
                    color = border.color) +
        xlab(ifelse(test = is.null(x.lab),
                    yes = "Regions",
                    no = x.lab[i])) +
        ylab(y_label) +
        scale_fill_manual(values = colors) +
        ggtitle(ifelse(test = is.null(title),
                       yes = sample_names[i],
                       no = title[i])) +
        theme_classic() +
        theme(text = element_text(size = text.size),
              axis.text = element_text(color = "#000000"),
              axis.text.x = element_text(angle = x.labs.angle,
                                         hjust = ifelse(test = x.labs.angle == 0,
                                                        yes = 0.5,
                                                        no = 1),
                                         vjust = ifelse(test = x.labs.angle == 0,
                                                        yes = 0.5,
                                                        no = 1)),
              axis.title = element_text(color = "#000000"),
              axis.line = element_line(color = "#000000", size = axis.line.width),
              axis.ticks = element_line(color = "#000000", size = axis.line.width),
              title = element_text(color = "#000000"),
              legend.title = element_text(color = "#000000"),
              legend.text = element_text(color = "#000000"),
              legend.position = legend.position,
              legend.background = element_blank())


      # Change y.lim if required
      if (!is.null(y.lim[[i]])) {
        plot.list[[i]] = plot.list[[i]] + ylim(y.lim[[i]])
      }

      # if show.mean == T add the mean spot
      if (show.mean == T) {
        if (mean.error.type != "none") {
          plot.list[[i]] =
            plot.list[[i]] +
            stat_summary(fun.data = ifelse(test = mean.error.type == "se",
                                           yes = "mean_se",
                                           no = "mean_sdl"),
                         col = mean.color,
                         size = mean.symbol.size,
                         show.legend = F,
                         shape = mean.symbol.shape,
                         geom = "pointrange")
        } else {
            plot.list[[i]] =
              plot.list[[i]] +
              stat_summary(fun = "mean",
                           col = mean.color,
                           size = mean.symbol.size,
                           show.legend = F,
                           shape = mean.symbol.shape,
                           geom = "pointrange")
        }

        # Compute the comparisons, and if show.stat.multiplot == T add the p-values bars
        if (length(unique(current.table$group)) > 1) {
          my_comparisons = comparisons(current.table, "score", "group", hide.ns = stat.hide.ns, stat.p.levels = stat.p.levels, stat.method = stat.method, stat.paired = stat.paired)
          comparisons.table[[i]] =
            my_comparisons$comparisons_data %>%
            mutate(.y. = signal.type,
                   sample = unique(current.table$sample))
          comparisons.table[[i]] = Rseb::move.df.col(data.frame = comparisons.table[[i]], move.command = "sample first")
          colnames(comparisons.table[[i]]) = c("sample", "signal.type", "region.1", "region.2", "p", "p.adj", "p.format", "p.signif", "method")

        } else if (length(unique(current.table$group)) == 1) {
          comparisons.table[[i]] = data.frame(sample = unique(current.table$sample),
                                              signal.type = signal.type,
                                              region.1 = unique(current.table$group),
                                              region.2 = NA,
                                              p = NA, p.adj = NA, p.format = NA, p.signif = NA, method = NA)
        }

        if (show.stat.multiplot == T & length(unique(current.table$group)) > 1) {
          if (length(my_comparisons$comparisons_to_plot) >= 1) {
            plot.list[[i]] =
              plot.list[[i]] +
              ggpubr::stat_compare_means(comparisons = my_comparisons$comparisons_to_plot,
                                         paired = stat.paired,
                                         method = stat.method,
                                         symnum.args = stat.p.levels,
                                         label = stat.labels.format,
                                         hide.ns = stat.hide.ns,
                                         size = text.size * 0.6,
                                         color = "#000000")
          }
        }  # END all stat comparisons

      }

      # Add a name to the plot, element in the list
      names(plot.list)[i] = sample_names[i]

    } # for end
  } # else end


  # Set the same Y-axes for all the plots, if required by 'y.identical.auto = T'
  if (y.identical.auto == T & is.null(y.lim) & length(plot.list) > 1) {
    plot.list = Rseb::uniform.y.axis(plot.list = plot.list, ticks.each = y.ticks.interval, digits = y.digits)
  }


  ##############################################################################
  # generate the multiplot
  multiplot = cowplot::plot_grid(plotlist = plot.list, nrow = n.row.multiplot, byrow = by.row, align = "hv")

  # Print multiplot if required
  if (print.multiplot == T) {print(multiplot)}

  # Export the file if required
  n_violins = ifelse(test = plot.by.group == T,
                     yes = length(sample_names),
                     no = length(group_names))

  if (!is.null(multiplot.export.file)) {
    pdf(file = multiplot.export.file,
        height = real.height.single.violinplot * n.row.multiplot,
        width = (real.width.single.violinplot * n_violins) * ceiling(length(plot.list)/n.row.multiplot))
    print(multiplot)
    dev.off()
    message("Multiplot exported as: ", multiplot.export.file)
  }


  ##############################################################################
  # Generation of the summary plots

  ## Generate new table
  summary_table = full.stat.table

  ## Define which value to use for the scores
  if (signal.type == "mean") {
    score = summary_table$mean.by.region} else if (signal.type == "median") {
      score = summary_table$median.by.region} else if (signal.type == "sum") {
        score = summary_table$sum.by.region} else {
          return(warning("The signal.type must be one among 'mean', 'median', 'sum'."))
        }

  # Define which value to use for error
  if (error.type != "none") {
    if (error.type == "sem") {
      error = summary_table$sem.by.region} else if (error.type == "sd") {
        error = summary_table$sd.by.region}
    } else if (error.type == "none") {
      error = summary_table$sem.by.region
    } else {return(warning("The error.type must be one between 'sem' or 'sd'."))}


  # Add a column with the values to be used as generic variable name
  summary_table =
    summary_table %>%
    mutate(score = score,
           error = error)

  # Convert score to log2(score+1) if required, assign y_label_summary
  if (linear == F) {
    summary_table$score = log2(summary_table$score + 1)
    y_label_summary = bquote(atop(paste('log'['2']*'(', .(signal.type), ' density signal + 1)'),
                                 .(range_used)))
  } else {y_label_summary = bquote(atop(paste(.(stringr::str_to_title(signal.type)), ' density signal'),
                                        .(range_used)))}


  # Define the dodge to align the mean to the violins
  dodge = position_dodge(width = dodge.width)

  ###### Summary plot by group
  # Define the colors for each sample
  if ((length(colors) > 1 & length(colors) < length(sample_names)) | length(colors) == 0) {
    sample_colors = rainbow(length(group_names))} else {
      sample_colors = colors
    }

  summary.plot.groups =
    ggplot(data = summary_table,
           aes(x = group,
               y = score,
               fill = sample)) +
    geom_violin(alpha = transparency,
                linetype = "dashed",
                draw_quantiles = c(0.25, 0.75),
                position = dodge,
                size = border.width,
                color = border.color) +
    geom_violin(alpha = 0,
                draw_quantiles = 0.5,
                position = dodge,
                size = border.width,
                color = border.color) +
    xlab("Regions") +
    ylab(y_label_summary) +
    scale_fill_manual(values = sample_colors) +
    ggtitle("Summary plot by region") +
    theme_classic() +
    theme(text = element_text(size = text.size),
          axis.text = element_text(color = "#000000"),
          axis.text.x = element_text(angle = x.labs.angle,
                                     hjust = ifelse(test = x.labs.angle == 0,
                                                    yes = 0.5,
                                                    no = 1),
                                     vjust = ifelse(test = x.labs.angle == 0,
                                                    yes = 0.5,
                                                    no = 1)),
          axis.title = element_text(color = "#000000"),
          axis.line = element_line(color = "#000000", size = axis.line.width),
          axis.ticks = element_line(color = "#000000", size = axis.line.width),
          title = element_text(color = "#000000"),
          legend.title = element_text(color = "#000000"),
          legend.text = element_text(color = "#000000"),
          legend.position = legend.position,
          legend.background = element_blank())

  # if show.mean == T add the mean spot
  if (show.mean == T) {
    if (mean.error.type != "none") {
      summary.plot.groups =
        summary.plot.groups +
        stat_summary(fun.data = ifelse(test = mean.error.type == "se",
                                       yes = "mean_se",
                                       no = "mean_sdl"),
                     col = mean.color,
                     size = mean.symbol.size,
                     show.legend = F,
                     shape = mean.symbol.shape,
                     geom = "pointrange",
                     position = dodge)
    } else {
      summary.plot.groups =
        summary.plot.groups +
        stat_summary(fun = "mean",
                     col = mean.color,
                     size = mean.symbol.size,
                     show.legend = F,
                     shape = mean.symbol.shape,
                     geom = "pointrange",
                     position = dodge)
    }

  }


  ###### Summary plot by sample
  # Define the colors for each group
  if ((length(colors) > 1 & length(colors) < length(group_names)) | length(colors) == 0) {
    group_colors = rainbow(length(group_names))} else {
      group_colors = colors
    }

  summary.plot.samples =
    ggplot(data = summary_table,
           aes(x = sample,
               y = score,
               fill = group)) +
    geom_violin(alpha = transparency,
                linetype = "dashed",
                draw_quantiles = c(0.25, 0.75),
                position = dodge,
                size = border.width,
                color = border.color) +
    geom_violin(alpha = 0,
                draw_quantiles = 0.5,
                position = dodge,
                size = border.width,
                color = border.color) +
    xlab("Samples") +
    ylab(y_label_summary) +
    scale_fill_manual(values = group_colors) +
    ggtitle("Summary plot by sample") +
    theme_classic() +
    theme(text = element_text(size = text.size),
          axis.text = element_text(color = "#000000"),
          axis.text.x = element_text(angle = x.labs.angle,
                                     hjust = ifelse(test = x.labs.angle == 0,
                                                    yes = 0.5,
                                                    no = 1),
                                     vjust = ifelse(test = x.labs.angle == 0,
                                                    yes = 0.5,
                                                    no = 1)),
          axis.title = element_text(color = "#000000"),
          axis.line = element_line(color = "#000000", size = axis.line.width),
          axis.ticks = element_line(color = "#000000", size = axis.line.width),
          title = element_text(color = "#000000"),
          legend.title = element_text(color = "#000000"),
          legend.text = element_text(color = "#000000"),
          legend.position = legend.position,
          legend.background = element_blank())

  # if show.mean == T add the mean spot
  if (show.mean == T) {
    if (mean.error.type != "none") {
      summary.plot.samples =
        summary.plot.samples +
        stat_summary(fun.data = ifelse(test = mean.error.type == "se",
                                       yes = "mean_se",
                                       no = "mean_sdl"),
                     col = mean.color,
                     size = mean.symbol.size,
                     show.legend = F,
                     shape = mean.symbol.shape,
                     geom = "pointrange",
                     position = dodge)
    } else {
      summary.plot.samples =
        summary.plot.samples +
        stat_summary(fun = "mean",
                     col = mean.color,
                     size = mean.symbol.size,
                     show.legend = F,
                     shape = mean.symbol.shape,
                     geom = "pointrange",
                     position = dodge)
    }

  }

  ##############################################################################
  # RETURN a list of data
  ## generate the table with regions (from the matrix, equivalent to the bed used) to add then to the full.stat.table

  # Bed files from original matrix
  regions.position = matrix.data[, 1:6]
  names(regions.position) = c("chr", "start", "end", "name", "score", "strand")

  # Repeating the regions n times as the number of samples
  regions.position = purrr::reduce(.x = replicate(n = length(unique(full.stat.table$sample)),
                                                  expr = regions.position,
                                                  simplify = F),
                                   .f = rbind)


  # Generation of density profile
  density.profile = Rseb::plot.density.profile(matrix.file = matrix.file,
                                               plot.by.group = plot.by.group,
                                               missing.data.as.zero = missing.data.as.zero,
                                               sample.names = sample.names,
                                               region.names = region.names,
                                               signal.type = "mean",
                                               error.type = error.type,
                                               plot.error = T,
                                               x.lim = subset.range,
                                               axis.line.width = axis.line.width,
                                               y.lim = y.lim,
                                               y.identical.auto = F,
                                               y.ticks.interval = ,
                                               y.digits = y.digits,
                                               text.size = text.size,
                                               legend.position = legend.position,
                                               plot.vertical.lines = T,
                                               write.reference.points = F,
                                               colors = colors,
                                               n.row.multiplot = n.row.multiplot,
                                               by.row = by.row,
                                               print.multiplot = F)

  # Build the output list
  return(list(data.table = cbind(regions.position, full.stat.table),
              metadata = metadata,
              plot.list = plot.list,
              density.profile = density.profile$multiplot,
              multiplot = multiplot,
              summary.plot.samples = summary.plot.samples,
              summary.plot.regions = summary.plot.groups,
              means.comparisons = dplyr::mutate(.data = purrr::reduce(comparisons.table, rbind),
                                                paired = stat.paired)))


} # End function

################################################################################
#----------------------------- END OF FUNCTION --------------------------------#
################################################################################
sebastian-gregoricchio/Rseb documentation built on Sept. 4, 2024, 1:59 p.m.