R/normalize_global.R

Defines functions normalize_global

Documented in normalize_global

#' Calculate Normalization Parameters and Apply Global Normalization
#'
#' Calculates normalization parameters based on the data using the specified
#' subset and normalization functions with option to apply the normalization to
#' the data.
#'
#' @param omicsData an object of the class 'pepData', 'proData', 'metabData',
#'   'lipidData', 'nmrData', created by \code{\link{as.pepData}},
#'   \code{\link{as.proData}}, \code{\link{as.metabData}},
#'   \code{\link{as.lipidData}}, \code{\link{as.nmrData}}, respectively. The
#'   function \code{\link{group_designation}} must have been run on omicsData to
#'   use several of the subset functions (i.e. rip and ppp_rip).
#' @param subset_fn character string indicating the subset function to use for
#'   normalization. See details for the current offerings.
#' @param norm_fn character string indicating the normalization function to use
#'   for normalization. See details for the current offerings.
#' @param params additional arguments passed to the specified subset function.
#'   See details for parameter specification and default values.
#' @param apply_norm logical argument indicating if the normalization should be
#'   applied to the data. Defaults to FALSE. If TRUE, the normalization is
#'   applied to the data and an S3 object of the same class as \code{omicsData}
#'   (e.g. 'pepData') with normalized values in \code{e_data} is returned. If
#'   FALSE, the normalization is not applied to the data and an S3 object of
#'   class 'normRes' is returned.
#' @param backtransform logical argument indicating if parameters for back
#'   transforming the data, after normalization, should be calculated. Defaults
#'   to FALSE. If TRUE, the parameters for back transforming the data after
#'   normalization will be calculated, and subsequently included in the data
#'   normalization if \code{apply_norm} is TRUE. See the details section for an 
#'   explanation of how these factors are calculated.
#' @param min_prop numeric threshold between 0 and 1 giving the minimum value
#'   for the proportion of biomolecules subset (rows of \code{e_data})
#' @param check.names deprecated
#'
#' @details Below are details for specifying function and parameter options.
#' @section Subset Functions: Specifying a subset function indicates the subset
#'   of biomolecules (rows of \code{e_data}) that should be used for computing
#'   normalization factors. The following are valid options: "all", "los",
#'   "ppp", "complete", "rip", and "ppp_rip". The option "all" is the subset
#'   that includes all biomolecules (i.e. no subsetting is done). The option 
#'   "los" identifies the subset of the biomolecules associated with the top 
#'   \code{L} order statistics, where \code{L} is a proportion between 0 and 1.
#'   Specifically, the biomolecules falling within the top \code{L} proportion of highest
#'   absolute abundance are retained for each sample, and the union of these
#'   biomolecules is taken as the subset identified (Wang et al., 2006). The option
#'   "ppp" (originally stands for percentage of peptides present) identifies the
#'   subset of biomolecules that are present/non-missing for a minimum
#'   \code{proportion} of samples (Karpievitch et al., 2009; Kultima et al.,
#'   2009). The option "complete" retains molecules with no missing data across
#'   all samples, equivalent to "ppp" with proportion = 1. The option "rip"
#'   identifies biomolecules with complete data that have a p-value greater than a
#'   defined threshold \code{alpha} (common values include 0.1 or 0.25) when
#'   subjected to a Kruskal-Wallis test based (non-parametric one-way ANOVA) on
#'   group membership (Webb-Robertson et al., 2011). The option "ppp_rip" is
#'   equivalent to "rip" however rather than requiring biomolecules with complete
#'   data, biomolecules with at least a \code{proportion} of non-missing values are
#'   subject to the Kruskal-Wallis test.
#'
#' @section Normalization Functions: Specifying a normalization function
#'   indicates how normalization scale and location parameters should be
#'   calculated. The following are valid options: "median", "mean", "zscore",
#'   and "mad". For median centering, the location estimates are the sample-wise
#'   medians of the subset data and there are no scale estimates. For mean
#'   centering, the location estimates are the sample-wise means of the subset
#'   data and there are no scale estimates. For z-score transformation, the
#'   location estimates are the subset means for each sample and the scale
#'   estimates are the subset standard deviations for each sample. For median
#'   absolute deviation (MAD) transformation, the location estimates are the
#'   subset medians for each sample and the scale estimates are the subset MADs
#'   for each sample.
#'
#' @section Specifying Subset Parameters Using the \code{params} Argument:
#'   Parameters for the chosen subset function should be specified in a list
#'   with the function specification followed by an equal sign and the desired
#'   parameter value. For example, if LOS with 0.1 is desired, one should use
#'   \code{params = list(los = 0.1)}. ppp_rip can be specified in one of two
#'   ways: specify the parameters with each separate function or combine using a
#'   nested list (e.g. \code{params = list(ppp_rip = list(ppp = 0.5, rip =
#'   0.2))}).
#'
#'   The following functions have parameters that can be specified:
#'   \tabular{ll}{ los \tab a value between 0 and 1 indicating the top
#'   proportion of order statistics. Defaults to 0.05 if unspecified. \cr \tab
#'   \cr ppp \tab a value between 0 and 1 specifying the proportion of samples
#'   that must have non-missing values for a biomolecule to be retained. Defaults to
#'   0.5 if unspecified. \cr \tab \cr rip \tab a value between 0 and 1
#'   specifying the p-value threshold for determining rank invariance. Defaults
#'   to 0.2 if unspecified. \cr \tab \cr ppp_rip \tab two values corresponding
#'   to the RIP and PPP parameters above. Defaults to 0.5 and 0.2, respectively.
#'   \cr }
#'
#' @section Backtransform: The purpose of back transforming data is to ensure
#'   values are on a scale similar to their raw values before normaliztion. The
#'   following values are calculated and/or applied for backtransformation
#'   purposes: \tabular{ll}{ \code{median} \tab scale is NULL and location
#'   parameter is a global median across all samples \cr \tab \cr \code{mean}
#'   \tab scale is NULL and location parameter is a global median across all
#'   samples \cr \tab \cr \code{zscore} \tab scale is pooled standard deviation
#'   and location is global mean across all samples \cr \tab \cr \code{mad} \tab
#'   scale is pooled median absolute deviation and location is global median
#'   across all samples \cr }
#'
#' @return If apply_norm is FALSE, an S3 object of type 'normRes' is returned.
#'   This object contains a list with: subset method, normalization method,
#'   normalization parameters, number of biomolecules used in normalization, and
#'   proportion of biomolecules used in normalization. plot() and summary() methods
#'   are available for this object. If apply_norm is TRUE, then the normalized
#'   data is returned in an object of the appropriate S3 class (e.g. pepData).
#'
#' @examplesIf requireNamespace("pmartRdata", quietly = TRUE)
#' library(pmartRdata)
#'
#' mymetab <- edata_transform(
#'   omicsData = metab_object,
#'   data_scale = "log2"
#' )
#' mymetab <- group_designation(
#'   omicsData = mymetab,
#'   main_effects = "Phenotype"
#' )
#' norm_object <- normalize_global(
#'   omicsData = mymetab,
#'   subset_fn = "all",
#'   norm_fn = "median"
#' )
#' norm_data <- normalize_global(
#'   omicsData = mymetab,
#'   subset_fn = "all",
#'   norm_fn = "median",
#'   apply_norm = TRUE,
#'   backtransform = TRUE
#' )
#'
#' @author Lisa Bramer
#'
#' @references Webb-Robertson BJ, Matzke MM, Jacobs JM, Pounds JG, Waters KM. A
#'   statistical selection strategy for normalization procedures in LC-MS
#'   proteomics experiments through dataset-dependent ranking of normalization
#'   scaling factors. Proteomics. 2011;11(24):4736-41.
#'
#' @export
#'
normalize_global <- function(omicsData, subset_fn, norm_fn, params = NULL,
                             apply_norm = FALSE, backtransform = FALSE,
                             min_prop = NULL, check.names = NULL) {
  if (!missing(check.names))
    warning("check.names parameter is deprecated")

  ## initial checks ##

  # Ensure omicsData is an appropriate class
  if (!inherits(omicsData, c(
    "pepData", "proData", "lipidData", "metabData",
    "nmrData"
  ))) {
    # Throw down an error for blatant abuse of the pmartR standards.
    stop(paste("omicsData must be of class 'pepData', 'proData', 'lipidData',",
      "'metabData' or 'nmrData'",
      sep = " "
    ))
  }

  # data should be log transformed #
  if (!(get_data_scale(omicsData) %in% c("log2", "log10", "log"))) {
    # Tell the user to read the documentation before running functions willy
    # nilly.
    stop(paste("omicsData$e_data should be log transformed prior to calling",
      "normalize_data. See documentation for the edata_transform",
      "function for more information.",
      sep = " "
    ))
  }

  # Inconceivable!
  if (!is.logical(apply_norm)) stop("apply_norm must be TRUE or FALSE.")
  if (!is.logical(backtransform)) stop("backtransform must be TRUE or FALSE.")

  # check that min_prop is greater than 0 but less than or equal to 1
  if (!is.null(min_prop)) {
    if (!is.numeric(min_prop)) {
      # Halt! You are trying to use a number as a character!!
      stop("min_prop must be numeric.")
    }

    if (min_prop < 0 || min_prop >= 1) {
      # Stop the user for not understanding what a proportion is.
      stop("min_prop must be greater than zero but less than or equal to 1")
    }
  }

  edata_id <- get_edata_cname(omicsData)
  samp_id <- get_fdata_cname(omicsData)

  # Use default normalization and subsetting if not specified
  if (missing(subset_fn)) stop("subset_fn wasn't specified")
  if (missing(norm_fn)) stop("norm_fn wasn't specified")

  # check for valid subset function choice #
  if (!(subset_fn %in% c("all", "los", "ppp", "rip", "ppp_rip", "complete"))) {
    # There is no hope!!!
    stop(paste(subset_fn, " is not a valid subset option", sep = ""))
  }

  # Check if subset_fn is "all" or "complete" and params is not NULL.
  if (subset_fn %in% c("all", "complete") && !is.null(params)) {
    # Throw an error because "all" and "complete" do not require any parameters
    # to be specified.
    stop(paste("The subset functions 'all' and 'complete' do not require",
      "params to be specified. Either set params = NULL or change",
      "subset_fn to another option.",
      sep = " "
    ))
  }

  # check for valid normalization function choice #
  if (!(norm_fn %in% c("mean", "median", "zscore", "mad"))) {
    # It is really quite simple to select one of the four options.
    stop(paste(norm_fn, " is not a valid normalization option", sep = ""))
  }

  # check for group information #
  grp_pres = length(grep("group_DF", names(attributes(omicsData))))

  # check that group designation was run if "rip" is involved
  if (subset_fn %in% c("rip", "ppp_rip") && grp_pres == 0) {
    stop(paste("group_designation() must be run on the data if subset_fn is",
      "'rip' or 'ppp_rip'",
      sep = " "
    ))
  }

  # check that parameter was specified for subset functions other than all #
  if (subset_fn != "all") {
    # ppp_rip #
    if (subset_fn == "ppp_rip") {
      # set default parameter values, will be overwritten if user specifies #
      params_ppp = 0.5
      params_rip = 0.2

      # check that parameters are either specified with "ppp_rip" or separate
      # options, if both are specified, error out #
      if (length(names(params)) > 1 && "ppp_rip" %in% names(params)) {
        stop(paste("Too many arguments in 'params'. Specify either",
          "'ppp_rip = ' or 'ppp = ' and 'rip = '",
          sep = " "
        ))
      }

      # Check if the parameters were specified with ppp. If it wasn't the
      # default value will be used.
      if ("ppp" %in% names(params)) {
        if (!is.numeric(params$ppp)) stop("The ppp parameter must be numeric.")

        # check that parameters are valid #
        if (params$ppp < 0 || params$ppp > 1) {
          # STOP!! JUST STOP!!!
          stop("The ppp parameter must be between 0 and 1")
        }

        # Change the ppp value to the one the user specified.
        # IF they got it right.
        params_ppp = params$ppp
      }

      # Check if the parameters were specified with rip. If it wasn't the
      # default value will be used.
      if ("rip" %in% names(params)) {
        if (!is.numeric(params$rip)) stop("The rip parameter must be numeric.")

        # Make sure rip is correctly specified. (You can't ever trust the user.)
        if (params$rip < 0 || params$rip > 1) {
          # *SIGH*
          stop("The rip parameter must be between 0 and 1")
        }

        # Change the rip value to the one the user specified.
        params_rip = params$rip
      }

      # Check if the parameters were specified using the ppp_rip option.
      if ("ppp_rip" %in% names(params)) {
        # check that at least one of the two parameters is specified #
        if (sum(c("ppp", "rip") %in% names(params[["ppp_rip"]])) < 1) {
          stop("Invalid parameter specification for 'ppp_rip'")
        }

        # Check if ppp was specified. If it wasn't the default value will be
        # used.
        if ("ppp" %in% names(params[["ppp_rip"]])) {
          if (!is.numeric(params[["ppp_rip"]]$ppp)) {
            stop("The ppp parameter must be numeric.")
          }

          # check that parameters are valid #
          if (params[["ppp_rip"]]$ppp < 0 || params[["ppp_rip"]]$ppp > 1) {
            stop("The ppp parameter must be between 0 and 1")
          }

          # Change the ppp value to the one the user specified.
          # IF they got it right.
          params_ppp = params[["ppp_rip"]]$ppp
        }

        # Check if rip was specified.
        if ("rip" %in% names(params)) {
          if (!is.numeric(params[["ppp_rip"]]$rip)) {
            stop("The rip parameter must be numeric.")
          }

          # Check that rip is a valid number.
          if (params[["ppp_rip"]]$rip < 0 || params[["ppp_rip"]]$rip > 1) {
            stop("The rip parameter must be between 0 and 1")
          }

          # Change the ppp value to the one the user specified.
          params_rip = params[["ppp_rip"]]$rip
        }
      }

      # An option other than "all" or "ppp_rip" was input.
    } else {
      # Set to NULL in case the user specified a subset function that requires a
      # value for params but they did NOT specify a value for param.
      param_val = NULL

      # Check if params was specified. If it was make sure it matches the subset
      # function and that it is an appropriate value.
      if (!is.null(params)) {
        # Make sure the appropriate parameters are present for the given the
        # subset function.
        if (!subset_fn %in% names(params)) {
          # The user is trying to mix apples and oranges again.
          stop(paste("The parameters specified in params do not match the",
            "subset function.",
            sep = " "
          ))
        }

        # Check if the name of the subset function is also the name specified in
        # the params argument.
        if (subset_fn %in% names(params)) {
          if (!is.numeric(params[[subset_fn]])) {
            stop("The specified parameter must be numeric.")
          }

          # Make sure they got the value right.
          if (params[[subset_fn]] < 0 || params[[subset_fn]] > 1) {
            stop("Specified parameter must be between 0 and 1")
          }

          # Set the value to the one the user input.
          param_val = params[[subset_fn]]
        }
      }
    }
  }

  # check that apply_norm is T/F #
  if (!inherits(apply_norm, "logical")) {
    # Stop the illogical user in their tracks!
    stop("apply_norm must be a logical argument")
  }

  # check that backtransform is T/F #
  if (!inherits(backtransform, "logical")) {
    # Stop the illogical user again!!
    stop("backtransform must a logical argument")
  }

  # Subset the data prior to normalization -------------------------------------

  # subset data using current subset method #
  if (subset_fn == "all") {
    # keep all biomolecule IDs.
    peps = all_subset(omicsData$e_data, edata_id)
  } else if (subset_fn == "los") {
    # Check if the user specified a cutoff value.
    if (!is.null(param_val)) {
      # Apply the users cutoff value.
      peps = los(omicsData$e_data, edata_id, param_val)

      # No cutoff value was specified by the user.
    } else {
      # Use the default value to subset.
      peps = los(omicsData$e_data, edata_id)
    }
  } else if (subset_fn == "rip") {
    # Extricate the group information.
    group_df = attr(omicsData, "group_DF")

    # Check if the user specified a cutoff value.
    if (!is.null(param_val)) {
      # Apply the users cutoff value.
      peps = rip(omicsData$e_data, edata_id, samp_id, group_df, param_val)

      # No cutoff value was specified by the user.
    } else {
      # Use the default value to subset.
      peps = rip(omicsData$e_data, edata_id, samp_id, group_df)
    }
  } else if (subset_fn == "complete") {
    # Only keep rows with non-missing values for all samples.
    peps = complete_mols(omicsData$e_data, edata_id)
  } else if (subset_fn == "ppp") {
    # Check if the user specified a cutoff value.
    if (!is.null(param_val)) {
      # Apply the users cutoff value.
      peps = ppp(omicsData$e_data, edata_id, param_val)
    } else {
      # Use the default value to subset.
      peps = ppp(omicsData$e_data, edata_id)
    }
  } else if (subset_fn == "ppp_rip") {
    # Separate the group attribute from the pack.
    group_df = attr(omicsData, "group_DF")

    # Use the parameters defined above (either the default values or the values
    # set by the user) to subset the data.
    peps = ppp_rip(omicsData$e_data,
      edata_id,
      samp_id,
      group_df,
      alpha = params_rip,
      proportion = params_ppp
    )
  }

  # Normalize the data ---------------------------------------------------------

  # Pull out the name of the normalization function from the input.
  fn_to_use <- switch(norm_fn,
    mean = mean_center,
    median = median_center,
    zscore = zscore_transform,
    mad = mad_transform
  )

  # Normalize the data according to the method selected.
  norm_results <- fn_to_use(
    e_data = omicsData$e_data,
    edata_id = edata_id,
    subset_fn = subset_fn,
    feature_subset = peps,
    backtransform = backtransform,
    apply_norm = apply_norm
  )

  # Calculate the proportion of features used to calculate the normalization
  # parameters.
  prop_features_calc <- length(peps) / nrow(omicsData$e_data)

  # If min_prop is specified check if it is below the actual proportion of
  # features used for normalizing.
  if (!is.null(min_prop) && (prop_features_calc < min_prop)) {
    # Throw an error because the proportion of features used for normalizing is
    # less than the threshold.
    stop(paste("The minimum proportion of biomolecules allowed (min_prop) was",
      "specified as", min_prop, "but the actual proportion of",
      "biomolecules used to calculate the normalization parameters using",
      "the given subset function (subset_fn) was",
      round(prop_features_calc, 3), "hence normalization was not",
      "carried out.",
      sep = " "
    ))
  }

  # Check if the normalization will be applied. If return all of the information
  # needed to normalize (which biomolecules to normalize with, normalization
  # method and parameters, ...).
  if (apply_norm == FALSE) {
    # Return all of the information needed to normalize the data. This will be
    # turned into its own class.
    res = list(
      subset_fn = subset_fn,
      norm_fn = norm_fn,
      parameters = list(
        normalization = norm_results$norm_params,
        backtransform = norm_results$backtransform_params
      ),
      n_features_calc = length(peps),
      feature_subset = peps,
      prop_features_calc = prop_features_calc
    )

    # Create a normRes object.
    class(res) = "normRes"

    # Add the input data as an attribute to the normRes object.
    attributes(res)$omicsData = omicsData

    # Return the normRes object.
    return(res)

    # This will run if apply_norm is TRUE.
  } else {
    # Update e_data with the normalized data.
    omicsData$e_data = norm_results$transf_data

    # Update the norm_info attribute.
    attributes(omicsData)$data_info$norm_info <- list(
      is_normalized = TRUE,
      norm_type = "global", # added 12/21/17 by KS
      subset_fn = subset_fn,
      subset_params = params,
      norm_fn = norm_fn,
      n_features_calc = length(peps),
      prop_features_calc = prop_features_calc,
      params = list(
        norm_scale = norm_results$norm_params$scale,
        norm_location = norm_results$norm_params$location,
        bt_scale = norm_results$backtransform_params$scale,
        bt_location = norm_results$backtransform_params$location
      )
    )
  }

  # Return the normalized omicsData object with all of its updated attributes!
  return(omicsData)
}
pmartR/pmartR documentation built on April 24, 2024, 10:22 p.m.