R/jam_groupstosedesign.R

#' Create SEDesign from experimental groups
#'
#' Create SEDesign from experimental groups
#'
#' This function creates `SEDesign` with appropriate design
#' and contrasts, based upon experimental groups.
#' This approach will use multiple experimental factors
#' to create appropriate one-way and n-way contrasts,
#' where each contrast represents a symmetric comparison
#' of each independent factor.
#'
#' Input can be provided in one of two ways:
#'
#' 1. `SummarizedExperiment` where experiment design is derived from
#' `SummarizedExperiment::colData()` of the `se` object, and
#' uses columns defined by `group_colnames`. This input should be
#' equivalent to providing a `data.frame` whose `rownames()` are
#' equal to `colnames(se)`.
#' 2. `data.frame` where each column represents a design factor.
#'
#'     * An example of `data.frame` input:
#'    ```R
#'    ifactors <- data.frame(
#'       treatment=c("Control", "Control", "Treated", "Treated"),
#'       genotype=c("Wildtype", "Knockout", "Wildtype", "Knockout"))
#'    ```
#'
#' 3. `character` vector, where design factor levels are separated
#' by a delimiter such as underscore `"_"`. This input will be
#' converted to `data.frame` before processing.
#'
#'    * An example of `character` input:
#'    ```R
#'    ifactors <- c(
#'       "Control_Wildtype",
#'       "Control_Knockout",
#'       "Treated_Wildtype",
#'       "Treated_Knockout")
#'    ```
#'
#' When rownames are provided in the `data.frame`, or names
#' are provided with a `character` vector, they are retained
#' and used as sample identifiers.
#'
#' Note:
#' This function will change any `"-"` in a factor name to
#' `"."` prior to detecting valid contrasts, in order to
#' prevent confusion and potential problems using the
#' contrast names in downstream analyses.
#' This step does not call `base::make.names()`, so that
#' step should be run beforehand if required.
#'
#' ## Troubleshooting
#'
#' * When this function returns no contrasts, or returns an unexpected
#' error during processing, it is most likely due to the limitation
#' of comparing one factor at a time. For example, the logic will
#' not define contrast `time1_treatment1-time2_treatment2`, because
#' this contrast changes two factors, it will only permit either
#' `time1_treatment1-time1_treatment2` or `time1_treatment1-time2_treatment1`.
#' * `max_depth` and `factor_order` are used to define the order in
#' which factors are compared, but do not affect the order of factors
#' used for things like group names.
#'
#' @return `SEDesign` object with the following slots:
#'    * `design`: `numeric` matrix with sample-to-group association
#'    * `contrasts`: `numeric` matrix with group-to-contrast association
#'    * `samples`: `character` vector that represents individual sample
#'    replicates, equivalent to `rownames()` of the `design` matrix.
#'
#'
#' @param ifactors `data.frame` or `character` vector.
#'    * When `data.frame` is supplied, each column is used as a
#'    design factor, and rownames are recognized as sample identifiers.
#'    * When `character` vector is supplied, it is converted to
#'    `data.frame` by splitting values with a delimiter
#'    `factor_sep`, and names are recognized as sample identifiers.
#' @param group_colnames `character` vector or `NULL`, used to
#'    define a subset of columns to use when `ifactors` is supplied
#'    as a `data.frame`. When `ifactors` is supplied as a `character`
#'    vector, this argument is used to define the `colnames`.
#' @param isamples `character` vector or `NULL`, optionally used to subset
#'    the sample identifiers used in subsequent steps. Note that only
#'    groups and contrasts that contain samples will be defined.
#' @param idesign `numeric` matrix or `NULL`, intended as an optional
#'    method to use an existing design matrix.
#' @param factor_order `integer` or `character` vector, used to define a
#'    specific order of factors when generating contrasts,  useful
#'    when there are multiple experimental factors.
#'    It can be helpful to force a secondary factor to be
#'    compared before a primary factor especially in two-way contrasts.
#'    Note that `factor_order` refers to the columns (factors) and not
#'    the factor levels (not column values).
#' @param omit_grep `character` regular expression pattern used to
#'    exclude secondary factors from contrasts.
#' @param max_depth `integer` value indicating the maximum depth of
#'    statistical contrasts to create. For example `max_depth=2` will
#'    allow two-way contrasts, and `max_depth=1` will only create
#'    one-way contrasts.
#' @param factor_sep `character` string used as a
#'    delimiter to separate experimental factors, when recognizing
#'    or creating experimental group names.
#' @param contrast_sep `character` string used as a
#'    delimiter to separate groups within each contrast name.
#' @param remove_pairs `list` of `character` vectors of factors
#'    that should not be compared. Each `character` vector should
#'    contain two factor levels for any given experimental factor,
#'    where those two factor levels should not be compared in
#'    the same pairwise contrast. For example, consider an experimental
#'    factor defined `treatment <- c("control", "dex", "compoundx")`.
#'    To prevent a direct comparison of `"dex"` to `"compoundx"`,
#'    use argument `remove_pairs=list(c("dex", "compoundx"))`.
#' @param make_unique `logical` indicating whether to make output
#'    contrasts unique.
#' @param pre_control_terms `character` vector used to
#'    place factor levels first in the order of levels, so these
#'    terms will be the denominator for contrasts. This approach
#'    is useful when the input `ifactors` does not already contain
#'    a `factor` with a specific order of factor levels.
#' @param add_contrastdf `data.frame` or `character` or `NULL`,
#'    intended to include a specific contrast in the output.
#'    This argument is typically used during iterative processing,
#'    and is not usually user-defined. It must contain
#' @param contrast_names `character` optional vector of specific
#'    contrasts to use when creating the contrast matrix. When
#'    `contrast_names=NULL` as default, the function defines contrasts
#'    using its internal logic. When `contrast_names` is supplied,
#'    only these `contrast_names` are used, with no other contrasts.
#' @param current_depth `integer` value used during iterative
#'    operations of this function.
#' @param rename_first_depth `logical` value used during iterative
#'    operations of this function.
#' @param return_sedesign `logical` used during iterative
#'    operations of this function. When `return_sedesign=FALSE`
#'    this function returns a `list`:
#'    * `"contrast_df"`: a `data.frame` as used in argument
#'    `add_contrastdf`, which describes each unique contrast.
#'    * `"contrast_names"`: a `character` vector of contrast names,
#'    which become `colnames()` of the contrast matrix.
#'    * `"idesign"`: a `numeric` design matrix as defined by the input data,
#'    suitable for debugging purposes for example.
#' @param verbose `logical` indicating whether to print verbose output.
#' @param ... additional arguments are ignored.
#'
#' @family jam experiment design
#'
#' @examples
#' # first define a vector of sample groups
#' igroups <- jamba::nameVector(paste(rep(c("WT", "KO"), each=6),
#'    rep(c("Control", "Treated"), each=3),
#'    sep="_"),
#'    suffix="_rep");
#' igroups <- factor(igroups, levels=unique(igroups));
#' igroups;
#'
#' sedesign <- groups_to_sedesign(igroups);
#' design(sedesign);
#' contrasts(sedesign);
#'
#' # plot the design and contrasts
#' plot_sedesign(sedesign)
#'
#' # the two-way contrasts can be visibly flipped, since they are equivalent
#' plot_sedesign(sedesign, flip_twoway=TRUE)
#'
#' # the design can be subset by sample
#' all_samples <- samples(sedesign)
#' subset_samples1 <- all_samples[-1:-3];
#' plot_sedesign(sedesign[subset_samples1, ])
#'
#' # the group n=# replicates are updated
#' subset_samples2 <- all_samples[c(-1, -6, -11)];
#' plot_sedesign(sedesign[subset_samples2, ])
#'
#' # The design * contrast matrix can be displayed in full
#' design(sedesign) %*%  contrasts(sedesign);
#'
#' # make "KO" the control term instead of "WT"
#' contrast_names(groups_to_sedesign(igroups, pre_control_terms=c("KO")))
#'
#' # change the order of factors compared
#' contrast_names(groups_to_sedesign(igroups, factor_order=2:1))
#'
#' # prevent comparisons of WT to WT, or KO to KO
#' sedesign_2 <- groups_to_sedesign(as.character(igroups),
#'    remove_pairs=list(c("WT"), c("KO")))
#' contrast_names(sedesign_2)
#' plot_sedesign(sedesign_2)
#'
#' # prevent comparisons of Treated to Treated, or Control to Control
#' sedesign_3 <- groups_to_sedesign(as.character(igroups),
#'    remove_pairs=list(c("Treated"), c("Control")))
#' contrast_names(sedesign_3)
#' plot_sedesign(sedesign_3)
#'
#' # input as a data.frame with ordered factor levels
#' ifactors <- data.frame(Genotype=factor(c("WT","WT","KO","KO"),
#'    levels=c("WT","KO")),
#'    Treatment=factor(c("Treated","Control"),
#'       levels=c("Control","Treated")))
#' # not necessary, but define rownames
#' rownames(ifactors) <- jamba::pasteByRow(ifactors);
#' ifactors;
#' contrast_names(groups_to_sedesign(ifactors))
#' plot_sedesign(groups_to_sedesign(ifactors))
#'
#' # you can still override factor levels with pre_control_terms
#' plot_sedesign(groups_to_sedesign(ifactors, pre_control_terms=c("KO")))
#'
#' # input as design matrix
#' design_matrix <- design(groups_to_sedesign(ifactors))
#' design_matrix
#' contrast_names(groups_to_sedesign(design_matrix))
#'
#' # again the "KO" group can be the control by using pre_control_terms
#' contrast_names(groups_to_sedesign(design_matrix, pre_control_terms="KO"))
#'
#' # default_order="asis"
#' # contrasts show A-B, because B appears fist
#' # contrasts show Untreated-Treated because Treated appears first
#' df_test <- data.frame(
#'    set=c("B", "B", "A", "A"),
#'    treat=c("Treated", "Untreated"))
#' plot_sedesign(groups_to_sedesign(df_test))
#' plot_sedesign(groups_to_sedesign(jamba::pasteByRow(df_test)))
#'
#' # default_order="sort_samples"
#' # contrasts show B-A, because A is sorted first
#' # contrasts show Treated-Untreated because sort_samples()
#' #    determines "Untreated" is a preferred control term
#' plot_sedesign(groups_to_sedesign(df_test,
#'    default_order="sort_samples"))
#'
#' # default_order="mixedSort"
#' # contrasts show B-A, because A is sorted first
#' # contrasts show Untreated-Treated because Treated is sorted first
#' plot_sedesign(groups_to_sedesign(df_test,
#'    default_order="mixedSort"))
#' plot_sedesign(groups_to_sedesign(df_test,
#'    default_order="mixedSort",
#'    pre_control_terms=c("Untreated")))
#'
#' @export
groups_to_sedesign <- function
(ifactors,
 group_colnames=NULL,
 isamples=NULL,
 idesign=NULL,
 factor_order=NULL,
 omit_grep="[-,]",
 max_depth=2,
 factor_sep="_",
 contrast_sep="-",
 remove_pairs=NULL,
 pre_control_terms=NULL,
 add_contrastdf=NULL,
 contrast_names=NULL,
 current_depth=1,
 rename_first_depth=TRUE,
 return_sedesign=TRUE,
 default_order=c("asis",
    "sort_samples",
    "mixedSort"),
 verbose=FALSE,
 ...)
{
   ## Purpose is to take a data.frame, whose rows are groups,
   ## and whose columns are factors with factor levels as column values,
   ## and generate pairwise contrast names where only one factor changes
   ## at a time
   ##
   ## ifactors can be one of the following:
   ##
   ## - data.frame whose columns represent each statistical factor,
   ## whose values are either character, numeric, or factor, the latter
   ## can be ordered in order to provide preference to control groups.
   ##
   ## - vector of character strings representing each group,
   ## where the factors are separated by factor_sep, e.g. "WT_Dex", "NT_Veh"
   ##
   ## - idesign matrix whose colnames represent group names, and rownames
   ## represent samples present in those groups.
   ##
   ## - allNorm list object, with "targets" containing a data.frame of sample
   ## annotations, and group_colnames defines the columns to use for grouping.
   ##
   ## - remove_pairs is a list of vectors, where each vector is expected to
   ## contain two elements representing two factor levels not to be compared.
   ## For example, an experiment with Control, NTC, Vehicle, Dex, might not
   ## want to compare NTC-Control, Vehicle-Control, Dex-Control,
   ## remove_pairs <- list(c("NTC","Control"),c("Vehicle","Control"),c("Dex","Control"));
   ##
   ## TODO: enable remove_pairs to filter out contrasts after they are defined,
   ## for example c("NTC,Control", "d0") would remove the contrast NTC_d0-Control-d0
   ##
   ## makeUnique=TRUE will only return one entry for each set of factors compared,
   ## which will remove cases where factor 2 is tested, then factor 1 tested as an
   ## interaction; if factor 1 and factor 2 are already represented in another
   ## interaction contrast.
   ##
   ## Ultimately a table of experiment design is created, with number of columns
   ## equal to the number of factors. By default the contrasts are applied for
   ## each factor in order of colnames, but factor_order can be used to specify
   ## a custom order. This change can affect the way two-way contrasts are
   ## defined, by forcing the first/internal contrast to use a particular
   ## factor. In theory the result is simply aesthetic, as the underlying
   ## significance of the two-way comparison will be identical. But if not
   ## for aesthetics, what are we doing?
   ##
   ## TODO: fix issue when one column contains numeric values instead of
   ## character or factor, e.g. when "Time" contains c(15,45).
   ## One solution is convert to factor, then proceed.
   if (!suppressPackageStartupMessages(require(limma))) {
      stop("limma is required for groups_to_sedesign()).");
   }
   sample2group <- NULL;

   # validate default_order
   default_order <- match.arg(default_order);

   ## Handle remove_pairs by expanding to both orientations of contrast
   if (!is.null(remove_pairs)) {
      if (!is.list(remove_pairs)) {
         stop("remove_pairs must be a list of 1- or 2-member character vectors");
      }
      remove_pairsFull <- jamba::cPasteS(remove_pairs)
      if (verbose >= 2) {
         jamba::printDebug("groups_to_sedesign(): ",
            "remove_pairsFull:");
         print(remove_pairsFull);
      }
   }

   ## Special case where one data.frame column is sent, which is delimited.
   ## Mainly we treat as a vector, except that we keep the rownames
   ## so we can derive isamples.
   if ("SummarizedExperiment" %in% class(ifactors)) {
      se <- ifactors;
      if (length(isamples) == 0) {
         isamples <- colnames(se)
      }
      group_colnames <- intersect(group_colnames,
         colnames(SummarizedExperiment::colData(se)));
      if (length(group_colnames) > 0) {
         ifactors <- data.frame(check.names=FALSE,
            stringsAsFactors=FALSE,
            SummarizedExperiment::colData(se)[,group_colnames, drop=FALSE])
         rownames(ifactors) <- colnames(se);
         ifactors <- ifactors[match(isamples, rownames(ifactors)),,drop=FALSE];
      }
      if (verbose) {
         jamba::printDebug("groups_to_sedesign(): ",
            indent=(current_depth-1)*5,
            "ifactors from SummarizedExperiment input:");
         print(ifactors);
      }
   }
   if (jamba::igrepHas("data.frame|matrix", class(ifactors)) &&
         ncol(ifactors) == 1) {
      ifactors <- jamba::nameVector(ifactors[,1], rownames(ifactors));
   }

   if (jamba::igrepHas("factor|character", class(ifactors))) {
      #####################################################
      # Vector input
      #
      if (verbose) {
         jamba::printDebug("groups_to_sedesign(): ",
            indent=(current_depth-1)*5,
            "splitting vector into groups");
      }
      if (length(names(ifactors)) == 0) {
         if (length(isamples) == 0) {
            ## Create isamples
            isamples <- jamba::makeNames(rep("sample", length(ifactors)));
            names(ifactors) <- isamples;
         } else if (length(isamples) != length(ifactors)) {
            stop(paste0("length(isamples) must be equal length(ifactors) ",
               "when there are no names(ifactors)."));
         }
         names(ifactors) <- isamples;
      } else if (length(isamples) == 0) {
         isamples <- names(ifactors);
      } else {
         if (!any(isamples %in% names(ifactors)) && length(isamples) == length(ifactors)) {
            ## Use isamples as-is
            names(ifactors) <- isamples;
         } else if (!all(isamples %in% names(ifactors))) {
            stop(paste0("isamples is present in some not not all names(ifactors). ",
               "isamples must either: all be present in names(ifactors); or ",
               "present in none of names(ifactors) and length(isamples) == length(ifactors)."))
         } else {
            ## Re-order ifactors to match isamples
            ifactors <- ifactors[match(isamples, names(ifactors))];
         }
      }
      if (jamba::igrepHas("factor", class(ifactors))) {
         # Convert factor to a data.frame where each column
         # is a factor with ordered levels that match the order
         # the factor levels appear in the original factor.
         iFactorsL <- strsplitOrdered(ifactors, factor_sep);
         names(iFactorsL) <- names(ifactors);
         iFactorsLevels <- levels(iFactorsL[[1]]);
         ifactors <- data.frame(check.names=FALSE,
            stringsAsFactors=FALSE,
            jamba::rbindList(
               strsplit(as.character(ifactors),
                  factor_sep)));
         rownames(ifactors) <- names(iFactorsL);
         for (i in seq_len(ncol(ifactors))) {
            factor_levels <- intersect(iFactorsLevels, ifactors[,i]);
            if (length(pre_control_terms) > 0) {
               # when pre_control_terms are supplied, sort them first
               factor_levels <- unique(c(
                  intersect(pre_control_terms, factor_levels),
                  factor_levels))
            }
            if (verbose) {
               jamba::printDebug("groups_to_sedesign(): ",
                  indent=(current_depth-1)*5,
                  "factor_levels (", i, "): ", factor_levels);
            }
            ifactors[,i] <- factor(ifactors[,i],
               levels=factor_levels);
         }
      } else {
         # split into data.frame
         ifactors <- data.frame(check.names=FALSE,
            stringsAsFactors=FALSE,
            jamba::rbindList(strsplit(ifactors, factor_sep)));
         # Convert each column to factor for proper sort order
         for (iCol in seq_len(ncol(ifactors))) {
            if ("asis" %in% default_order) {
               factor_levels <- unique(ifactors[,iCol]);
               if (length(pre_control_terms) > 0) {
                  # when pre_control_terms are supplied, sort them first
                  factor_levels <- unique(c(
                     intersect(pre_control_terms, factor_levels),
                     factor_levels))
               }
            } else if ("sort_samples" %in% default_order) {
               factor_levels <- sort_samples(unique(ifactors[[iCol]]),
                  pre_control_terms=pre_control_terms,
                  ...)
            } else {
               factor_levels <- jamba::mixedSort(unique(ifactors[[iCol]]),
                  ...);
               if (length(pre_control_terms) > 0) {
                  factor_levels <- unique(c(
                     intersect(pre_control_terms, factor_levels),
                     factor_levels));
               }
            }
            if (verbose) {
               jamba::printDebug("groups_to_sedesign(): ",
                  indent=(current_depth-1)*5,
                  "factor_levels (", iCol, "): ", factor_levels);
            }
            ifactors[,iCol] <- factor(ifactors[,iCol],
               levels=factor_levels);
         }
      }
      if (length(group_colnames) > 0) {
         colnames(ifactors) <- jamba::makeNames(rep(group_colnames,
            length.out=ncol(ifactors)),
            renameFirst=FALSE);
      } else {
         colnames(ifactors) <- jamba::makeNames(
            rep("factor",
               length.out=ncol(ifactors)),
            renameOnes=TRUE);
      }
      if (length(rownames(ifactors)) == 0) {
         rownames(ifactors) <- jamba::makeNames(
            jamba::pasteByRow(ifactors, sep=factor_sep),
            suffix="_rep");
      }
      if (verbose) {
         jamba::printDebug("groups_to_sedesign(): ",
            indent=(current_depth-1)*5,
            "head(ifactors, 40) as recognized:");
         print(head(ifactors, 40));
      }

      # Assume sample rows and group columns
      rowGroups <- jamba::pasteByRowOrdered(ifactors, sep=factor_sep);
      sample2group <- split(rownames(ifactors), rowGroups);
      if (length(idesign) == 0) {
         idesign <- list2im_opt(sample2group, do_sparse=FALSE)[
            rownames(ifactors), levels(rowGroups),drop=FALSE];
      }
   } else if (jamba::igrepHas("data.frame|dataframe|tbl", class(ifactors)) ||
         ("matrix" %in% class(ifactors) & !is.numeric(ifactors))) {
      #####################################################
      # data.frame input, or matrix with non-numeric data
      #
      if (verbose) {
         jamba::printDebug("groups_to_sedesign(): ",
            indent=(current_depth-1)*5,
            "using existing data.frame");
      }
      if (length(rownames(ifactors)) == 0) {
         if (length(isamples) == 0) {
            # Create isamples
            isamples <- jamba::makeNames(rep("sample", nrow(ifactors)));
         } else if (length(isamples) == nrow(ifactors)) {
            # use isamples as-is
         } else {
            stop(paste0("ifactors has no rownames, and ",
               "length(isamples) != nrow(ifactors). ",
               "Please make length(isamples) == nrow(iFactor)"));
         }
      } else {
         if (length(isamples) == 0) {
            isamples <- rownames(ifactors);
         } else {
            if (!any(isamples %in% ifactors) && length(isamples) == nrow(ifactors)) {
               ## use isamples as-is
            } else if (!all(isamples %in% rownames(ifactors))) {
               stop(paste0("isamples is not present in all rownames(ifactors). ",
                  "Either: all isamples must be present in rownames(ifactors); or ",
                  "no isamples are present in rownames(ifactors) and ",
                  "length(isamples) == nrow(ifactors)."));
            } else {
               ## Subset or re-order ifactors using matching isamples
               ifactors <- ifactors[match(isamples, rownames(ifactors)),,drop=FALSE];
               if (verbose) {
                  jamba::printDebug("groups_to_sedesign(): ",
                     indent=(current_depth-1)*5,
                     "Specifying ifactors[isamples,]");
                  print(head(ifactors));
               }
            }
         }
         if (verbose >= 2) {
            jamba::printDebug("groups_to_sedesign(): ",
               indent=(current_depth-1)*5,
               "head(ifactors):");
            print(head(ifactors, 100));
         }
      }
      if (length(group_colnames) == 0) {
         if (length(colnames(ifactors)) == 0) {
            ## Create colnames
            group_colnames <- jamba::makeNames(
               renameOnes=TRUE,
               rep("factor",
                  length.out=ncol(ifactors)));
            colnames(ifactors) <- group_colnames;
         } else {
            group_colnames <- colnames(ifactors);
         }
      } else {
         if (!all(group_colnames %in% colnames(ifactors))) {
            stop(paste0("Not all group_colnames are in colnames(ifactors), please remedy."));
         }
         ## Use ifactors as-is
         #ifactors <- ifactors[,group_colnames,drop=FALSE];
      }
      if (verbose) {
         # jamba::printDebug("groups_to_sedesign(): ",
         #    indent=(current_depth-1)*5,
         #    "Specifying ifactors[, group_colnames, drop=FALSE]");
         jamba::printDebug("groups_to_sedesign(): ",
            indent=(current_depth-1)*5,
            "group_colnames:",
            group_colnames);
      }

      # default_order == "asis" will convert character columns to factor
      #    using the observed order of terms as factor levels
      for (icol in group_colnames) {
         if ("factor" %in% class(ifactors[, icol])) {
            factor_levels <- levels(ifactors[, icol]);
            if (length(pre_control_terms) > 0) {
               factor_levels <- unique(c(
                  intersect(pre_control_terms, factor_levels),
                  factor_levels));
               ifactors[, icol] <- factor(ifactors[, icol],
                  levels=factor_levels,
                  exclude=NULL);
            }
            if (verbose) {
               jamba::printDebug("groups_to_sedesign(): ",
                  indent=(current_depth-1)*5,
                  "factor_levels (", icol, "): ", factor_levels);
            }
         } else {
            if ("asis" %in% default_order) {
               if (verbose) {
                  jamba::printDebug("groups_to_sedesign(): ",
                     indent=(current_depth-1)*5,
                     "Converting '", icol, "' to factor using default_order: ", "asis");
               }
               factor_levels <- unique(ifactors[,icol]);
               if (length(pre_control_terms) > 0) {
                  factor_levels <- unique(c(
                     intersect(pre_control_terms, factor_levels),
                     factor_levels));
               }
            } else if ("sort_samples" %in% default_order) {
               if (verbose) {
                  jamba::printDebug("groups_to_sedesign(): ",
                     indent=(current_depth-1)*5,
                     "Converting '", icol, "' to factor using default_order: ", "sort_samples");
               }
               factor_levels <- sort_samples(unique(ifactors[,icol]),
                  pre_control_terms=pre_control_terms,
                  ...);
            } else {
               if (verbose) {
                  jamba::printDebug("groups_to_sedesign(): ",
                     indent=(current_depth-1)*5,
                     "Converting '", icol, "' to factor using default_order: ", "mixedSort");
               }
               factor_levels <- jamba::mixedSort(unique(ifactors[, icol]),
                  ...);
               if (length(pre_control_terms) > 0) {
                  factor_levels <- unique(c(
                     intersect(pre_control_terms, factor_levels),
                     factor_levels));
               }
            }
            if (verbose) {
               jamba::printDebug("groups_to_sedesign(): ",
                  indent=(current_depth-1)*5,
                  "factor_levels (", icol, "): ", factor_levels);
            }
            ifactors[, icol] <- factor(ifactors[, icol],
               levels=factor_levels,
               exclude=NULL);
         }
      }
      # default_order == "mixedSort" will use alphanumeric sort
      # jamba::mixedSortDF() will honor factor level orders when present,
      # otherwise will use alphanumeric sort order.
      # To influence the sort order, use factors with ordered levels.
      ifactors <- jamba::mixedSortDF(ifactors,
         honorFactor=TRUE,
         byCols=group_colnames);
      if (verbose >= 2) {
         jamba::printDebug("groups_to_sedesign(): ",
            indent=(current_depth-1)*5,
            "ifactors:");
         print(head(ifactors));
      }

      ## rowGroups is the unique set of group names, used to keep the original order
      #rowGroups <- jamba::pasteByRowOrdered(ifactors[,group_colnames,drop=FALSE],
      #   sep=factor_sep);
      ## Unclear whether to re-order columns to match group_colnames, for now we do not
      rowGroups <- jamba::pasteByRowOrdered(ifactors,
         sep=factor_sep);
      if (length(rownames(ifactors)) == 0) {
         iFactors_names <- jamba::makeNames(rowGroups,
            suffix="_rep");
         rownames(ifactors) <- iFactors_names;
      } else {
         iFactors_names <- rownames(ifactors);
      }
      ## Assume for now sample rows and group columns
      sample2group <- split(iFactors_names, rowGroups);
      if (length(idesign) == 0) {
         idesign <- list2im_opt(sample2group,
            do_sparse=FALSE)[iFactors_names, as.character(unique(rowGroups)), drop=FALSE];
         if (all(isamples %in% iFactors_names)) {
            idesign <- idesign[match(isamples, iFactors_names),,drop=FALSE];
         }
      } else {
         if (length(isamples) > 0) {
            idesign <- idesign[match(isamples, rownames(idesign)),,drop=FALSE];
         }
      }
   } else if ("matrix" %in% class(ifactors) && all(c(0,1) %in% ifactors)) {
      ##################################
      # design matrix input
      #
      if (verbose) {
         jamba::printDebug("groups_to_sedesign(): ",
            indent=(current_depth-1)*5,
            "converting idesign into ifactors data.frame");
      }
      ## Assume for now, idesign matrix with sample rows and group columns
      sample2group <- split(rownames(ifactors),
         sapply(seq_len(nrow(ifactors)), function(i){
            colnames(ifactors)[which(ifactors[i,] != 0)];
         }));
      idesign <- list2im_opt(sample2group, do_sparse=FALSE)[
         rownames(ifactors),names(sample2group)];
      iFactorsCols <- colnames(ifactors);
      ifactors <- data.frame(check.names=FALSE,
         stringsAsFactors=FALSE,
         jamba::rbindList(strsplit(iFactorsCols, factor_sep)));
      if (!is.null(group_colnames)) {
         colnames(ifactors) <- jamba::makeNames(
            rep(group_colnames, length.out=ncol(ifactors)),
            renameFirst=FALSE);
      } else {
         colnames(ifactors) <- jamba::makeNames(
            rep("groupFactor",
               length.out=ncol(ifactors)),
            renameOnes=TRUE,
            suffix="_");
      }
      group_colnames <- colnames(ifactors);
      rownames(ifactors) <- unname(
         jamba::pasteByRow(ifactors, sep=factor_sep));

      # define factor order
      for (icol in group_colnames) {
         if ("asis" %in% default_order) {
            if (verbose) {
               jamba::printDebug("groups_to_sedesign(): ",
                  indent=(current_depth-1)*5,
                  "Converting '", icol, "' to factor using default_order: ", "asis");
            }
            factor_levels <- unique(ifactors[,icol]);
            if (length(pre_control_terms) > 0) {
               factor_levels <- unique(c(
                  intersect(pre_control_terms, factor_levels),
                  factor_levels));
            }
         } else if ("sort_samples" %in% default_order) {
            if (verbose) {
               jamba::printDebug("groups_to_sedesign(): ",
                  indent=(current_depth-1)*5,
                  "Converting '", icol, "' to factor using default_order: ", "sort_samples");
            }
            factor_levels <- sort_samples(unique(ifactors[,icol]),
               pre_control_terms=pre_control_terms,
               ...);
         } else {
            if (verbose) {
               jamba::printDebug("groups_to_sedesign(): ",
                  indent=(current_depth-1)*5,
                  "Converting '", icol, "' to factor using default_order: ", "mixedSort");
            }
            factor_levels <- jamba::mixedSort(unique(ifactors[, icol]),
               ...);
            if (length(pre_control_terms) > 0) {
               factor_levels <- unique(c(
                  intersect(pre_control_terms, factor_levels),
                  factor_levels));
            }
         }
         if (verbose) {
            jamba::printDebug("groups_to_sedesign(): ",
               indent=(current_depth-1)*5,
               "factor_levels (", icol, "): ", factor_levels);
         }
         ifactors[, icol] <- factor(ifactors[, icol],
            levels=factor_levels,
            exclude=NULL);
      }

      if (verbose) {
         jamba::printDebug("ifactors:",
            indent=(current_depth-1)*5)
         print(ifactors);
      }
   }
   if (verbose >= 2) {
      jamba::printDebug("groups_to_sedesign(): ",
         indent=(current_depth-1)*5,
         "ifactors:");
      print(head(ifactors));
      if (!is.null(sample2group)) {
         jamba::printDebug("sample2group:",
            indent=(current_depth-1)*5)
         print(head(sample2group));
      }
   }

   ##########################################################
   ## Check to make sure no factor levels contain "-"
   for (i in colnames(ifactors)) {
      if (jamba::igrepHas("-", ifactors[,i])) {
         ifactors[,i] <- gsub("-", ".", ifactors[,i]);
      }
   }

   ##########################################################
   ## First check to make sure the ifactors values are unique
   ## and if not, use only unique entries
   iContrastGroupsUse <- colnames(ifactors);
   iFactorsV <- jamba::pasteByRow(ifactors, sep=factor_sep);
   iKeepRows <- match(unique(iFactorsV), iFactorsV);
   ifactors <- ifactors[iKeepRows,,drop=FALSE];
   if (rename_first_depth && current_depth==1) {
      rownames(ifactors) <- jamba::pasteByRow(ifactors, sep=factor_sep);
   }

   if (verbose >= 2) {
      jamba::printDebug("groups_to_sedesign(): ",
         indent=(current_depth-1)*5,
         "ifactors:");
      print(head(ifactors));
   }


   if (verbose) {
      jamba::printDebug("groups_to_sedesign(): ",
         indent=(current_depth-1)*5,
         "current_depth:",
         current_depth);
      jamba::printDebug("groups_to_sedesign(): ",
         indent=(current_depth-1)*5,
         "return_sedesign: ", return_sedesign)
   }

   ##########################################################
   ## Iterate each factor in order, and create valid contrasts
   ## Note: we allow applying contrasts in a different order than the
   ## columns in iFactor, if !is.null(factor_order)
   ##
   # ensure factor_order only matches columns provided
   if (length(contrast_names) == 0) {
      factor_order <- factor_order[factor_order <= ncol(ifactors)]
      if (length(factor_order) == 0) {
         factor_order <- seq_along(colnames(ifactors));
      }
      # ensure max_depth is no larger than the number of factors
      max_depth <- min(c(max_depth, length(factor_order)))

      ##
      if (verbose) {
         jamba::printDebug("groups_to_sedesign(): ",
            indent=(current_depth-1)*5,
            "factor_order values:",
            colnames(ifactors)[factor_order]);
      }
      ###################################
      # Define iContrastNames
      iContrastNames <- data.frame(check.names=FALSE,
         stringsAsFactors=FALSE,
         jamba::rbindList(lapply(factor_order, function(iChange){
            if (verbose) {
               jamba::printDebug("groups_to_sedesign(): ",
                  indent=(current_depth-1)*5,
                  "factor_order iChange:",
                  colnames(ifactors)[iChange]);
            }
            iNoChange <- setdiff(seq_len(ncol(ifactors)), iChange);
            ## Optionally omit certain values from consideration,
            ## notably for "," or "-" which already contain changing factors
            iFactorUseRows <- jamba::unigrep(omit_grep, ifactors[,iChange]);

            if (length(iNoChange) == 0) {
               iSplit <- rep("", length(iFactorUseRows));
            } else {
               iSplit <- jamba::pasteByRowOrdered(ifactors[iFactorUseRows,iNoChange,drop=FALSE],
                  sep=factor_sep);
            }

            ## Split rows by constant values in non-changing factor columns
            iSplitL <- split(iFactorUseRows, iSplit);
            iSplitL <- iSplitL[lengths(iSplitL) > 1];
            ## Only consider contrasts when there are multiple rows
            if (length(iSplitL) > 0) {
               iDF <- jamba::rbindList(lapply(iSplitL, function(iSplitRows) {
                  use_factor_order <- unique(c(factor_order,
                     seq_len(ncol(ifactors))));
                  iFactorsSub <- ifactors[iSplitRows, use_factor_order, drop=FALSE];
                  if (verbose >= 2) {
                     jamba::printDebug("groups_to_sedesign(): ",
                        indent=(current_depth-1)*5,
                        "   iSplitRows:",
                        iSplitRows,
                        ", use_factor_order:", use_factor_order);
                     jamba::printDebug("groups_to_sedesign(): ",
                        indent=(current_depth-1)*5,
                        "   iFactorsSub:");
                     print(iFactorsSub);
                  }
                  iFactorVals <- iFactorsSub[,colnames(ifactors)[iChange]];
                  iMatch <- match(
                     sort_samples(iFactorVals,
                        pre_control_terms=pre_control_terms),
                     iFactorVals);
                  # 0.0.27.900: fix for one factor column input
                  if (length(iMatch) < 2) {
                     return(NULL)
                  }
                  iCombn <- combn(iMatch, 2);
                  iGrp1 <- ifelse(grepl("-", rownames(iFactorsSub)[iCombn[2,]]),
                     paste0("(", rownames(iFactorsSub)[iCombn[2,]], ")"),
                     rownames(iFactorsSub)[iCombn[2,]]);
                  iGrp2 <- ifelse(grepl("-", rownames(iFactorsSub)[iCombn[1,]]),
                     paste0("(", rownames(iFactorsSub)[iCombn[1,]], ")"),
                     rownames(iFactorsSub)[iCombn[1,]]);
                  iContrastName <- paste0(iGrp1, "-", iGrp2);
                  icondf <- iFactorsSub[intercalate(iCombn[2,], iCombn[1,]),,drop=FALSE];
                  iconfac <- factor(rep(iContrastName, each=2),
                     levels=unique(iContrastName));
                  iContrastDF <- data.frame(check.names=FALSE,
                     stringsAsFactors=FALSE,
                     lapply(jamba::nameVector(colnames(icondf)), function(i){
                        jamba::cPasteU(split(icondf[,i], iconfac))
                     }),
                     contrastName=iContrastName,
                     row.names=iContrastName);

                  # Create a string representing the combination of factors.
                  # which we will use to prevent re-creating the same contrasts.
                  #
                  # Modified the string to include colname, to ensure that two
                  # factors which may share some levels, will not be confused.
                  iContrastDF[,"contrastString"] <- jamba::pasteByRow(
                     iContrastDF[,colnames(iFactorsSub),drop=FALSE],
                     includeNames=TRUE,
                     sep=";",
                     sepName=":");
                  iContrastDF;
               }));
               rownames(iDF) <- iDF[,"contrastName"];
               if (verbose) {
                  jamba::printDebug("groups_to_sedesign(): ",
                     indent=(current_depth-1)*5,
                     "   new contrasts:\n",
                     rownames(iDF),
                     sep=",\n");
               }
               iDF;
            } else {
               NULL;
            }
         })));
      # Define iContrastNames (end)
      ###################################

      ## Optionally spike in some pre-defined non-standard contrasts
      if (!is.null(add_contrastdf)) {
         if (verbose) {
            jamba::printDebug("groups_to_sedesign(): ",
               indent=(current_depth-1)*5,
               "Adding custom ",
               "add_contrastdf");
         }
         iContrastNames <- rbind(iContrastNames, add_contrastdf);
      }

      # Always make each row unique in terms of the factors compared.
      # Note: This step enforces order of comparison in two-way contrasts.
      # if (make_unique) {
      if (TRUE) {
         iDFcomponents <- jamba::pasteByRow(
            iContrastNames[,setdiff(colnames(iContrastNames), "contrastName"),drop=FALSE],
            sep="!");
         if (verbose >= 2) {
            jamba::printDebug("groups_to_sedesign(): ",
               indent=(current_depth-1)*5,
               "iDFcomponents:\n",
               iDFcomponents, sep="\n");
            jamba::printDebug("groups_to_sedesign(): ",
               "unique(iDFcomponents):\n",
               unique(iDFcomponents), sep="\n");
         }
         if (verbose && any(duplicated(iDFcomponents))) {
            dupe_comps <- iDFcomponents[duplicated(iDFcomponents)];
            dupe_kept_df <- data.frame(
               stringsAsFactors=FALSE,
               dupe_comp=iDFcomponents[iDFcomponents %in% dupe_comps],
               contrast=rownames(subset(iContrastNames, iDFcomponents %in% dupe_comps)),
               outcome=ifelse(!duplicated(iDFcomponents[iDFcomponents %in% dupe_comps]), "(kept)", "(removed)"))
            dupe_kept_df <- jamba::mixedSortDF(byCols=1, dupe_kept_df);
            jamba::printDebug("groups_to_sedesign(): ",
               indent=(current_depth-1)*5,
               "   removed duplicate (equivalent) contrasts:");
            print(dupe_kept_df[,-1, drop=FALSE]);
         }
         iContrastNames <- subset(iContrastNames, !duplicated(iDFcomponents));
      }

      if ("contrastName" %in% colnames(iContrastNames)) {
         if (verbose >= 2) {
            jamba::printDebug("groups_to_sedesign(): ",
               indent=(current_depth-1)*5,
               "tcount(iContrastNames$contrastName):")
            print(jamba::tcount(iContrastNames[,"contrastName"]));
         }
         rownames(iContrastNames) <- jamba::makeNames(iContrastNames[,"contrastName"]);
      }

      # Optionally remove contrasts with factor pairs in remove_pairs
      if (length(remove_pairs) > 0) {
         if (verbose) {
            jamba::printDebug("groups_to_sedesign(): ",
               indent=(current_depth-1)*5,
               "Processing any remove_pairs contrasts.");
         }
         for (iCol in setdiff(colnames(iContrastNames), "contrastName")) {
            if (verbose) {
               jamba::printDebug("groups_to_sedesign(): ",
                  indent=(current_depth-1)*5,
                  "   Checking for remove_pairs in column:", iCol);
            }
            iColVals <- jamba::cPasteS(strsplit(as.character(iContrastNames[[iCol]]), ","));
            if (any(iColVals %in% remove_pairsFull)) {
               iWhich1 <- which(iColVals %in% remove_pairsFull);
               iWhich <- which(!iColVals %in% remove_pairsFull);
               if (verbose) {
                  jamba::printDebug("      removedPair with values:\n",
                     indent=(current_depth-1)*5,
                     unique(iColVals[iWhich1]),
                     fgText=c("yellow", "purple"), sep="\n");
               }
               iContrastNames <- iContrastNames[iWhich,,drop=FALSE];
            }
         }
         if (nrow(iContrastNames) == 0) {
            warning("No contrasts remain after filtering remove_pairs.");
            return(NULL);
         }
      }

      if (verbose >= 2) {
         jamba::printDebug("groups_to_sedesign(): ",
            indent=(current_depth-1)*5,
            "iContrastNames:");
         print(head(iContrastNames, 100));
      }

      ##################################################
      # Interaction contrasts (iterative processing)
      if (length(setdiff(colnames(iContrastNames), "contrastName")) > 1 &&
            current_depth < max_depth) {
         iContrastNamesUse <- iContrastNames[,iContrastGroupsUse,drop=FALSE];
         for (i in iContrastGroupsUse) {
            j <- jamba::provigrep(c("^[^,]+$", "."), iContrastNamesUse[[i]]);
            iContrastNamesUse[[i]] <- factor(iContrastNamesUse[[i]],
               levels=unique(j));
         }
         if (verbose >= 2) {
            jamba::printDebug("groups_to_sedesign(): ",
               indent=(current_depth-1)*5,
               "   Defining interactions contrasts.");
            print(head(iContrastNamesUse[,iContrastGroupsUse,drop=FALSE], 100));
         }
         iContrastNamesInt <- groups_to_sedesign(iContrastNamesUse,
            omit_grep=omit_grep,
            current_depth=current_depth + 1,
            max_depth=max_depth,
            return_sedesign=FALSE,
            factor_sep=factor_sep,
            factor_order=rev(factor_order),
            contrast_sep=contrast_sep,
            rename_first_depth=rename_first_depth,
            remove_pairs=remove_pairs,
            pre_control_terms=pre_control_terms,
            verbose=verbose,
            ...);
         # return value should be list with contrast_df, contrast_names, idesign
         if (verbose >= 2) {
            jamba::printDebug("groups_to_sedesign(): ",
               indent=(current_depth-1)*5,
               "length(iContrastNamesInt):",
               length(iContrastNamesInt));
            print(iContrastNamesInt);
         }
         ## removed 0.0.31.900
         # If length==0 then there are no valid interaction contrasts
         # if (length(iContrastNamesInt) > 0 &&
         #       jamba::igrepHas("[(]", rownames(iContrastNamesInt[[1]]))) {
         #    return(iContrastNamesInt);
         # }
         ## updated 0.0.31.900: although not sure why there might be NA values
         if (length(iContrastNamesInt) > 0 &&
               ncol(iContrastNamesInt$contrast_df) > 1 &&
               any(is.na(iContrastNamesInt$contrast_df[,1]))) {
            iContrastNamesInt$contrast_df <- subset(iContrastNamesInt$contrast_df,
               !is.na(iContrastNamesInt$contrast_df[,1]));
         }
         ## updated 0.0.31.900: if there are interaction contrasts, append them
         if (length(iContrastNamesInt$contrast_df) > 0 &&
               ncol(iContrastNamesInt$contrast_df) > 1) {
            if (verbose >= 1) {
               jamba::printDebug("groups_to_sedesign(): ",
                  indent=(current_depth-1)*5,
                  "Combining iContrastNames with iContrastNamesInt.");
            }
            if (verbose >= 2) {
               jamba::printDebug("groups_to_sedesign(): ",
                  indent=(current_depth-1)*5,
                  "head(iContrastNames):");
               print(head(iContrastNames));
               jamba::printDebug("groups_to_sedesign(): ",
                  indent=(current_depth-1)*5,
                  "begin(iContrastNamesInt):");
               print(head(iContrastNamesInt));
               # jamba::printDebug("  end iContrastNamesInt:");
            }
            iContrastNames <- jamba::rbindList(list(iContrastNames,
               iContrastNamesInt$contrast_df));
            if (verbose >= 1) {
               jamba::printDebug("groups_to_sedesign(): ",
                  indent=(current_depth-1)*5,
                  "combined head(iContrastNames):");
               print(head(iContrastNames));
               jamba::printDebug("groups_to_sedesign(): ",
                  indent=(current_depth-1)*5,
                  "combined tail(iContrastNames):");
               print(tail(iContrastNames));
            }
         }
      } else {
         # no interaction contrasts to be defined
         if (verbose >= 2) {
            jamba::printDebug("groups_to_sedesign(): ",
               indent=(current_depth-1)*5,
               "   Skipping interactions");
            jamba::printDebug("      ncol(iContrastNames):",
               indent=(current_depth-1)*5,
               ncol(iContrastNames));
            jamba::printDebug("      head(iContrastNames):");
            print(head(iContrastNames));
         }
      }
      if ("contrastName" %in% colnames(iContrastNames)) {
         rownames(iContrastNames) <- jamba::makeNames(iContrastNames[["contrastName"]]);
         contrast_names <- unique(iContrastNames[["contrastName"]]);
      }
   }
   # end of automatic contrast definition
   ######################################################

   retvals <- list();
   if (TRUE %in% return_sedesign && current_depth == 1) {
      icontrasts <- NULL;
      if (!is.null(idesign) && length(contrast_names) > 0) {
         icontrasts <- limma::makeContrasts(contrasts=contrast_names,
            levels=idesign);
      }
      retvals <- validate_sedesign(
         new("SEDesign",
            design=idesign,
            contrasts=icontrasts));
   } else {
      retvals$contrast_df <- iContrastNames;
      retvals$contrast_names <- contrast_names;
      retvals$idesign <- idesign;
   }
   if (verbose) {
      jamba::printDebug("groups_to_sedesign(): ",
         indent=(current_depth-1)*5,
         "current_depth: ", current_depth,
         ", return_sedesign: ", return_sedesign)
   }
   return(retvals);
}



#' Sort biological sample labels for experimental design
#'
#' Sort biological sample labels for experimental design
#'
#' This function sorts a vector of sample labels using typical
#' heuristics that order typical control groups terms before
#' test groups. For example, `"Vehicle"` would be returned
#' before `"Treatment"` since `"Vehicle"` is a recognized control
#' term.
#'
#' It also employs `jamba::mixedSort()` for
#' proper alphanumeric sorting, for example so `"Time_5hr"` would
#' be sorted before `"Time_12hr"`.
#'
#' @return character vector ordered such that control terms are
#' preferentially first before non-control terms.
#'
#' @param x character vector or factor
#' @param control_terms vector of regular expression patterns used to
#'    determine control terms, where the patterns are matched and
#'    returned in order.
#' @param pre_control_terms vector or NULL, optional control
#'    terms or regular expressions to use before the `control_terms`
#'    above. This argument is used as a convenient prefix to the
#'    default terms.
#' @param post_control_terms vector or NULL, optional control
#'    terms or regular expressions to use after the `control_terms`
#'    above. This argument is used as a convenient suffix to the
#'    default terms.
#' @param ignore.case logical passed to `jamba::provigrep()` indicating
#'    whether to ignore case-sensitive matching.
#' @param boundary logical indicating whether to require a word
#'    boundary at either the start or end of the control terms.
#'    When TRUE, it uses `perl=TRUE` by default, and allows either
#'    perl boundary or an underscore `"_"`.
#' @param perl logical indicating whether to use Perl regular
#'    expression pattern matching.
#' @param keep_factor_order logical indicating whether to maintain
#'    factor level order, if `x` is supplied as a factor. If
#'    `keep_factor_order==TRUE` then only `sort(x)` is returned.
#' @param ... additional arguments are ignored.
#'
#' @family jam string functions
#' @family jam RNA-seq functions
#'
#' @examples
#' # the defaults perform well for clear descriptors
#' sort_samples(c("Trt_12h", "Trt_9h", "Trt_1h", "Trt_9h", "Vehicle"));
#'
#' # custom terms can be added before the usual control terms
#' sort_samples(c("Trt_12h", "Trt_9h", "Trt_1h", "Trt_9h", "Fixated", "Vehicle"),
#'    pre_control_terms="fixate");
#'
#' # custom terms can be added after the usual control terms
#' sort_samples(c("Trt_12h", "Trt_9h", "Trt_1h", "Trt_9h", "Fixated", "Vehicle"),
#'    post_control_terms="fixate");
#'
#' @export
sort_samples <- function
(x,
 control_terms=c(
    "WT|wildtype",
    "normal|healthy|healthycontrol|^hc$",
    "control|ctrl|ctl",
    "(^|[-_ ])(NT|NTC)($|[-_ ]|[0-9])",
    "none|empty|blank",
    "untreated|untrt|untreat",
    "Vehicle|veh",
    "ETOH|ethanol",
    #"(time|day|hour|min|minute)[s]{0,1}[0]",
    "scramble|mock|sham",
    "ttx|PBS",
    "knockout",
    "mutant"),
 sortFunc=jamba::mixedSort,
 pre_control_terms=NULL,
 post_control_terms=NULL,
 ignore.case=TRUE,
 boundary=TRUE,
 perl=boundary,
 keep_factor_order=TRUE,
 ...)
{
   ## Purpose is to order sample names by typical descriptions
   ## of control groups versus treatment groups
   ##
   ## Test set:
   ## sort_samples(c("Trt_12h", "Trt_9h", "Trt_1h", "Vehicle"))
   ## sort_samples(c("RA_Brg1", "EtOH_WT", "RA_WT", "EtOH_Brg1"))
   ## sort_samples(c("HCTWT_DXR6", "HCTWT_DXR12", "HCTWT_DXR24", "HCTWT_NT24"))
   #order1 <- jamba::proigrep(c(control_terms), x);
   #order2 <- jamba::proigrep(c(control_terms, "."), sortFunc=sortFunc, x);
   ##
   ## keep_factor_order=TRUE will keep factor levels unchanged, and use those levels in the sort
   ## instead of looking for control terms
   if (keep_factor_order && jamba::igrepHas("factor", class(x))) {
      sort(x);
   } else {
      control_terms <- unique(c(pre_control_terms,
         control_terms,
         post_control_terms));
      if (any(boundary)) {
         # Require regular expression boundary
         control_terms1 <- unlist(lapply(control_terms, function(i){
            paste0("(_|\\b)(", i, ")|(", i, ")(_|\\b)")
         }))
         if (any(!boundary)) {
            control_terms <- c(control_terms1,
               control_terms);
         } else {
            control_terms <- control_terms1;
         }
      }
      xU <- jamba::provigrep(c(control_terms, "."),
         sortFunc=sortFunc,
         perl=perl,
         ignore.case=ignore.case,
         x);
      xOrder <- order(match(x, xU));
      x <- x[xOrder];
      #attr(x, "control_terms") <- control_terms;
      x;
   }
}

#' Split the elements of an ordered factor vector
#'
#' Split the elements of an ordered factor vector
#'
#' This function performs `base::strsplit()` while trying to maintain
#' the order of factor levels in the output, based upon the order of
#' factor levels in the input data.
#'
#' @return list of factor vectors, where each factor shares the same
#'    global factor levels based upon the input data.
#'
#' @param x character or factor vector.
#' @param split character split value sent to `base::strsplit()`.
#' @param fixed,perl,useBytes additional arguments sent to `base::split()`.
#' @param sortFunc function used to sort character values when the input
#'    `x` is a character vector. The default `jamba::mixedSort()` applies
#'    alphanumeric sort.
#' @param keepOrder logical indicating whether to keep the order of values
#'    in the input data, for example with character input the values will
#'    be ordered by the first appearance of each term.
#' @param ... additional arguments are ignored.
#'
#' @family jamses utilities
#'
#' @examples
#' # first define a vector of sample groups
#' iGroups <- jamba::nameVector(paste(rep(c("WT", "KO"), each=6),
#'    rep(c("Control", "Treated"), each=3),
#'    sep="_"));
#' iGroups <- factor(iGroups, levels=unique(iGroups));
#' iGroups;
#' strsplitOrdered(iGroups, "_");
#'
#' @export
strsplitOrdered <- function
(x,
 split="_",
 fixed=FALSE,
 perl=FALSE,
 useBytes=FALSE,
 sortFunc=jamba::mixedSort,
 keepOrder=TRUE,
 ...)
{
   ## Purpose is to run strsplit() on factors, ordering the new factor
   ## levels consistent with the input
   if (!jamba::igrepHas("factor", class(x))) {
      if (keepOrder) {
         x <- factor(x,
            levels=unique(x));
      } else {
         x <- factor(x,
            levels=sortFunc(unique(x)));
      }
   }
   soL <- strsplit(x=levels(x),
      split=split,
      fixed=fixed,
      perl=perl,
      useBytes=useBytes);
   so1 <- jamba::rbindList(soL);

   ## Note: the setdiff() is there to remove "" values
   so1levels <- setdiff(unique(unlist(apply(so1, 2, unique))), "");
   soSplitL <- strsplit(as.character(x),
      split=split,
      fixed=fixed,
      perl=perl,
      useBytes=useBytes);
   soLordered <- lapply(soSplitL, function(i){
      factor(i,
         levels=so1levels);
   });
   return(soLordered);
}


#' Intercalate two or more vectors
#'
#' @export
intercalate <- function
(...)
{
   ## Purpose is to take a list of vectors, and intercalate their values, e.g.
   ## list1 <- paste("name1", letters[1:10], sep="");
   ## list2 <- paste("name2", letters[1:10], sep="");
   ## intercalate(list1, list2);
   ## name1a, name2a, name1b, name2b, name1c, name2c, etc.
   ##
   ## The special case where there are two lists, and the first has
   ## one element more than the second, then the second will only have
   ## its values in between the first, e.g.
   ## A B A B A B A
   ##
   ## Note: rmNULL() will remove empty lists
   aList <- jamba::rmNULL(list(...));
   if (length(aList) == 1 && class(aList[[1]]) %in% "list") {
      aList <- aList[[1]];
   }
   ## do.call will automatically repeat any vector to fill each row
   ## up to the maximum number of columns.
   if (length(unique(lengths(aList))) > 1) {
      ## Unequal lengths, to avoid warning should we expand them?
   }
   aMatrix <- do.call(rbind, aList);
   newVector <- as.vector(aMatrix);

   ## The special case where intercalating two vectors,
   ## where the second vector has one fewer entry, we
   ## will not repeat the last entry.
   ## E.g.
   ## c("A","A","A")
   ## c("B","B")
   ##
   ## desired output is
   ## c("A","B","A","B","A")
   if (length(aList) == 2 && length(aList[[1]]) == (length(aList[[2]]) + 1)) {
      newVector <- head(newVector, -1);
   }
   return(newVector);
}


#' Convert list to incidence matrix
#'
#' Convert list to incidence matrix
#'
#' @param setlist `list` of vectors
#' @param empty default single value used for empty/missing entries.
#' @param do_sparse `logical` indicating whether to convert output
#'    to `ngCMatrix` which is best for extremely large incidence
#'    matrix data.
#' @param ... additional arguments are ignored.
#'
#' @export
list2im_opt <- function
(setlist,
 empty=0,
 do_sparse=TRUE,
 ...)
{
   setnamesunion <- Reduce("union", setlist);
   if (length(empty) == 0) {
      empty <- NA;
   } else {
      empty <- head(empty, 1);
   }
   setlistim <- do.call(cbind, lapply(setlist, function(i){
      i_match <- match(i, setnamesunion);
      j <- rep(empty,
         length(setnamesunion));
      j[i_match] <- 1;
      j;
   }))
   rownames(setlistim) <- setnamesunion;
   if (do_sparse && suppressPackageStartupMessages(require(Matrix))) {
      setlistim <- as(setlistim, "ngCMatrix");
   }
   return(setlistim);
}
jmw86069/jamses documentation built on May 31, 2024, 1:36 p.m.