R/genejam-freshen.R

#' Freshen gene annotations using Bioconductor annotation data
#' 
#' Freshen gene annotations using Bioconductor annotation data
#' 
#' This function takes a vector or `data.frame` of gene symbols,
#' and uses Bioconductor annotation methods to find the most current
#' official gene symbol.
#' 
#' The annotation process runs in two basic steps:
#'
#' 1. **Convert the input gene to Entrez gene ID**.
#' 2. **Convert Entrez gene ID to official gene symbol**.
#'
#' ## Step 1. Convert to Entrez gene ID
#'
#' The first step uses an ordered list of annotations,
#' with the assumption that the first match is usually the best,
#' and most specific. By default, the order is:
#'
#' * `"org.Hs.egSYMBOL2EG"` -- almost always 1-to-1 match
#' * `"org.Hs.egACCNUM2EG"` -- mostly a 1-to-1 match
#' * `"org.Hs.egALIAS2EG"` -- sometimes a 1-to-1 match, sometimes 1-to-many
#'
#' When multiple Entrez gene ID values are matched, they are all
#' retained. See argument `handle_multiple` for custom options.
#'
#' ## Step 2. Use Entrez gene ID to return official annotation
#'
#' The second step converts the Entrez gene ID (or multiple IDs)
#' to the official gene symbol, by default using `"org.Hs.egSYMBOL"`.
#'
#' The second step may optionally include multiple annotation types,
#' each of which will be returned. Some common examples:
#'
#' * `"org.Hs.egSYMBOL"` -- official Entrez gene symbol
#' * `"org.Hs.egALIAS"` -- set of recognized aliases for an Entrez gene.
#' * `"org.Hs.egGENENAME"` -- official Entrez long gene name
#'
#' For each step, the annotation matched can be returned, as an audit
#' trail to see which annotation was available for each input entry.
#' 
#' Note that if the input data already contains Entrez gene ID
#' values, you can define that colname with argument `intermediate`.
#' 
#' ## Case-insensitive search
#' 
#' For case-insensitive search, which is particularly useful in non-human
#' organisms because they often use mixed-case, use the argument
#' `ignore.case=TRUE`. In our benchmark tests it appears to add roughly
#' 0.1 seconds per annotation, regardless of the number of input entries.
#' This appears to be the time it takes to spool the list of annotation
#' keys stored in the SQLite database, and may therefore be dependent upon
#' the size of the annotation file.
#' 
#' @return `data.frame` with one or more columns indicating the input
#' data, then a column `"intermediate"` containing the Entrez gene ID
#' that was matched, then one column for each item in `final`,
#' by default `"SYMBOL"`.
#' 
#' @family genejam
#' 
#' @param x character vector or `data.frame` with one or most columns
#'    containing gene symbols.
#' @param ann_lib character vector indicating the name or names of the
#'    Bioconductor annotation library to use when looking up
#'    gene nomenclature.
#' @param try_list character vector indicating one or more names of
#'    annotations to use for the input gene symbols in `x`. The
#'    annotation should typically return the Entrez gene ID, usually
#'    given by `'2EG'` at the end of the name. For example `SYMBOL2EG`
#'    will be used with ann_lib `"org.Hs.eg.db"` to produce annotation
#'    name `"org.Hs.egSYMBOL2EG"`. Note that when the `'2EG'` form of
#'    annotation does not exist (or another suitable suffix defined in 
#'    argument `"revmap_suffix"` in `get_anno_db()`), it will be derived
#'    using `AnnotationDbi::revmap()`. For example if `"org.Hs.egALIAS"`
#'    is requested, but only `"org.Hs.egALIAS2EG"` is available, then
#'    `AnnotationDbi::revmap(org.Hs.egALIAS2EG)` is used to create the
#'    equivalent of `"org.Hs.egALIAS"`.
#' @param final character vector to use for the final conversion
#'    step. When `final` is `NULL` no conversion is performed.
#'    When `final` contains multiple values, each value is returned
#'    in the output. For example, `final=c("SYMBOL","GENENAME")` will
#'    return a column `"SYMBOL"` and a column `"GENENAME"`.
#' @param split character value used to separate delimited values in `x`
#'    by the function `base::strsplit()`. The default will split values
#'    separated by comma `,` semicolon `;` or forward slash `/`, and will
#'    trim whitespace before and after these delimiters.
#' @param sep character value used to concatenate multiple entries in
#'    the same field. The default `sep=","` will comma-delimit multiple
#'    entries in the same field.
#' @param handle_multiple character value indicating how to handle multiple
#'    values: `"first_hit"` will query each column of `x` until it finds the
#'    first possible returning match, and will ignore all subsequent possible
#'    matches for that row in `x`. For example, if one row in `x` contains
#'    multiple values, only the first match will be used. `"first_try"`
#'    will return the first match from `try_list` for all columns in `x`
#'    that contain a match. For example, if one row in `x` contains two
#'    values, the first match from `try_list` using one or both columns in
#'    `x` will be maintained. Subsequent entries in `try_list` will not be
#'    attempted for rows that already have a match. `"all"` will return all
#'    possible matches for all entries in `x` using all items in `try_list`.
#' @param empty_rule character value indicating how to handle entries which
#'    did not have a match, and are therefore empty: `"original"` will use
#'    the original entry as the output field; `"empty"` will leave the
#'    entry blank.
#' @param include_source logical indicating whether to include a column
#'    that shows the colname and source matched. For example, if column
#'    `"original_gene"` matched `"SYMBOL2EG"` in `"org.Hs.eg.db"` there
#'    will be a column `"found_source"` with value
#'    `"original_gene.org.Hs.egSYMBOL2EG"`.
#' @param protect_inline_sep logical indicating whether to
#'    protect inline characters in `sep`, to prevent them from
#'    being used to split single values into multiple values.
#'    For example, `"GENENAME"` returns the full gene name, which
#'    often contains comma `","` characters. These commas do
#'    not separate multiple separate values, so they should not be
#'    used to split a string like `"H4 clustered histone 10, pseudogene"`
#'    into two strings `"H4 clustered histone 10"` and `"pseudogene"`.
#' @param intermediate `character` string with colname in `x` that
#'    contains intermediate values. These values are expected from output
#'    of the first step in the workflow, for example `"SYMBOL2EG"`
#'    returns Entrez gene values, so if the input `x` already contains
#'    some of these values in a column, assign that colname to
#'    `intermediate`.
#' @param ignore.case `logical` indicating whether to use
#'    case-insensitive matching when `ignore.case=TRUE`, otherwise
#'    the default `ignore.case=FALSE` will perform default `mget()`
#'    which requires the upper and lowercase characters are
#'    an identical match. When `ignore.case=TRUE` this function
#'    calls `genejam::imget()`.
#' @param verbose logical indicating whether to print verbose output.
#' 
#' @examples
#' if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
#'    cat("\nBasic usage\n");
#'    print(freshenGenes(c("APOE", "CCN2", "CTGF")));
#' }
#' 
#' if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
#'    ## Optionally show the annotation source matched
#'    cat("\nOptionally show the annotation source matched\n");
#'    print(freshenGenes(c("APOE", "CCN2", "CTGF"), include_source=TRUE));
#' }
#' 
#' if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
#'    ## Show comma-delimited genes
#'    cat("\nInput genes are comma-delimited\n");
#'    print(freshenGenes(c("APOE", "CCN2", "CTGF", "CCN2,CTGF")));
#' }
#' 
#' if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
#'    ## Optionally include more than SYMBOL in the output
#'    cat("\nCustom output to include SYMBOL, ALIAS, GENENAME\n");
#'    print(freshenGenes(c("APOE", "HIST1H1C"),
#'       final=c("SYMBOL", "ALIAS", "GENENAME")));
#' }
#' 
#' if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
#'    ## More advanced, match affymetrix probesets
#'    if (suppressPackageStartupMessages(require(hgu133plus2.db))) {
#'       cat("\nAdvanced example including Affymetrix probesets.\n");
#'       print(freshenGenes(c("227047_x_at","APOE","HIST1H1D","NM_003166,U08032"),
#'          include_source=TRUE,
#'          try_list=c("hgu133plus2ENTREZID","REFSEQ2EG","SYMBOL2EG","ACCNUM2EG","ALIAS2EG"),
#'          final=c("SYMBOL","GENENAME")))
#'    }
#' }
#' 
#' @export
freshenGenes <- function
(x,
 ann_lib=c("","org.Hs.eg.db"),
 try_list=c("SYMBOL2EG", "ACCNUM2EG", "ALIAS2EG"),
 final=c("SYMBOL"),
 split="[ ]*[,/;]+[ ]*",
 sep=",",
 handle_multiple=c("first_try", "first_hit", "all", "best_each"),
 empty_rule=c("empty", "original", "na"),
 include_source=FALSE,
 protect_inline_sep=TRUE,
 intermediate="intermediate",
 ignore.case=FALSE,
 verbose=FALSE,
 ...)
{
   ###
   handle_multiple <- match.arg(handle_multiple);
   empty_rule <- match.arg(empty_rule);
   if (length(ann_lib) == 0) {
      ann_lib <- "";
   }
   test_ann_lib <- setdiff(ann_lib, c(NA,""));
   if (length(test_ann_lib) > 0) {
      for (test_lib in test_ann_lib) {
         if (!suppressPackageStartupMessages(require(test_lib, character.only=TRUE))) {
            stop(paste0("Package '", test_lib, "' is not available."));
         }
      }
   }
   if (is.atomic(x)) {
      x <- data.frame(input=as.character(x),
         stringsAsFactors=FALSE,
         check.names=FALSE);
   }
   if (length(colnames(x)) == 0) {
      colnames(x) <- jamba::makeNames(rep("input", ncol(x)));
   }
   
   ## colnames_x are the colnames(x) that are not intermediate
   intermediate_source <- paste0(intermediate, "_source");
   colnames_x <- setdiff(colnames(x),
      c(intermediate, intermediate_source));
   ## ncol_x is the number of columns that are not intermediate
   ncol_x <- length(colnames_x);
   
   ## handle_multiple="best_each"
   if ("best_each" %in% handle_multiple) {
      if (verbose) {
         jamba::printDebug("freshenGenes(): ",
            "handle_multiple:",
            handle_multiple);
      }
      if (ncol_x == 1) {
         ## Check for delimited values
         if (length(split) > 0 && nchar(split) > 0 && jamba::igrepHas(split, x[[colnames_x]])) {
            ## Split the input by delimiter
            taller_list <- strsplit(x[[colnames_x]], split);
            # 30mar2021: fill NULL with "" so the original empty entry is not lost
            taller_list[lengths(taller_list) == 0] <- "";
            ## Make a vector and associated factor to split back into a list
            taller_idx <- rep(seq_along(taller_list),
               lengths(taller_list));
            taller_factor <- rep(factor(seq_along(taller_list)),
               lengths(taller_list));
            taller_x <- x[taller_idx,,drop=FALSE];
            taller_x[[colnames_x]] <- unlist(taller_list);
            #taller_vector <- unlist(taller_list);
         } else {
            taller_x <- x;
            #taller_vector <- x[[colnames_x]];
            taller_factor <- NULL;
         }
         ## run freshenGenes()
         ## empty_rule="na" here so blank entries will get dropped
         ## then we can replace as needed later
         taller_freshened <- freshenGenes(x=taller_x,
            ann_lib=ann_lib,
            try_list=try_list,
            final=final,
            split=split,
            sep=sep,
            handle_multiple="first_try",
            empty_rule="na",
            include_source=include_source,
            protect_inline_sep=protect_inline_sep,
            intermediate=intermediate,
            ignore.case=ignore.case,
            verbose=FALSE);
         ## Split back into the original vector
         if (length(taller_factor) == 0) {
            return(taller_freshened);
         }
         final_colnames <- colnames(taller_freshened);
         x_new <- do.call(cbind, lapply(jamba::nameVector(final_colnames), function(i){
            taller_freshened_split <- split(taller_freshened[[i]],
               taller_factor);
            ## comma-delimit using only unique entries
            idf <- data.frame(check.names=FALSE,
               stringsAsFactors=FALSE,
               output=jamba::cPasteU(taller_freshened_split,
                  sep=sep,
                  na.rm=TRUE));
            colnames(idf) <- i;
            idf;
         }));
         return(x_new);
      }
   }
   
   ## Expand columns containing delimited values if necessary
   # This step makes multiple values appear in separate columns
   # on the same row.
   if (length(split) > 0 && nchar(split) > 0) {
      x <- data.frame(stringsAsFactors=FALSE,
         check.names=FALSE,
         do.call(cbind,
            lapply(jamba::nameVector(colnames(x)), function(i){
               ix <- as.character(x[[i]]);
               # only split delimited values when it is not intermediate
               # note that empty entries are filled with ""
               if (i %in% colnames_x) {
                  if (jamba::igrepHas(split, ix)) {
                     ix <- jamba::rbindList(
                        jamba::rmNULL(strsplit(as.character(ix), split),
                           nullValue=""));
                     colnames(ix) <- jamba::makeNames(rep(i, ncol(ix)));
                  }
               }
               ix;
            })
         )
      );
   }
   # updated to ignore intermediate and intermediate_source
   xnames <- setdiff(colnames(x),
      c(intermediate, intermediate_source));

   if (length(try_list) > 0) {
      if (!intermediate %in% colnames(x)) {
         x[[intermediate]] <- rep("", nrow(x));
      }
      if (!intermediate_source %in% colnames(x)) {
         x[[intermediate_source]] <- rep("", nrow(x));
         if ("first_try" %in% handle_multiple) {
            x[["found_try"]] <- rep(TRUE, nrow(x));
         }
      }
   }

   for (itry in try_list) {
      for (iann in ann_lib) {
         if (verbose) {
            jamba::printDebug("freshenGenes(): ",
               "iann:'",
               iann, "'");
         }
         if ("character" %in% class(itry)) {
            itryname <- paste0(gsub("[.]db$", "", iann), itry);
            if (verbose) {
               jamba::printDebug("freshenGenes(): ",
                  "itryname:'",
                  itryname, "'");
            }
            ienv <- get_anno_db(itryname,
               verbose=verbose,
               ignore.case=FALSE,
               ...);
            itryname <- attr(ienv, "annoname");
         } else {
            if (verbose) {
               jamba::printDebug("freshenGenes(): ",
                  "Using ann_lib entry as-is");
            }
            ienv <- get_anno_db(itry,
               verbose=verbose,
               ignore.case=FALSE,
               ...);
            itryname <- attr(ienv, "annoname");
         }
         if (length(ienv) == 0) {
            if (verbose) {
               jamba::printDebug("freshenGenes(): ",
                  "   Skipping", fgText=c("darkorange1", "red"));
            }
            next;
         }
         
         for (iname in xnames) {
            if (verbose) {
               jamba::printDebug("freshenGenes(): ",
                  "   iname:",
                  iname);
            }
            ifound <- x[[intermediate]];
            ifound_source <- x[[intermediate_source]];
            if ("first_hit" %in% handle_multiple) {
               ## input must have characters
               ## must have no found result
               ido <- (genejam::is_empty(ifound) &
                     !genejam::is_empty(x[[iname]]))
            } else if ("first_try" %in% handle_multiple) {
               ## input must have characters
               ## previous try must have no result
               ido <- (x[["found_try"]] &
                     !genejam::is_empty(x[[iname]]));
            } else {
               ## input must have characters, and not be empty
               ido <- !genejam::is_empty(x[[iname]]);
            }
            ix <- x[[iname]][ido];
            if (length(ix) == 0) {
               if (verbose) {
                  jamba::printDebug("freshenGenes(): ",
                     "      Skipping because ", "0", " entries to query.",
                     fgText=c("darkorange1", "red"));
               }
               next;
            }
            if (verbose) {
               jamba::printDebug("freshenGenes(): ",
                  "      Querying ",
                  jamba::formatInt(length(ix)),
                  " entries.",
                  fgText=c("darkorange1", "aquamarine3"));
            }
            ixu <- jamba::rmNA(as.character(unique(ix)));
            if (ignore.case) {
               ivals_l <- imget(ixu,
                  ienv,
                  ifnotfound=NA);
            } else {
               ivals_l <- AnnotationDbi::mget(ixu,
                  ienv,
                  ifnotfound=NA);
            }
            if (verbose) {
               jamba::printDebug("freshenGenes(): ",
                  "      Complete.");
            }
            ## Data cleaning step to protect values which have delimiters
            if (protect_inline_sep && jamba::igrepHas(sep, unlist(ivals_l))) {
               ## convert to dummy '!:!'
               if (verbose) {
                  jamba::printDebug("freshenGenes(): ",
                     "Converted intermediate '", 
                     sep, 
                     "' to '", 
                     "!:!", 
                     "'");
               }
               ivals_l <- lgsub(sep, "!:!", ivals_l);
            }
            
            ivals <- jamba::cPaste(ivals_l,
               sep=sep,
               na.rm=TRUE);

            names(ivals) <- ixu;
            ivals <- ivals[!is.na(ivals)];
            ixnew <- ivals[match(ix, names(ivals))];
            ixdo <- (nchar(ixnew) > 0);
            
            iname_tryname <- itryname;
            if (any(c("first_try", "all") %in% handle_multiple)) {
               # fix slight bug in handling ifound_source=""
               # that would create output like ",source_name"
               ifound_source[ido][ixdo] <- ifelse(
                  nchar(ifound[ido][ixdo]) == 0,
                  iname_tryname,
                  ifelse(nchar(ifound_source[ido][ixdo]) == 0,
                     iname_tryname,
                     paste0(ifound_source[ido][ixdo], sep, iname_tryname)));
               ifound[ido][ixdo] <- ifelse(
                  nchar(ifound[ido][ixdo]) == 0,
                  ixnew[ixdo],
                  paste0(ifound[ido][ixdo], sep, ixnew[ixdo]));
            } else {
               ifound[ido][ixdo] <- ixnew[ixdo];
               ifound_source[ido][ixdo] <- iname_tryname;
            }
            x[[intermediate]] <- ifound;
            x[[intermediate_source]] <- ifound_source;
         }
         if ("first_try" %in% handle_multiple) {
            isempty <- genejam::is_empty(x[[intermediate]]);
            x[["found_try"]][!isempty] <- FALSE;
         }
      }
   }
   
   ###################################
   ## Remove found_try column
   if ("first_try" %in% handle_multiple) {
      x <- x[,setdiff(colnames(x), "found_try"),drop=FALSE];
   }
   
   ###################################
   ## Optionally remove found_source
   if (!include_source) {
      x <- x[,setdiff(colnames(x), intermediate_source),drop=FALSE];
   }
   
   ###################################
   ## Make intermediate values unique
   ## also this step sorts delimited values
   if (intermediate %in% colnames(x) && any(nchar(x[[intermediate]]) > 0)) {
      xfoundu <- unique(x[[intermediate]]);
      if (length(split) > 0 && nchar(split) > 0) {
         xfounduv <- jamba::cPasteSU(strsplit(xfoundu, split),
            na.rm=TRUE,
            sep=sep);
      } else {
         xfounduv <- xfoundu;
      }
      x[[intermediate]] <- xfounduv[match(x[[intermediate]], xfoundu)];
   }
   
   ## Revert protected sep values
   if (protect_inline_sep && jamba::igrepHas("!:!", x[[intermediate]])) {
      ## convert from dummy '!:!' to sep
      if (verbose) {
         jamba::printDebug("freshenGenes(): ",
            "Converted intermediate '", 
            "!:!", 
            "' back to '", 
            sep, 
            "'");
      }
      x[[intermediate]] <- gsub("!:!", sep, x[[intermediate]]);
   }
   
   ## make found_source values unique, if they are retained in output
   if (include_source && intermediate_source %in% colnames(x)) {
      xfoundsu <- unique(x[[intermediate_source]]);
      if (length(split) > 0 && nchar(split) > 0) {
         xfoundsuv <- jamba::cPasteU(strsplit(xfoundsu, split),
            na.rm=TRUE,
            sep=sep);
      } else {
         xfoundsuv <- xfoundsu;
      }
      x[[intermediate_source]] <- xfoundsuv[match(x[[intermediate_source]], xfoundsu)];
   }
   if (verbose) {
      jamba::printDebug("freshenGenes(): ",
         "head(x, 10):");
      print(head(x, 10));
   }
   
   ###################################
   ## final arrangement of columns
   ## 30mar2021: changed to stop using "found"
   ## which required renaming to "intermediate"
   ## and use intermediate directly
   if (length(final) > 0) {
      #xnames <- colnames(x);
      #xnames <- jamba::makeNames(
      #   gsub("^found", 
      #      "intermediate",
      #      xnames),
      #   renameFirst=FALSE);
      #colnames(x) <- xnames;
      
      ## LOC# recovery for entries that have no intermediate
      # 30mar2021 this change in isempty should be slightly faster
      #isempty <- (nchar(jamba::rmNA(naValue="", x[[intermediate]])) == 0);
      #isempty <- (is.na(x[[intermediate]]) | nchar(x[[intermediate]]) == 0);
      isempty <- genejam::is_empty(x[[intermediate]]);
      if (any(isempty) && length(xnames) > 0) {
         xnames1 <- head(xnames, 1);
         isloc <- grepl("^LOC[0-9]+$", x[[xnames1]][isempty]);
         if (any(isloc)) {
            if (verbose) {
               jamba::printDebug("freshenGenes(): ",
                  "Converting ",
                  jamba::formatInt(sum(isloc)),
                  " entries with format ",
                  "'LOC1234567'",
                  " and no intermediate, to: ",
                  "'1234567'", " format.");
            }
            # replace "LOC211052" with "211052"
            x[[intermediate]][isempty][isloc] <- gsub("^LOC",
               "",
               x[[xnames1]][isempty][isloc]);
         }
      }
      if (verbose) {
         jamba::printDebug("freshenGenes(): ",
            "Processing final data.frame, head(x, 10):");
         print(head(x, 10));
      }
      for (i in final) {
         if (verbose) {
            jamba::printDebug("freshenGenes(): ",
               "Processing final:", i);
         }
         # note intermediate=i and final=NULL
         # means the return data will have colname i
         x1 <- freshenGenes(x[[intermediate]],
            sep=sep,
            split=sep,
            handle_multiple=handle_multiple,
            ann_lib=ann_lib,
            try_list=i,
            final=NULL,
            include_source=FALSE,
            intermediate=i,
            verbose=verbose > 1,
            ...);
         x[[i]] <- x1[[i]];
         # This step should not be necessary for "final"
         #if (include_source) {
         #   i_source <- paste0(i, "_source");
         #   x[[i_source]] <- x1[[i_source]];
         #}
      }
      if ("original" %in% empty_rule) {
         ifinal <- head(final, 1);
         # 30mar2021 this change in isempty should be slightly faster
         #isempty <- (nchar(jamba::rmNA(naValue="", x[[ifinal]])) == 0);
         #isempty <- (is.na(x[[intermediate]]) | nchar(x[[intermediate]]) == 0);
         isempty <- genejam::is_empty(x[[intermediate]]);
         x[[ifinal]][isempty] <- x[[1]][isempty];
      } else if ("na" %in% empty_rule) {
         ifinal <- head(final, 1);
         # 30mar2021 this change in isempty should be slightly faster
         #isempty <- (nchar(jamba::rmNA(naValue="", x[[ifinal]])) == 0);
         #isempty <- (is.na(x[[intermediate]]) | nchar(x[[intermediate]]) == 0);
         isempty <- genejam::is_empty(x[[intermediate]]);
         x[[ifinal]][isempty] <- NA;
      }
   }
   
   return(x);
}

#' Get annotation database or environment
#' 
#' Get annotation database or environment
#' 
#' This function is a simple wrapper function that takes either an
#' annotation data name, for example from the `AnnotationDbi` package,
#' or an annotation object, and returns the annotation object.
#' 
#' In the event the annotation object must be derived using
#' `AnnotationDbi::revmap()`, then that process is performed, and the
#' reverse mapped annotation object is returned.
#' 
#' When invoking with argument `ignore.case=TRUE`, this function
#' returns keys that have been converted to lowercase. The process
#' can be fairly slow (~5 seconds per human genome annotation),
#' but results in an annotation that can be used directly and
#' may be faster for repeated use than calling `imget()`.
#' 
#' The alternative to `ignore.case=TRUE` is to call `imget()` for
#' direct query of an annotation, or call `freshenGenes(..., ignore.case=TRUE)`
#' which will call `imget()` internally.
#' In our benchmark tests, using `imget()` appears to add roughly
#' 0.1 seconds per query, regardless of the number of input entries.
#' This appears to be the time it takes to spool the list of annotation
#' keys stored in the SQLite database, and may therefore be dependent upon
#' the size of the annotation file.
#' 
#' @family genejam
#' 
#' @param x character name of an annotation object, or an annotation
#'    object itself.
#' @param revmap_suffix character string indicting the expected suffix
#'    that can be used to create reverse-mapped annotation data, for
#'    example the suffix `"2EG"` is used to indicate that annotation
#'    returns Entrez gene. When annotation does not contain this suffix,
#'    the annotation is reverse-mapped using `AnnotationDbi::revmap()`.
#' @param ignore.case `logical` indicating whether to return an environment
#'    after converting all keys to lowercase, which is one implementation
#'    choice to provide case-insensitive output while using standard
#'    `mget()`. This option may not be ideal, see description for defailts.
#'    In order to fulfill the potential, the subsequent `mget()` must also
#'    use `tolower(x)` on the input character vector. Note that this
#'    option is currently fairly slow, and uses more memory while the
#'    environment is loaded.
#' @param verbose logical indicating whetheer to print verbose output.
#' @param ... additional arguments are ignored.
#' 
#' @export
get_anno_db <- function
(x,
 revmap_suffix=c("2EG", "2ENTREZID", "2NAME"),
 ignore.case=FALSE,
 verbose=FALSE,
 ...)
{
   #
   if ("character" %in% class(x)) {
      if (verbose) {
         jamba::printDebug("get_anno_db(): ",
            x);
      }
      flip_itryname <- function(x, revmap_suffix, verbose=FALSE) {
         ## This function tests if the reciprocal name exists,
         ## and if so it returns that name.
         ## Otherwise it returns NULL.
         itryname <- NULL;
         if (length(revmap_suffix) > 0 && any(nchar(revmap_suffix) > 0)) {
            revmap_suffix <- revmap_suffix[nchar(revmap_suffix) > 0];
            revmap_anygrep <- paste0("(",
               jamba::cPaste(revmap_suffix,
                  na.rm=TRUE,
                  sep="|"), 
               ")$");
            if (jamba::igrepHas(revmap_anygrep, x)) {
               ## one of the revmap extensions exists as a suffix, remove it
               itryname <- gsub(revmap_anygrep, "", x);
               if (!better_exists(itryname)) {
                  itryname <- NULL;
               }
               return(itryname);
            }
            for (revmap_suffixi in revmap_suffix) {
               itryname <- paste0(x, revmap_suffixi);
               if (verbose) {
                  jamba::printDebug("flip_itryname(): ",
                     "itryname:",
                     itryname);
               }
               if (better_exists(itryname)) {
                  return(itryname);
               }
               itryname <- NULL;
            }
         }
         return(itryname);
      }
      if (better_exists(x)) {
         ## If the name exists, return it directly
         itry <- better_get(x);
      } else {
         ## If the name does not exist, test for the reciprocal name
         reciprocal_x <- flip_itryname(x,
            revmap_suffix=revmap_suffix,
            verbose=verbose);
         if (length(reciprocal_x) > 0) {
            itry <- AnnotationDbi::revmap(better_get(reciprocal_x));
         } else {
            if (verbose) {
               jamba::printDebug("get_anno_db(): ",
                  "Not found on the search path.");
            }
            return(NULL);
         }
      }
      attr(itry, "annoname") <- x;
   } else {
      itrynames <- jamba::provigrep(c("objTarget", "objName"),
         slotNames(x));
      itryname <- paste(
         unlist(lapply(itrynames, function(i){
            slot(x, i)
         })),
         collapse=".");
      attr(x, "annoname") <- itryname;
      itry <- x;
   }
   if (ignore.case) {
      itry_l <- as.list(itry);
      names(itry_l) <- tolower(names(itry_l));
      itry <- as.environment(itry_l);
      rm(itry_l);
   }
   return(itry);
}

#' Better exists()
#' 
#' Better exists()
#' 
#' @family jam utility functions
#' 
#' This function is a lightweight enhancement of `base::exists()`
#' that accepts a package prefix in the object name, for example
#' `"base::exists"`.
#' 
#' This function recognizes a package prefix `"packagename::"`
#' and uses it to determine the correct value for argument `where`.
#' If the package is not present in `search()` using the
#' form `"package:packagename"` then an error is thrown.
#' Otherwise if the package is on the search path, this
#' function simply calls `base::exists(x, where=posnum, ...)`.
#' 
#' If the input `x` does not contain a package prefix, then
#' this function simply calls `base::exists(x)`
#' for the default behavior.
#' 
#' @return logical vector with length equal to `length(x)`. Note that
#'    when a package prefix is supplied, when the package is not on
#'    the search path this function returns `FALSE`, and does not
#'    throw an error.
#' 
#' @param x character vector length 1 or more, indicating the
#'    object names, with or without package prefix.
#' @param where,mode,inherits arguments passed to `base::exists()`.
#'    Note that arguments `envir` and `frame` use the defaults,
#'    and therefore may not be compatible with using input `x`
#'    with more than one value.
#' @param ... additional arguments are passed to `base::exists()`.
#' 
#' #' @examples
#' exists("exists", where="package:base")
#' 
#' better_exists("base::exists")
#' 
#' @export
better_exists <- function
(x,
 where=-1,
 mode="any",
 inherits=TRUE,
 verbose=FALSE,
 ...)
{
   ##
   if (length(x) > 1) {
      where <- rep(where, length.out=length(x));
      mode <- rep(mode, length.out=length(x));
      inherits <- rep(inherits, length.out=length(x));
      be_result <- sapply(seq_along(x), function(xi){
         better_exists(x[[xi]],
            where=where[[xi]],
            mode=mode[[xi]],
            inherits=inherits[[xi]],
            ...);
      });
      return(be_result);
   }
   if (jamba::igrepHas("^.+::.+$", x)) {
      xpackage <- paste0("package:", gsub("^(.+)::(.+)$", "\\1", x));
      xbase <- gsub("^(.+)::(.+)$", "\\2", x);
      xpos <- jamba::rmNA(match(xpackage, search()));
      if (verbose) {
         jamba::printDebug("better_exists(): ",
            "xpackage:", xpackage);
         jamba::printDebug("better_exists(): ",
            "xbase:", xbase);
         jamba::printDebug("better_exists(): ",
            "xpos:", xpos);
      }
      if (length(xpos) == 0) {
         return(FALSE);
         stop(paste0("better_exists(): There is no package called '", 
            xpackage, 
            "' on the search() list."));
      }
      base::exists(xbase,
         where=xpos, 
         mode=mode,
         inherits=inherits,
         ...);
   } else {
      base::exists(x,
         where=where,
         mode=mode,
         inherits=inherits,
         ...);
   }
}

#' Better get()
#' 
#' Better get()
#' 
#' @family jam utility functions
#' 
#' This function is a lightweight enhancement of `base::get()`
#' that accepts a package prefix in the object name, for example
#' `"base::exists"`.
#' 
#' This function recognizes a package prefix `"packagename::"`
#' and uses it to determine the correct value for argument `where`.
#' If the package is not present in `search()` using the
#' form `"package:packagename"` then an error is thrown.
#' Otherwise if the package is on the search path, this
#' function simply calls `base::get(x, pos=posnum, ...)`.
#' 
#' If the input `x` does not contain a package prefix, then
#' this function simply calls `base::get()`
#' for the default behavior.
#' 
#' @return the R object found. If not object is found an error results.
#' 
#' @param character string with an R object name, with or without
#'    an R package prefix. Note that only one value is recognized.
#' @param pos,mode,inherits arguments passed to `base::get()`. Note
#'    that argument `envir` is not passed to `base::get()`, since
#'    it is typically defined dynamically by that function.
#' @param ... additional arguments are passed to `base::get()`. The
#'    only additional argument is `envir` which is not recommended,
#'    but is here for compatibility with the base functionality.
#' 
#' @examples 
#' get("get", pos="package:base")
#' 
#' better_get("base::get")
#' 
#' @export
better_get <- function
(x,
 pos=-1L,
 mode="any",
 inherits=TRUE,
 ...)
{
   if (jamba::igrepHas("^.+::.+$", x)) {
      xpackage <- paste0("package:", gsub("^(.+)::(.+)$", "\\1", x));
      xbase <- gsub("^(.+)::(.+)$", "\\2", x);
      xpos <- match(xpackage, search());
      if (length(xpos) == 0) {
         stop(paste0("better_get(): There is no package called '", 
            xpackage, 
            "' on the search() list."));
      }
      base::get(xbase,
         pos=xpos, 
         mode=mode,
         inherits=inherits,
         ...);
   } else {
      base::get(x,
         pos=pos,
         mode=mode,
         inherits=inherits,
         ...);
   }
}

#' Pattern replacement in a list of character vectors
#' 
#' Pattern replacement in a list of character vectors
#' 
#' This function is a simple wrapper around `base::gsub()` except
#' it operates on a list.
#' 
#' Note that this function assumes the input data contains vectors
#' and not embedded list objects.
#' 
#' @family jam list functions
#' 
#' @param pattern,replacement,ignore.case,perl,fixed,useBytes all
#'    arguments are passed to `base::gsub()` after `x` is converted
#'    to a character vector.
#' @param x `list` object that contains character vectors.
#' @param ... additional arguments are ignored.
#' 
#' @examples
#' 
#' x <- list(a=c("A", "B"), b=c("C,D"));
#' lgsub(",", "!:!", x)
#' 
#' @export
lgsub <- function
(pattern,
 replacement,
 x,
 ignore.case=FALSE,
 perl=FALSE,
 fixed=FALSE,
 useBytes=FALSE,
 ...)
{
   ## Expand x
   xlen <- lengths(x);
   xsplit <- rep(seq_along(x), xlen);
   xexp <- unname(unlist(x));
   xexp_new <- gsub(pattern=pattern,
      replacement=replacement,
      x=xexp,
      perl=perl,
      fixed=fixed,
      useBytes=useBytes);
   x_out <- split(xexp_new, xsplit);
   if (length(names(x)) > 0) {
      names(x_out) <- names(x);
   }
   return(x_out);
}

#' Case-insensitive mget()
#' 
#' Case-insensitive mget()
#' 
#' This function is a lightweight wrapper around `base::mget()`
#' (and generics) that intends to allow case-insensitive matching.
#' It does so by converting all keys to lowercase, matching
#' lowercase input to these lowercase keys, then using the original
#' keys in native `base::mget()`.
#' 
#' One small change from `base::mget()` is the default
#' argument `ifnotfound=NA`.
#' 
#' This function secretly runs `mget()` using the unique lowercase
#' input values `x`, to reduce the number of queries. This implementation
#' is designed to help with extremely long and potentially highly duplicated
#' input values in `x`, in which case the change greatly reduces the time
#' to return results.
#' 
#' Note: This function returns the first matching lowercase
#' key, with the direct assumption that keys will not be duplicated
#' after converting to lowercase. Should this assumption become a
#' problem, please provide feedback and we will change the method
#' accordingly.
#' 
#' Note: For unknown reasons, the R method dispatch was not
#' behaving properly for objects of class `"AnnDbBimap"`, presumably
#' because the generic functions `AnnotationDbi::ls()` and
#' `AnnotationDbi::mget()` were written for class `"Bimap"`.
#' So when the input `envir` class contains `"Bimap"` the
#' direct function `AnnotationDbi::keys()` is called, and if it
#' fails for some reason, `AnnotationDbi::ls()` is called,
#' thes `AnnotationDbi::mget()` is called, otherwise the generic
#' `ls()` or `mget()` is called.
#' 
#' @return named `list` of objects found, or `NA` for objects
#'    that are not found.
#' 
#' @family jam utility functions
#' 
#' @param x character vector of object names.
#' @param envir,mode,ifnotfound,inherits arguments are passed to
#'    `base:mget()`.
#' 
#' @export
imget <- function
(x,
 envir=as.environment(-1L),
 mode="any",
 ifnotfound=NA,
 inherits=FALSE,
 verbose=FALSE,
 ...)
{
   ##
   if (jamba::igrepHas("Bimap", class(envir))) {
      keys <- tryCatch({
         AnnotationDbi::keys(envir,
            ...);
      }, error=function(e){
         AnnotationDbi::ls(envir);
      });
   } else {
      keys <- ls(envir);
   }
   xl <- tolower(x);
   xlu <- unique(xl);
   xmatch <- match(xl, xlu);
   
   ## Prepare an empty output list
   valuesu <- as.list(rep(NA, length(xlu)));
   names(valuesu) <- xlu;
   
   ## Match unique lowercase input to lowercase keys
   keymatch <- match(xlu, tolower(keys));
   keysfound <- !is.na(keymatch);
   if (any(keysfound)) {
      if (jamba::igrepHas("Bimap", class(envir))) {
         valuesfound <- AnnotationDbi::mget(keys[keymatch[keysfound]],
            envir, 
            ifnotfound=ifnotfound,
            inherits=inherits);
      } else {
         valuesfound <- mget(keys[keymatch[keysfound]],
            envir, 
            ifnotfound=ifnotfound,
            inherits=inherits);
      }
      valuesu[keysfound] <- valuesfound;
   }
   values <- valuesu[xmatch];
   values;
}
jmw86069/genejam documentation built on Sept. 19, 2022, 1:53 p.m.