topEnrichBySource: Subset enrichList for top enrichment results by source

topEnrichBySource    R Documentation

Subset enrichList for top enrichment results by source

Description

Subset enrichList for top enrichment results by source

Usage

topEnrichBySource(
  enrichDF,
  n = 15,
  min_count = 1,
  p_cutoff = 1,
  sourceColnames = c("gs_cat", "gs_subcat"),
  sortColname = c("P-value", "pvalue", "qvalue", "padjust", "-GeneRatio", "-Count",
    "-geneHits"),
  countColname = c("gene_count", "count", "geneHits"),
  pvalueColname = c("P.Value", "pvalue", "FDR", "adj.P.Val", "qvalue"),
  directionColname = c("activation.z.{0,1}score", "z.{0,1}score"),
  direction_cutoff = 1,
  newColname = "EnrichGroup",
  curateFrom = NULL,
  curateTo = NULL,
  sourceSubset = NULL,
  sourceSep = "_",
  subsetSets = NULL,
  descriptionColname = c("Description", "Name", "Pathway"),
  nameColname = c("ID", "Name"),
  descriptionGrep = NULL,
  nameGrep = NULL,
  verbose = FALSE,
  ...
)

topEnrichListBySource(
  enrichList,
  n = 15,
  min_count = 1,
  p_cutoff = 1,
  sourceColnames = c("gs_cat", "gs_subcat"),
  sortColname = c("P-value", "pvalue", "qvalue", "padjust", "-GeneRatio", "-Count",
    "-geneHits"),
  countColname = c("gene_count", "count", "geneHits"),
  pvalueColname = c("P.Value", "pvalue", "FDR", "adj.P.Val", "qvalue"),
  directionColname = c("activation.z.{0,1}score", "z.{0,1}score"),
  direction_cutoff = 1,
  newColname = "EnrichGroup",
  curateFrom = NULL,
  curateTo = NULL,
  sourceSubset = NULL,
  sourceSep = "_",
  subsetSets = NULL,
  descriptionColname = c("Description", "Name", "Pathway"),
  nameColname = c("ID", "Name"),
  descriptionGrep = NULL,
  nameGrep = NULL,
  verbose = FALSE,
  ...
)

Arguments

enrichDF

data.frame containing enrichment results.

n

integer maximum number of pathways to retain per source, after applying the min_count and p_cutoff thresholds if relevant.

min_count

integer minimum number of genes involved in an enrichment result to be retained, based upon values in countColname.

p_cutoff

numeric value indicating the enrichment P-value threshold; pathways with an enrichment P-value at or below this threshold are retained, based upon values in pvalueColname.

sourceColnames

character vector of colnames in enrichDF to consider as the "Source". Multiple columns are combined using the delimiter argument sourceSep. When sourceColnames is NULL or matches none of colnames(enrichDF), all data is considered to be one source, "All".

sortColname

character vector indicating the colnames to use to sort data, prior to selecting the top n results by source. This argument is passed to jamba::mixedSortDF(x, byCols=sortColname). Columns can be sorted in reverse order by using the prefix "-", as described in jamba::mixedSortDF().
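The "-" prefix convention can be illustrated with a minimal base R sketch. This is not the jamba::mixedSortDF() implementation: it assumes every sort column is numeric and present in the data, and the data.frame and its values are hypothetical.

```r
# Hypothetical enrichment results, for illustration only.
enrich_df <- data.frame(
  ID = c("a", "b", "c", "d"),
  pvalue = c(0.01, 0.01, 0.001, 0.05),
  Count = c(5, 12, 3, 8),
  stringsAsFactors = FALSE
)

# Sort by ascending pvalue, then by descending Count (note the "-" prefix).
sort_colnames <- c("pvalue", "-Count")

# Build one sort key per column; negate numeric values for reverse order.
sort_keys <- lapply(sort_colnames, function(col) {
  descending <- startsWith(col, "-")
  values <- enrich_df[[sub("^-", "", col)]]
  if (descending) -values else values
})

# order() accepts multiple keys and breaks ties using the later keys.
row_order <- do.call(order, sort_keys)
sorted_df <- enrich_df[row_order, , drop = FALSE]
print(sorted_df$ID)
```

Rows "a" and "b" share the same pvalue, so the "-Count" key places the row with the larger Count first.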

countColname

character vector of possible colnames in enrichDF that should contain the integer number of genes involved in enrichment. This vector is passed to find_colname() to find an appropriate matching colname in enrichDF.

pvalueColname

character vector of possible colnames in enrichDF that should contain the enrichment P-value used for filtering by p_cutoff.

newColname

new column name to use when sourceColnames matches multiple colnames in enrichDF. Values for each row are combined using jamba::pasteByRow().

curateFrom, curateTo

character vectors of patterns and replacements, respectively, passed to gsubs() to allow some editing of source values. For example, the pattern "CP:.*" with replacement "CP" converts MSigDB canonical pathways from the prefix "CP:" to "CP", which has the effect of combining all canonical pathways before selecting the top n results.
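The effect of a pattern and replacement pair can be sketched with base R gsub(). This only approximates what gsubs() is described as doing; the source values below are hypothetical, and the loop over pairs is an assumption about how multiple patterns would be applied in order.

```r
# Hypothetical source values, for illustration only.
source_values <- c("CP:KEGG", "CP:REACTOME", "CP", "GO:BP")

# Pattern and replacement as they might be supplied to curateFrom and curateTo.
curate_from <- c("CP:.*")
curate_to <- c("CP")

# Apply each pattern/replacement pair in order (assumed behavior).
curated <- source_values
for (i in seq_along(curate_from)) {
  curated <- gsub(curate_from[i], curate_to[i], curated)
}
print(curated)
```

After curation the three canonical pathway entries share the single source value "CP", so they would compete for the same top n slots.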

sourceSubset

character vector with a subset of sources to retain. If there are multiple colnames in sourceColnames, then column values are combined using jamba::pasteByRow() and delimiter sourceSep, prior to filtering.

sourceSep

character string used as a delimiter when sourceColnames contains multiple colnames.
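As a rough illustration of how sourceColnames, sourceSep, newColname, and sourceSubset relate, the sketch below combines two columns with base R paste(). The function itself is documented to use jamba::pasteByRow(), which may treat blank values differently; the data here is hypothetical.

```r
# Hypothetical enrichment results with two source columns.
enrich_df <- data.frame(
  ID = c("set1", "set2", "set3"),
  gs_cat = c("C2", "C2", "C5"),
  gs_subcat = c("CP:KEGG", "CGP", "GO:BP"),
  stringsAsFactors = FALSE
)

# Combine the source columns into one column, using the delimiter.
source_sep <- "_"
enrich_df$EnrichGroup <- paste(enrich_df$gs_cat, enrich_df$gs_subcat,
  sep = source_sep)

# A source subset must be written in the combined form to match.
source_subset <- c("C2_CP:KEGG", "C5_GO:BP")
kept_df <- enrich_df[enrich_df$EnrichGroup %in% source_subset, , drop = FALSE]
print(kept_df$EnrichGroup)
```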

descriptionColname, nameColname

character vectors indicating the colnames to use as the description and name, as returned from find_colname(). These arguments are used only when descriptionGrep or nameGrep is supplied.

descriptionGrep, nameGrep

character vectors of patterns, used to filter pathways to those matching one or more patterns. These arguments help extract a specific subset of pathways of interest using keywords. The descriptionGrep argument searches only descriptionColname; the nameGrep argument searches only nameColname.
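Keyword filtering of this kind can be sketched in base R as follows. The data is hypothetical, and case-insensitive matching is an assumption made for the example, not a documented behavior of the function.

```r
# Hypothetical enrichment results, for illustration only.
enrich_df <- data.frame(
  ID = c("hsa04110", "hsa04210", "hsa03030"),
  Description = c("Cell cycle", "Apoptosis", "DNA replication"),
  stringsAsFactors = FALSE
)

# Patterns as they might be supplied to descriptionGrep.
description_grep <- c("cycle", "replication")

# Keep a row when its description matches one or more patterns.
match_list <- lapply(description_grep, function(pattern) {
  grepl(pattern, enrich_df$Description, ignore.case = TRUE)
})
matches_any <- Reduce(`|`, match_list)
kept_df <- enrich_df[matches_any, , drop = FALSE]
print(kept_df$Description)
```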

verbose

logical indicating whether to print verbose output.

...

additional arguments are ignored.

enrichList

list of enrichDF entries, each passed to topEnrichBySource().

Details

This function takes one enrichResult object, or a data.frame of enrichment results, and determines the top n pathways sorted by P-value within each pathway source. It can optionally require at least min_count genes in each pathway, and an enrichment P-value no greater than p_cutoff, before taking the top n entries. The default arguments do not apply filters to min_count and p_cutoff.

When the enrichment data represents pathways from multiple sources, the filtering and sorting are applied to each source independently. The intent is to retain the top entries from each source, as a method of representing each source consistently even when one source may contain many more pathways, and importantly where the range of enrichment P-values may be very different for each source. For example, a database of small canonical pathways would generally provide less statistically significant P-values than a database of dysregulated genes from gene expression experiments, where each set contains a large number of genes.
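The per-source logic described above can be sketched in base R. This is a simplified illustration under stated assumptions, not the package implementation: the helper top_n_by_source() and the column names source, pvalue, and count are hypothetical, and the data is made up.

```r
# Hypothetical results from two sources with very different P-value ranges.
enrich_df <- data.frame(
  ID = paste0("set", 1:6),
  source = c("CP", "CP", "CP", "CGP", "CGP", "CGP"),
  pvalue = c(0.03, 0.01, 0.2, 1e-8, 1e-5, 1e-10),
  count = c(4, 6, 2, 40, 35, 50),
  stringsAsFactors = FALSE
)

# Hypothetical helper: filter, sort by P-value, then take the top n per source.
top_n_by_source <- function(df, n = 2, min_count = 1, p_cutoff = 1) {
  df <- df[df$count >= min_count & df$pvalue <= p_cutoff, , drop = FALSE]
  df <- df[order(df$pvalue), , drop = FALSE]
  per_source <- lapply(split(df, df$source), head, n = n)
  out <- do.call(rbind, per_source)
  rownames(out) <- NULL
  out
}

top_df <- top_n_by_source(enrich_df, n = 2, min_count = 3, p_cutoff = 0.05)
print(top_df[, c("ID", "source", "pvalue")])
```

A single global top 2 by P-value would contain only "CGP" entries; taking the top 2 within each source keeps "CP" represented as well.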

This function can optionally apply basic curation of pathway source names, and can optionally be applied to multiple source columns. This feature is intended for sources like MSigDB (see http://software.broadinstitute.org/gsea/msigdb/index.jsp), which contains columns "Source" and "Category", and where canonical pathways are represented either with "CP" or with a prefix "CP:". Curating every "CP:.*" value down to just "CP", using curateFrom and curateTo, causes all canonical pathways to be considered the same source. MSigDB also contains numerous other sources, each of which is independently filtered and sorted to its top n entries.

Finally, this function is useful to subset enrichment results by name, using descriptionGrep or nameGrep.

topEnrichListBySource() extends topEnrichBySource() by applying filters to each enrichList entry, then keeping pathways across all enrichList entries that match the filter criteria in any one entry. It is most useful in the context of multiEnrichMap(), where a pathway must meet all criteria in at least one enrichment, and that pathway should then be included for all enrichments for the purpose of comparative analysis.
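The "pass in any one, keep in all" behavior can be sketched in base R. This is an illustration of the idea only, using a single P-value filter; the list, column names, and values are hypothetical and the real function applies the full set of filters described above.

```r
# Hypothetical list of enrichment results testing the same four pathways.
enrich_list <- list(
  A = data.frame(ID = c("p1", "p2", "p3", "p4"),
    pvalue = c(0.001, 0.2, 0.04, 0.6), stringsAsFactors = FALSE),
  B = data.frame(ID = c("p1", "p2", "p3", "p4"),
    pvalue = c(0.5, 0.01, 0.3, 0.7), stringsAsFactors = FALSE)
)
p_cutoff <- 0.05

# Step 1: find pathways that pass the filter in at least one enrichment.
passing_by_entry <- lapply(enrich_list, function(df) {
  df$ID[df$pvalue <= p_cutoff]
})
passing_ids <- unique(unlist(passing_by_entry, use.names = FALSE))

# Step 2: keep those pathways in every enrichment, even where they did not pass.
kept_list <- lapply(enrich_list, function(df) {
  df[df$ID %in% passing_ids, , drop = FALSE]
})
print(lapply(kept_list, function(df) df$ID))
```

Pathway "p2" fails the cutoff in A but passes in B, so it is kept in both entries, which allows it to be compared across enrichments; "p4" passes in neither and is dropped from both.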

Value

data.frame subset to at most n rows per source, after applying the optional min_count and p_cutoff filters.

See Also

Other jam enrichment functions: multiEnrichMap()


jmw86069/multienrichjam documentation built on Feb. 7, 2024, 12:58 a.m.