roc: ROC
In PavlidisLab/ermineR: Gene set analysis with multifunctionality assessment

View source: R/testWrappers.R

roc	R Documentation

ROC

Description

The receiver operating characteristic (ROC) method is a fast, non-parametric alternative to the ORA and resampling methods for generating gene set scores from gene scores.

Usage

roc(
  scores,
  scoreColumn = 1,
  bigIsBetter = FALSE,
  logTrans = FALSE,
  annotation = NULL,
  aspects = c("Molecular Function", "Cellular Component", "Biological Process"),
  geneReplicates = c("mean", "best"),
  pAdjust = c("FDR", "Bonferroni"),
  geneSetDescription = "Latest_GO",
  customGeneSets = NULL,
  minClassSize = 20,
  maxClassSize = 200,
  output = NULL,
  return = TRUE
)

Arguments

`scores`	A data.frame. Row names must be unique gene identifiers (e.g. probes), followed by any number of columns. The column used for scoring is chosen by `scoreColumn`. See http://erminej.msl.ubc.ca/help/input-files/gene-scores/ for information about how to specify scores. (for test = ORA, GSR and ROC)
`scoreColumn`	Integer or character. Which column of the `scores` data.frame to use as scores. Defaults to first column of `scores`. See http://erminej.msl.ubc.ca/help/input-files/gene-scores/ for details. (for test = ORA, GSR and ROC)
`bigIsBetter`	Logical. If `TRUE`, larger scores are considered better. `FALSE` by default (as for p values).
`logTrans`	Logical. Should the data be -log10 transformed? Recommended for p values. `FALSE` by default.
`annotation`	Annotation. A file path, a data.frame or a platform short name (e.g. GPL127). If given a platform short name, the annotation will be downloaded from the annotation repository of the Pavlidis Lab (https://gemma.msl.ubc.ca/annots/). To get a list of available annotations, use `listGemmaAnnotations`. Note that if there is a file or folder with the same name as the platform name in the directory, that file will be read instead of getting a copy from the Pavlidis Lab. If this file isn't a valid annotation file, the function will fail. If providing a custom annotation file, see `makeAnnotation` to do it from R or erminej.msl.ubc.ca/help/input-files/gene-annotations/ to do it manually. If you are providing a custom gene set, you can leave annotation as NULL.
`aspects`	Character vector. Which GO aspects to include in the analysis. Can be in long form (e.g. 'Molecular Function') or short form (e.g. `c('M','C','B')`).
`geneReplicates`	What to do when genes have multiple scores in the input file (due to multiple probes per gene). Either "mean" or "best".
`pAdjust`	Which multiple test correction method to use. Can be "FDR" or "Bonferroni", as listed in Usage.
`geneSetDescription`	"Latest_GO", a file path that leads to a GO XML or OBO file, or a URL that leads to a GO ontology file that ends with rdf-xml.gz. If you left annotation as NULL and provided customGeneSets, this argument is not required and will default to NULL. Otherwise, by default it'll be set to "Latest_GO", which downloads the latest available GO XML file. This option won't work without an internet connection. To get a frozen file that you can use later, see `goToday`, `goAtDate` and `getGoDates`. See http://erminej.msl.ubc.ca/help/input-files/gene-set-descriptions/ for details.
`customGeneSets`	Path to a directory that contains custom gene set files, paths to custom gene set files themselves, or a named list of character strings. Use this option to create your own gene sets. If you provide a directory, you can specify probes or gene symbols to include in your gene sets. See http://erminej.msl.ubc.ca/help/input-files/gene-sets/ for information about the format of these files. If you are providing a list, only gene symbols are accepted.
`minClassSize`	Minimum class (gene set) size. Defaults to 20.
`maxClassSize`	Maximum class (gene set) size. Defaults to 200.
`output`	Output file name.
`return`	Logical. Whether results should be returned. Set to `FALSE` if you only want a file.
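
To illustrate what the two `pAdjust` choices listed in Usage do to a vector of p values, here is a minimal sketch using base R's `p.adjust`. It assumes that "FDR" corresponds to the Benjamini-Hochberg procedure; the p values are made up, and ermineJ performs its own correction internally, so this is an illustration rather than the package's code path.

```r
# Made-up gene set p values, for illustration only.
p_raw <- c(set_a = 0.001, set_b = 0.01, set_c = 0.03, set_d = 0.5)

# Bonferroni: multiply each p value by the number of tests, capped at 1.
p_bonferroni <- p.adjust(p_raw, method = "bonferroni")

# Benjamini-Hochberg false discovery rate (assumed here to match "FDR").
p_fdr <- p.adjust(p_raw, method = "BH")

print(rbind(raw = p_raw, bonferroni = p_bonferroni, fdr = p_fdr))
```

Bonferroni is the more conservative of the two: it never yields a smaller corrected value than the FDR procedure on the same input.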

Details

The ROC is a well-known method for evaluating rankings of items, in this case genes. The ranking in this case comes from the gene scores. A gene set will get a good ROC if many genes in the gene set are near the top of the list.

The score measured for each gene set is the area under the ROC curve, a value between 0 and 1. If the genes in the gene set are randomly distributed in the ranking, you would expect a value near 0.5. Values near 1 indicate the genes in the gene set are near the top of the list, while values near 0 indicate the genes in the gene set are near the bottom of the list. In principle both values near 0 and near 1 are statistically significant, but p-values reported by ermineJ are based on the assumption that only the top of the list is of interest (e.g., we’re not considering “under-representation analysis”).
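
The area under the ROC curve can be computed directly from ranks. The sketch below uses the textbook rank-sum formulation in base R; the helper `auc_from_scores` and the toy p values are hypothetical, and this is not necessarily how ermineJ implements the calculation.

```r
# Rank-based AUC: the fraction of (in-set, out-of-set) gene pairs in which
# the in-set gene is ranked better. Hypothetical helper for illustration.
auc_from_scores <- function(scores, in_set, big_is_better = FALSE) {
  stopifnot(length(scores) == length(in_set), is.logical(in_set))
  # Flip the sign so that a larger "goodness" always means a better gene.
  goodness <- if (big_is_better) scores else -scores
  r <- rank(goodness)            # best gene gets the highest rank; ties are averaged
  n1 <- sum(in_set)              # genes in the set
  n0 <- sum(!in_set)             # genes outside the set
  u <- sum(r[in_set]) - n1 * (n1 + 1) / 2
  u / (n1 * n0)
}

# Toy p values (smaller is better), made up for this example.
pvals <- c(g1 = 0.001, g2 = 0.004, g3 = 0.01, g4 = 0.03, g5 = 0.08,
           g6 = 0.2, g7 = 0.35, g8 = 0.5, g9 = 0.7, g10 = 0.9)

auc_top    <- auc_from_scores(pvals, names(pvals) %in% c("g1", "g2", "g3"))
auc_bottom <- auc_from_scores(pvals, names(pvals) %in% c("g8", "g9", "g10"))
auc_mixed  <- auc_from_scores(pvals, names(pvals) %in% c("g1", "g5", "g10"))

print(c(top = auc_top, bottom = auc_bottom, mixed = auc_mixed))
```

A set made of the three best genes scores 1, a set made of the three worst scores 0, and a set spread across the ranking lands near 0.5.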

Unlike the other methods in ermineJ, with the exception of the PRC method, the ROC uses only the ranks of the gene scores. That is, all it cares about is the ordering of items obtained from your gene scores (e.g., t-test or fold-change); it doesn’t use the information about the relative values of the scores.
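
One consequence of using only ranks is that any order-preserving transformation of the scores leaves the ranking unchanged. A small base R check with made-up p values:

```r
# Made-up p values (smaller is better).
pvals <- c(g1 = 0.001, g2 = 0.004, g3 = 0.01, g4 = 0.03, g5 = 0.08)

# Rank so that the best gene gets the highest rank.
ranks_raw <- rank(-pvals)          # raw p values: negate, since small is good
ranks_log <- rank(-log10(pvals))   # -log10 scale: large is already good

same_ranking <- all(ranks_raw == ranks_log)
print(same_ranking)
```

Because the two rankings agree, a purely rank-based statistic is the same on either scale, provided the direction of "better" is set consistently.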

P-values for this analysis are computed using algorithms described in Breslin et al., 2004. For more information on the ROC, you could do worse than reading the Wikipedia page http://en.wikipedia.org/wiki/Receiver_operator_characteristic.
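
For intuition about a one-sided, rank-based p-value, the sketch below uses base R's `wilcox.test` as a stand-in. This is an assumption made for illustration only: it is not the Breslin et al. algorithm, and its results need not match those reported by ermineJ. The p values are made up.

```r
# Made-up gene-level p values (smaller is better).
pvals <- c(g1 = 0.001, g2 = 0.004, g3 = 0.01, g4 = 0.03, g5 = 0.08,
           g6 = 0.2, g7 = 0.35, g8 = 0.5, g9 = 0.7, g10 = 0.9)
in_set <- names(pvals) %in% c("g1", "g2", "g3")

# One-sided rank-sum test: are in-set scores shifted toward smaller values,
# i.e. toward the top of the list?
wt <- wilcox.test(pvals[in_set], pvals[!in_set], alternative = "less")
p_one_sided <- wt$p.value

# With no ties, W counts the pairs in which the in-set gene has the larger
# (worse) p value, so the AUC is one minus that fraction.
n1 <- sum(in_set)
n0 <- sum(!in_set)
auc <- 1 - unname(wt$statistic) / (n1 * n0)

print(c(auc = auc, p = p_one_sided))
```

Only the top of the list is tested here (`alternative = "less"`), which mirrors the one-sided interpretation described above.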

Like other non-parametric techniques, using ranks costs some statistical power but makes fewer assumptions. Specifically, if you think the ordering of items in your data is more accurate than the actual p-values themselves, the ROC might be appropriate. The PRC method is similar in that it uses ranks, but it puts more emphasis on genes in the set that are ranked very near the top. In contrast, the ROC method looks at overall trends in the rankings.

Method overview taken from: http://erminej.msl.ubc.ca/help/tutorials/running-an-analysis-correlation/

Value

A list containing a "results" component and a "details" component. "results" is a data.frame containing the main output. The columns of this table are:

Name: the name of the gene set
ID: the id of the gene set
NumProbes: the number of elements (e.g. probes) in the gene set.
NumGenes: the number of genes in the gene set.
RawScore: the raw statistic for the gene set. For explanations see this page
Pval: the p value for the gene set.
CorrectedPvalue: the corrected p value. See this page for more information.
MFPvalue: the p value after multifunctionality correction. Might be missing if correction was not performed.
CorrectedMFPvalue: like CorrectedPvalue, but for the multifunctionality “corrected” p value.
Multifunctionality: How biased the genes in the set are towards multifunctional genes.
Same as: a list of gene sets which have the exact same members as this one. Such gene sets are not listed anywhere else.
GeneMembers: If you selected the “Include genes” option when saving, this will contain a list of the genes that are in the gene set, separated by “|”.
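
Examples

The following is a minimal sketch of a call using the signature shown in Usage, with custom gene sets supplied as a named list so that `annotation` can stay `NULL`. The gene symbols, p values and gene set names are made up, and the call is guarded so that it only runs when ermineR is installed (the guard and the toy data are assumptions of this sketch, not part of the package).

```r
# Hypothetical toy data: 200 made-up gene symbols with random p values.
set.seed(1)
genes <- paste0("GENE", 1:200)
scores <- data.frame(pvalue = runif(200), row.names = genes)

# Two made-up gene sets as a named list of gene symbols; both sizes fall
# within the default minClassSize/maxClassSize range (20 to 200).
gene_sets <- list(
  toy_set_a = genes[1:25],
  toy_set_b = genes[101:130]
)

if (requireNamespace("ermineR", quietly = TRUE)) {
  out <- ermineR::roc(
    scores = scores,
    scoreColumn = "pvalue",
    bigIsBetter = FALSE,   # p values: smaller is better
    logTrans = FALSE,      # ROC uses ranks only
    customGeneSets = gene_sets
  )
  # "results" holds the main table described under Value.
  res <- out$results
  print(res[order(res$Pval), c("Name", "ID", "Pval", "CorrectedPvalue")])
}
```

With random scores like these, no gene set is expected to stand out; substitute real gene scores and gene sets to get meaningful output.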