GO_analyse: Identifies gene ontologies clustering samples according to...
In kevinrue/GOexpress-release: Visualise microarray and RNAseq data using gene ontology annotations


Combines gene expression data with Gene Ontology (GO) annotations to rank and visualise genes and GO terms enriched for genes best clustering predefined groups of samples based on gene expression levels.

This method semi-automatically retrieves the latest information from Ensembl using the biomaRt package, unless custom GO annotations are provided. Custom GO annotations have two main benefits: firstly, they allow the analysis of species not supported by the Ensembl BioMart server, and secondly, they save time by skipping calls to the Ensembl BioMart server for species that are supported. The latter also makes it possible to use an older release of the Ensembl annotations, for example.

Using default settings, a random forest analysis is performed to evaluate the ability of each gene to cluster samples according to a predefined grouping factor (one-way ANOVA is available as an alternative). Each GO term is scored and ranked according to the average rank (alternatively, average power) of all associated genes to cluster the samples according to the factor. The ranked list of GO terms is returned, with tools to visualise the statistics on a gene and ontology basis.
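The GO-level scoring described above can be sketched in base R. This is a minimal illustration using made-up gene scores and a made-up gene-to-GO mapping, not the package's internal code; all identifiers, values, and the output column names `ave_rank` and `ave_score` are assumptions for the example.

```r
# Toy gene-level scores (hypothetical values standing in for gene scores)
gene_scores <- c(geneA = 9.1, geneB = 7.4, geneC = 3.2, geneD = 1.5, geneE = 0.3)

# Rank genes so that the highest score receives rank 1
gene_ranks <- rank(-gene_scores, ties.method = "min")

# Toy gene-to-GO mapping (a gene may be annotated to several GO terms)
mapping <- data.frame(
    gene_id = c("geneA", "geneB", "geneC", "geneD", "geneE", "geneA"),
    go_id   = c("GO:term1", "GO:term1", "GO:term2", "GO:term2", "GO:term2",
                "GO:term3"),
    stringsAsFactors = FALSE
)
mapping$gene_rank  <- unname(gene_ranks[mapping$gene_id])
mapping$gene_score <- unname(gene_scores[mapping$gene_id])

# Summarise each GO term by the mean rank and mean score of its genes
go_table <- aggregate(
    cbind(gene_rank, gene_score) ~ go_id, data = mapping, FUN = mean
)
names(go_table) <- c("go_id", "ave_rank", "ave_score")

# Order GO terms by average rank (a smaller average rank is better)
go_table <- go_table[order(go_table$ave_rank), ]
print(go_table)
```

In this toy example, the term annotated to a single top-ranked gene comes first, which also illustrates why GO terms with very few genes tend to dominate the top of the ranking.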

GO_analyse(
    eSet, f, subset=NULL, biomart_dataset="", microarray="",
    method="randomForest", rank.by="rank", do.trace=100, ntree=1000,
    mtry=ceiling(2*sqrt(nrow(eSet))), GO_genes=NULL, all_GO=NULL,
    all_genes=NULL, FUN.GO=mean, ...)

`eSet`	`ExpressionSet` of the `Biobase` package including a gene-by-sample expression matrix in the `assayData` slot, and a phenotypic information data-frame in the `phenoData` slot. In the expression matrix, row names are identifiers of expressed features, and column names are identifiers of the individual samples. In the phenotypic data-frame, row names are sample identifiers, column names are grouping factors and phenotypic traits usable for statistical tests and visualisation methods.
`f`	A column name in `phenoData` used as the grouping factor for the analysis.
`subset`	A named list to subset `eSet` for the analysis. Names must be column names existing in colnames(pData(eSet)). Values must be vectors of values existing in the corresponding column of pData(eSet). The original ExpressionSet will be left unchanged.
`biomart_dataset`	The Ensembl BioMart dataset identifier corresponding to the species studied. If not specified and no custom annotations were provided, the method will attempt to automatically identify the adequate dataset from the first feature identifier in the dataset. Use data(prefix2dataset) to access a table listing valid choices.
`microarray`	The identifier in the Ensembl BioMart corresponding to the microarray platform used. If not specified and no custom annotations were provided, the method will attempt to automatically identify the platform used from the first feature identifier in the dataset. Use `data(microarray2dataset)` to access a table listing valid choices.
`method`	The statistical framework to score genes and gene ontologies. Either "randomForest" or "rf" to use the random forest algorithm, or alternatively either of "anova" or "a" to use the one-way ANOVA model. Default is "randomForest".
`rank.by`	Either of "rank" or "score" to choose the metric used to order the gene and GO term result tables. Default is "rank".
`do.trace`	Only used if method="randomForest". If set to TRUE, gives a more verbose output as randomForest is run. If set to some integer, then running output is printed for every do.trace trees. Default is 100.
`ntree`	Only used if method="randomForest". Number of trees to grow. This should be set to a number large enough to ensure that every input row gets predicted at least a few times. Default is 1000.
`mtry`	Only used if method="randomForest". Number of features randomly sampled as candidates at each split. Default value is ceiling(2*sqrt(gene_count)), which is 220 genes for a dataset of 12,000 genes.
`GO_genes`	Custom annotations associating features present in the expression dataset to gene ontology identifiers. This must be provided as a data-frame of two columns, named `gene_id` and `go_id`. If provided, no call to the Ensembl BioMart server will be done, and arguments `all_GO` and `all_genes` should be provided as well, to enable all downstream features of `GOexpress`. An example is provided in `AlvMac_GOgenes`.
`all_GO`	Custom annotations used to annotate each GO identifier present in `GO_genes` with the ontology name (e.g. "apoptotic process") and namespace (i.e. "biological_process", "molecular_function", or "cellular_component"). This must be provided as a data-frame containing at least one column named `go_id`, and preferably two more columns named `name_1006` and `namespace_1003` for consistency with the Ensembl BioMart. Supported alternative column headers are `name` and `namespace`. `name` should provide the description of the GO term, and `namespace` should contain one of "biological_process", "molecular_function" and "cellular_component". `name` is used to generate the title of ontology-based figures, and `namespace` is important to enable subsequent filtering of results by their corresponding value. An example is provided in `data(AlvMac_allGO)`.
`all_genes`	Custom annotations used to annotate each feature identifier in the expression dataset with the gene name or symbol (e.g. "TNF"), and an optional description. This must be provided as a data-frame containing at least a column named `gene_id` and preferably two more columns named `external_gene_name` and `description` for consistency with the Ensembl BioMart. A supported alternative header is `name`. While `external_gene_name` is important to enable subsequent visualisation of results by gene symbol, `description` is only displayed for readability of result tables. An example is provided in `data(AlvMac_allgenes)`.
`FUN.GO`	Function to summarise the score and rank of all features associated with each gene ontology. Default is the `mean` function. If using "lambda-like" (anonymous) functions, these must take a list of numeric values as an input, and return a single numeric value as an output.
`...`	Additional arguments passed on to the randomForest() method, if applicable.
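As an illustration of the three custom annotation tables described above, the following sketch builds minimal data-frames with the documented column names and runs a few consistency checks. The feature identifiers, GO identifiers, names, and descriptions are hypothetical placeholders, and the checks are a suggestion rather than something the package is known to enforce.

```r
# Gene-to-GO associations: exactly two columns, gene_id and go_id
GO_genes <- data.frame(
    gene_id = c("feature1", "feature2", "feature2", "feature3"),
    go_id   = c("GO:term1", "GO:term1", "GO:term2", "GO:term2"),
    stringsAsFactors = FALSE
)

# GO term annotations, using the Ensembl BioMart-style column names
all_GO <- data.frame(
    go_id          = c("GO:term1", "GO:term2"),
    name_1006      = c("toy process", "toy function"),
    namespace_1003 = c("biological_process", "molecular_function"),
    stringsAsFactors = FALSE
)

# Feature annotations: gene symbol and an optional description
all_genes <- data.frame(
    gene_id            = c("feature1", "feature2", "feature3"),
    external_gene_name = c("GENE1", "GENE2", "GENE3"),
    description        = c("toy gene 1", "toy gene 2", "toy gene 3"),
    stringsAsFactors = FALSE
)

# Consistency checks worth running before the analysis
valid_namespaces <- c(
    "biological_process", "molecular_function", "cellular_component"
)
stopifnot(
    identical(colnames(GO_genes), c("gene_id", "go_id")),
    all(GO_genes$go_id %in% all_GO$go_id),
    all(GO_genes$gene_id %in% all_genes$gene_id),
    all(all_GO$namespace_1003 %in% valid_namespaces)
)

# With a real ExpressionSet whose feature names match gene_id, the tables
# would then be supplied as (not run here):
# GO_analyse(eSet, f, GO_genes = GO_genes, all_GO = all_GO,
#            all_genes = all_genes)
```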

The default scoring functions strongly favor GO terms associated with fewer genes at the top of the ranking. This bias may actually be seen as a valuable feature which enables the user to browse through GO terms of increasing "granularity", i.e. associated with increasingly large sets of genes, although consequently being increasingly vague and blurry (e.g. "protein binding" molecular function associated with over 6,000 genes).

It is suggested to use the subset_scores() function to subsequently filter out GO terms associated with fewer than a minimum number of genes (a threshold of 5 or more is suggested). Indeed, those GO terms are more sensitive to outlier genes, as their score is the average of only a handful of genes.
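The effect of such a filter can be sketched on a mock GO-level table in base R. This does not call subset_scores() itself; the table, its column names (`ave_rank`, `total_count`), and its values are assumptions made up for the illustration.

```r
# Mock GO-level result table (hypothetical identifiers and values)
go_table <- data.frame(
    go_id       = c("GO:term1", "GO:term2", "GO:term3", "GO:term4"),
    ave_rank    = c(12.0, 35.5, 80.2, 150.7),
    total_count = c(2, 5, 48, 6021),  # number of genes annotated to each term
    stringsAsFactors = FALSE
)

# Keep only GO terms annotated to at least min_genes genes
min_genes <- 5
filtered <- go_table[go_table$total_count >= min_genes, ]
print(filtered)
```

The term scored on only two genes is dropped, while the ordering of the remaining terms is unchanged.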

Additionally, the pValue_GO function may be used to generate a permutation-based P-value indicating the probability of each GO term reaching an equal or higher rank (or score) by chance.
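The idea of a permutation-based P-value can be sketched in base R as follows. This is a generic illustration under stated assumptions (toy ranks, a single toy GO term, gene labels shuffled relative to ranks, and a "+1" correction in the estimate); it is not the implementation used by pValue_GO.

```r
set.seed(1)

# Toy setup: 100 genes with ranks 1..100 (rank 1 is best)
gene_ranks <- seq_len(100)
names(gene_ranks) <- paste0("gene", seq_len(100))

# One hypothetical GO term annotated to five well-ranked genes
term_genes <- c("gene1", "gene3", "gene4", "gene10", "gene12")
observed <- mean(gene_ranks[term_genes])

# Null distribution: shuffle the ranks across gene labels and recompute
# the average rank of the same GO term each time
n_perm <- 1000
permuted <- replicate(n_perm, {
    shuffled <- setNames(sample(unname(gene_ranks)), names(gene_ranks))
    mean(shuffled[term_genes])
})

# Proportion of permutations in which the term does at least as well
# (the +1 terms avoid reporting a P-value of exactly zero)
p_value <- (sum(permuted <= observed) + 1) / (n_perm + 1)
print(p_value)
```

A small P-value here indicates that an average rank as good as the observed one is rarely obtained when gene labels carry no information.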

A list containing the results of the analysis. Some elements are specific to the output of each analysis method.

Core elements:

`GO`	A table ranking all GO terms related to genes in the expression dataset based on the average ability of their related genes to cluster the samples according to the predefined grouping factor.
`mapping`	The table mapping genes present in the dataset to GO terms.
`genes`	A table ranking all genes present in the expression dataset based on their ability to cluster the samples according to the predefined grouping factor.
`factor`	The predefined grouping factor.
`method`	The statistical framework used.
`subset`	The filters used to run the analysis only on a subset of the samples. NULL if no filter was applied.
`rank.by`	The metric used to rank order the genes and gene ontologies.
`FUN.GO`	The function used to summarise the score and rank of all gene features associated with each gene ontology.

Random Forest additional elements:

`ntree`	Number of trees grown.
`mtry`	Number of variables randomly sampled as candidates at each split.

One-way ANOVA does not return additional elements.

Make sure that the factor f is an actual factor in the R sense of the term. This is important for the underlying statistical framework to identify the groups of samples defined by their level of this factor.

If the column defining the factor (e.g. "Treatment") in `phenoData` is not an R factor, use pData(targets)$Treatment = factor(pData(targets)$Treatment) to convert the character values into an actual R factor with appropriate levels.
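The conversion can be checked on a plain data-frame standing in for the phenotypic table. The sample names and the "Control" level below are hypothetical; on a real ExpressionSet the same assignment would be applied through pData(), as described above.

```r
# Mock phenotypic table with Treatment stored as character values
pheno <- data.frame(
    Treatment = c("Control", "MB", "TB", "Control", "MB", "TB"),
    Time      = c("48H", "48H", "48H", "48H", "48H", "48H"),
    row.names = paste0("sample", 1:6),
    stringsAsFactors = FALSE
)

# Before conversion the column is a character vector, not a factor
was_factor <- is.factor(pheno$Treatment)

# Convert to a factor; levels are derived from the unique values
pheno$Treatment <- factor(pheno$Treatment)

print(was_factor)
print(is.factor(pheno$Treatment))
print(levels(pheno$Treatment))
```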

Kevin Rue-Albrecht

Methods subset_scores, pValue_GO, getBM, randomForest, and oneway.test.

# Load example data subset
data(AlvMac)
# Load a local copy of annotations obtained from the Ensembl BioMart server
data(AlvMac_GOgenes)
data(AlvMac_allgenes)
data(AlvMac_allGO)

# Run the analysis on factor "Treatment",
# considering only treatments "MB" and "TB" at time-point "48H"
# using a local copy of annotations obtained from the Ensembl BioMart server
AlvMac_results <- GO_analyse(  
    eSet=AlvMac, f="Treatment",
    subset=list(Time=c("48H"), Treatment=c("MB", "TB")),
    GO_genes=AlvMac_GOgenes, all_genes=AlvMac_allgenes, all_GO=AlvMac_allGO
    )

# Valid Ensembl BioMart datasets are listed in the following variable
data(prefix2dataset)

# Valid microarray= values are listed in the following variable
data(microarray2dataset)

## Not run: 
# Other valid but time-consuming examples:

# Run the analysis on factor "Treatment" including all samples
GO_analyse(eSet=AlvMac, f="Treatment")

# Run the analysis on factor "Treatment" using ANOVA method
GO_analyse(eSet=AlvMac, f="Treatment", method="anova")


# Use alternative GO scoring/summarisation functions (Default is: average)

# Named functions
GO_analyse(eSet=AlvMac, f="Treatment", FUN.GO = median)

# Anonymous functions (simple example without scientific value)
GO_analyse(eSet=AlvMac, f="Treatment", FUN.GO = function(x){median(x)/100})


# Syntax examples without actual data:

# To force the use of the Ensembl BioMart for the human species, use:
GO_analyse(eSet, f, biomart_dataset = "hsapiens_gene_ensembl")

# To force use of the bovine affy_bovine microarray annotations use:
GO_analyse(eSet, f, microarray = "affy_bovine")

## End(Not run)