Combines gene expression data with Gene Ontology (GO) annotations to rank and visualise genes and GO terms enriched for genes best clustering predefined groups of samples based on gene expression levels.
This method semi-automatically retrieves the latest information from Ensembl using the biomaRt package, unless custom GO annotations are provided. Custom GO annotations have two main benefits: first, they allow the analysis of species not supported by the Ensembl BioMart server; second, they save time by skipping calls to the Ensembl BioMart server for species that are supported. The latter also makes it possible to use an older release of the Ensembl annotations, for example.
Using default settings, a random forest analysis is performed to evaluate the ability of each gene to cluster samples according to a predefined grouping factor (one-way ANOVA is available as an alternative). Each GO term is scored and ranked according to the average rank (alternatively, average power) of all associated genes to cluster the samples according to the factor. The ranked list of GO terms is returned, along with tools to visualise the statistics on a per-gene and per-ontology basis.
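The scoring scheme described above can be sketched in base R. This is a minimal illustration on simulated data, not the package's implementation: it uses the one-way ANOVA alternative (via oneway.test) to score genes, and the expression matrix, gene names, and GO identifiers are all made-up placeholders.

```r
set.seed(1)

# Toy expression matrix: 6 genes (rows) x 8 samples (columns).
expr <- matrix(rnorm(6 * 8), nrow = 6,
               dimnames = list(paste0("gene", 1:6), paste0("s", 1:8)))
group <- factor(rep(c("A", "B"), each = 4))

# Make gene1 and gene2 clearly different between the two groups.
expr[c("gene1", "gene2"), group == "B"] <-
  expr[c("gene1", "gene2"), group == "B"] + 10

# Score each gene with the one-way ANOVA F statistic.
gene_score <- apply(expr, 1, function(x) {
  unname(oneway.test(x ~ group)$statistic)
})

# Rank genes: rank 1 is the gene that best separates the groups.
gene_rank <- rank(-gene_score)

# Hypothetical gene-to-GO mapping (identifiers are placeholders).
mapping <- data.frame(
  gene = c("gene1", "gene2", "gene3", "gene4", "gene5", "gene6", "gene1"),
  go   = c("GO:A", "GO:A", "GO:B", "GO:B", "GO:C", "GO:C", "GO:C"),
  stringsAsFactors = FALSE
)

# Score each GO term by the average rank of its associated genes,
# then order GO terms from best (lowest average rank) to worst.
go_avg_rank <- sapply(split(gene_rank[mapping$gene], mapping$go), mean)
go_table <- sort(go_avg_rank)
print(go_table)
```

Passing a different summary function in place of mean (for example median) corresponds to the idea behind the FUN.GO argument.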
eSet |
|
f |
A column name in |
subset |
A named list to subset |
biomart_name |
The Ensembl BioMart database to connect to. |
biomart_dataset |
The Ensembl BioMart dataset identifier corresponding to the species studied. If not specified and no custom annotations were provided, the method will attempt to automatically identify the adequate dataset from the first feature identifier in the dataset. Use data(prefix2dataset) to access a table listing valid choices. |
microarray |
The identifier in the Ensembl BioMart corresponding to the microarray platform used. If not specified and no custom annotations were provided, the method will attempt to automatically identify the platform used from the first feature identifier in the dataset.
Use |
method |
The statistical framework to score genes and gene ontologies. Either "randomForest" or "rf" to use the random forest algorithm, or alternatively either of "anova" or "a" to use the one-way ANOVA model. Default is "randomForest". |
rank.by |
Either "rank" or "score", to choose the metric used to order the gene and GO term result tables. Default is "rank". |
do.trace |
Only used if method="randomForest". If set to TRUE, gives a more verbose output as randomForest is run. If set to some integer, then running output is printed for every do.trace trees. Default is 100. |
ntree |
Only used if method="randomForest". Number of trees to grow. This should be set to a number large enough to ensure that every input row gets predicted at least a few times. |
mtry |
Only used if method="randomForest". Number of features randomly sampled as candidates at each split. Default value is 2*sqrt(gene_count) which is approximately 220 genes for a dataset of 12,000 genes. |
GO_genes |
Custom annotations associating features present in the expression dataset
to gene ontology identifiers. This must be provided as a data-frame of
two columns, named |
all_GO |
Custom annotations used to annotate each GO identifier present in
|
all_genes |
Custom annotations used to annotate each feature identifier in the
expression dataset with the gene name or symbol (e.g. "TNF"), and an
optional description. This must be provided as a data-frame containing at
least a column named |
FUN.GO |
Function to summarise the score and rank of all features associated with
each gene ontology. Default is |
... |
Additional arguments passed on to the |
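To make the default mtry value described in the arguments above concrete, the arithmetic can be checked directly. The helper default_mtry is hypothetical and only reproduces the stated formula; whether the package rounds the result is not specified here.

```r
# Hypothetical helper reproducing the documented formula: 2 * sqrt(gene_count).
default_mtry <- function(gene_count) {
  2 * sqrt(gene_count)
}

# For a dataset of 12,000 genes this gives roughly 220 features per split.
mtry_12k <- default_mtry(12000)
print(mtry_12k)
```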
The default scoring functions strongly favor GO terms associated with fewer genes at the top of the ranking. This bias may actually be seen as a valuable feature which enables the user to browse through GO terms of increasing "granularity", i.e. associated with increasingly large sets of genes, although consequently being increasingly vague and blurry (e.g. "protein binding" molecular function associated with over 6,000 genes).
It is suggested to use the subset_scores() function to subsequently filter out GO terms associated with fewer than 5 genes (or a higher threshold). Indeed, those GO terms are more sensitive to outlier genes, as they were scored on the average of only a handful of genes.
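The effect of such a filter can be illustrated with base R on a toy table. This sketch does not call subset_scores(); the data frame and its column names (go_id, ave_rank, gene_count) are hypothetical and chosen only for illustration.

```r
# Hypothetical GO term table; column names are placeholders.
go_scores <- data.frame(
  go_id      = c("GO:A", "GO:B", "GO:C", "GO:D"),
  ave_rank   = c(3.5, 10.2, 25.0, 40.1),
  gene_count = c(2, 7, 5, 120),
  stringsAsFactors = FALSE
)

# Keep only GO terms annotated with at least min_genes genes.
min_genes <- 5
filtered <- go_scores[go_scores$gene_count >= min_genes, ]
print(filtered)
```

Here the top-ranked term is dropped because its average rank rests on only two genes, which is the situation the filter is meant to guard against.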
Additionally, the pValue_GO function may be used to generate a permutation-based P-value estimating the probability of each GO term reaching an equal or higher rank (or score) by chance.
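The general idea of a permutation-based P-value for a GO term can be sketched as follows. This is an assumption-laden illustration in base R, not the pValue_GO implementation: it draws random gene sets of the same size as the GO term and counts how often their average score matches or exceeds the observed one. The gene names and scores are invented.

```r
set.seed(42)

# Toy gene scores: higher means better at separating the groups.
gene_score <- c(g1 = 9.0, g2 = 8.5, g3 = 1.2, g4 = 0.8, g5 = 0.5,
                g6 = 0.4, g7 = 0.3, g8 = 0.2, g9 = 0.1, g10 = 0.05)

# Hypothetical GO term annotated with two genes.
go_genes <- c("g1", "g2")
observed <- mean(gene_score[go_genes])

# Null distribution: average score of random gene sets of the same size.
n_perm <- 1000
null_scores <- replicate(n_perm, mean(sample(gene_score, length(go_genes))))

# Proportion of permutations reaching an equal or higher score;
# the +1 terms keep the estimate away from exactly zero.
p_value <- (sum(null_scores >= observed) + 1) / (n_perm + 1)
print(p_value)
```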
A list containing the results of the analysis. Some elements are specific to the output of each analysis method.
Core elements:
GO |
A table ranking all GO terms related to genes in the expression dataset based on the average ability of their related genes to cluster the samples according to the predefined grouping factor. |
mapping |
The table mapping genes present in the dataset to GO terms. |
genes |
A table ranking all genes present in the expression dataset based on their ability to cluster the samples according to the predefined grouping factor. |
factor |
The predefined grouping factor. |
method |
The statistical framework used. |
subset |
The filters used to run the analysis only on a subset of the samples. NULL if no filter was applied. |
rank.by |
The metric used to rank order the genes and gene ontologies. |
FUN.GO |
The function used to summarise the score and rank of all gene features associated with each gene ontology. |
Random Forest additional elements:
ntree |
Number of trees grown. |
mtry |
Number of variables randomly sampled as candidates at each split. |
One-way ANOVA does not have additional elements.
Make sure that the factor f is an actual factor in the R language sense. This is important for the underlying statistical framework to identify the groups of samples defined by the levels of this factor. If the column defining the factor (e.g. "Treatment") in phenodata is not an R factor, use
pData(targets)$Treatment = factor(pData(targets)$Treatment)
to convert the character values into an actual R factor with appropriate levels.
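The check and conversion described in this warning can be tried on a plain data frame. In this sketch, pheno is a hypothetical stand-in for the phenodata of an ExpressionSet, so that no Bioconductor package is needed to run it.

```r
# Plain data frame standing in for the phenodata of an ExpressionSet.
pheno <- data.frame(
  Treatment = c("MB", "TB", "MB", "TB"),
  Time      = c("48H", "48H", "24H", "24H"),
  stringsAsFactors = FALSE
)

# The column starts out as a character vector, not a factor.
was_factor <- is.factor(pheno$Treatment)

# Convert the character values into an R factor.
pheno$Treatment <- factor(pheno$Treatment)

# The column is now a factor with one level per treatment group.
print(is.factor(pheno$Treatment))
print(levels(pheno$Treatment))
```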
Kevin Rue-Albrecht
Methods subset_scores, pValue_GO, getBM, randomForest, and oneway.test.
# Load example data subset
data(AlvMac)
# Load a local copy of annotations obtained from the Ensembl Biomart server
data(AlvMac_GOgenes)
data(AlvMac_allgenes)
data(AlvMac_allGO)
# Run the analysis on factor "Treatment",
# considering only treatments "MB" and "TB" at time-point "48H"
# using a local copy of annotations obtained from the Ensembl BioMart server
AlvMac_results <- GO_analyse(
eSet=AlvMac, f="Treatment",
subset=list(Time=c("48H"), Treatment=c("MB", "TB")),
GO_genes=AlvMac_GOgenes, all_genes=AlvMac_allgenes, all_GO=AlvMac_allGO
)
# Valid Ensembl BioMart datasets are listed in the following variable
data(prefix2dataset)
# Valid microarray= values are listed in the following variable
data(microarray2dataset)
## Not run:
# Other valid but time-consuming examples:
# Run the analysis on factor "Treatment" including all samples
GO_analyse(eSet=AlvMac, f="Treatment")
# Run the analysis on factor "Treatment" using ANOVA method
GO_analyse(eSet=AlvMac, f="Treatment", method="anova")
# Use alternative GO scoring/summarisation functions (Default is: average)
# Named functions
GO_analyse(eSet=AlvMac, f="Treatment", FUN.GO = median)
# Anonymous functions (simple example without scientific value)
GO_analyse(eSet=AlvMac, f="Treatment", FUN.GO = function(x){median(x)/100})
# Syntax examples without actual data:
# To force the use of the Ensembl BioMart for the human species, use:
GO_analyse(eSet, f, biomart_dataset = "hsapiens_gene_ensembl")
# To force use of the bovine affy_bovine microarray annotations use:
GO_analyse(eSet, f, microarray = "affy_bovine")
## End(Not run)