expression_prediction: GAPGOM - expression_prediction()

View source: R/similarity_prediction.R

Description

Predicts annotation of un-annotated genes based on existing Gene Ontology annotation data and correlated expression patterns.

Usage

expression_prediction(gene_id, expression_set, organism, ontology,
  enrichment_cutoff = 250, method = "combine", significance = 0.05,
  go_amount = 5, filter_pvals = FALSE, idtype = "ENTREZID",
  verbose = FALSE, id_select_vector = NULL, id_translation_df = NULL,
  go_data = NULL)

Arguments

gene_id

Row name of the gene whose annotation is to be predicted; it is compared against the other genes and their GO terms.

expression_set

ExpressionSet class containing expression values and other useful information, see GAPGOM::f5_example_data documentation for further explanation of this type. If you want a custom ExpressionSet you have to define one yourself.

organism

Organism that the genes to be scanned belong to. This option is necessary to select the correct GO DAG. Options are based on the org.db Bioconductor packages; see http://www.bioconductor.org/packages/release/BiocViews.html#___OrgDb. The following options are available: "fly", "mouse", "rat", "yeast", "zebrafish", "worm", "arabidopsis", "ecolik12", "bovine", "canine", "anopheles", "ecsakai", "chicken", "chimp", "malaria", "rhesus", "pig", "xenopus". Fantom5 data only has "human" and "mouse" available, depending on the dataset.

ontology

desired ontology to use for prediction. One of three; "BP" (Biological process), "MF" (Molecular function) or "CC" (Cellular Component). Cellular Component is not included with the package's standard data and will thus yield no results.

enrichment_cutoff

Cutoff for the number of genes to include in the enrichment analysis. Default is 250.

method

Statistical method to use for the prediction. Currently there are six available: "pearson", "spearman", "kendall", "fisher", "sobolev" and "combine".

significance

Normalized p-values (FDR) below this number will be kept. Has to be a float/double between 0 and 1. Default is 0.05.

go_amount

Minimum number of GO terms that a result needs to have to be considered similar enough.

filter_pvals

Filters out p-values that are equal to 0. Default is FALSE.

idtype

ID type of the expression data. If it is not specified correctly, the error message will list the available ID types. Default is "ENTREZID".

verbose

Set to TRUE for more informative/elaborate output.

id_select_vector

Gene row name(s) that you want to keep in the dataset. For example, if you only need to include protein coding genes, you make a vector containing only the IDs that are protein coding. Most importantly, this vector is used in the GO term enrichment, which means that it should only contain genes that are annotated in the GO databases.

id_translation_df

Data frame with translations between ID and GOID. col1 = ID, col2 = GOID. This may be generated with ".generate_translation_df()", but this is not officially supported. It can be useful when running multiple analyses on the same ExpressionSet because it improves performance.

go_data

A GOSemSim go_data object, as returned by the set_go_data function.
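
As a rough illustration of the two-column layout described for id_translation_df (col1 = ID, col2 = GOID), the sketch below builds such a data frame by hand in base R. The gene IDs are hypothetical placeholders, the column names are taken from the argument description, and whether the function requires these exact names or only the column order is an assumption not confirmed here.

```r
# Hand-built translation table: one row per (gene ID, GO ID) pair.
# The IDs "1" and "2" are hypothetical placeholders, not real annotations.
id_translation_df <- data.frame(
  ID   = c("1", "1", "2"),
  GOID = c("GO:0008150", "GO:0003674", "GO:0008150"),
  stringsAsFactors = FALSE
)

# A gene with several GO annotations appears on several rows.
go_per_gene <- split(id_translation_df$GOID, id_translation_df$ID)
```

Here go_per_gene is a named list with one character vector of GO IDs per gene, which is a convenient way to check the table before passing it on.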

Details

This function is specifically made for predicting lncRNA annotation by assuming "guilt by association". For instance, the expression data in this package is actually based on mRNA expression data, but correlated with lncRNA. This expression data is then used in combination with mRNA GO annotation to calculate similarity scores between GO terms.
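
The "guilt by association" idea can be sketched in base R as follows: genes whose expression profiles correlate most strongly with the query gene are collected, and their annotations would then feed the enrichment step. This is a simplified illustration on random data, not the package's internal implementation; the gene and sample names, the use of Pearson correlation, and the cutoff of 3 (standing in for enrichment_cutoff) are all assumptions made for the example.

```r
set.seed(1)

# Toy expression matrix: 6 hypothetical genes measured in 10 samples.
expr <- matrix(rnorm(6 * 10), nrow = 6,
               dimnames = list(paste0("gene_", 1:6),
                               paste0("sample_", 1:10)))

query <- "gene_1"
others <- expr[rownames(expr) != query, , drop = FALSE]

# Correlate the query profile with every other gene's profile.
scores <- apply(others, 1, function(x) {
  cor(expr[query, ], x, method = "pearson")
})

# Keep the most strongly correlated genes (cutoff chosen for illustration).
top <- head(sort(scores, decreasing = TRUE), 3)
```

The names of top are the candidate genes whose GO annotations would be tested for enrichment in a real analysis.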

Value

The resulting data frame with the prediction of similar GO terms, ordered by FDR value. The data frame contains the following columns: GOID - Gene Ontology ID, Ontology - ontology type (MF or BP), FDR - False Discovery Rate, Term - description of the GOID, used_method - the method used to determine the ontology term similarity.
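
To show how a result with these columns might be handled, the sketch below builds a mock data frame, orders it by FDR, and applies a significance threshold. The raw p-values are made up for illustration, and the use of p.adjust(method = "fdr") (Benjamini-Hochberg) is an assumption about how the FDR column could be derived, not a statement about the package's internals.

```r
# Made-up raw p-values, adjusted with base R's Benjamini-Hochberg method.
raw_p <- c(0.2, 0.001, 0.012)

mock_result <- data.frame(
  GOID        = c("GO:0007165", "GO:0006412", "GO:0008150"),
  Ontology    = "BP",
  FDR         = p.adjust(raw_p, method = "fdr"),
  Term        = c("signal transduction", "translation", "biological_process"),
  used_method = "combine",
  stringsAsFactors = FALSE
)

# Order by FDR (smallest first) and keep rows below the threshold.
significance <- 0.05
ordered_result <- mock_result[order(mock_result$FDR), ]
significant <- ordered_result[ordered_result$FDR < significance, ]
```

The same ordering and subsetting steps apply to the data frame returned by a real call, provided it carries the columns listed above.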

Examples

# Example with the default dataset. Take a look at the data documentation
# to fully understand how the filter is built (the data is a Biobase
# ExpressionSet).

library(Biobase)

# keep everything that is a protein coding gene (for annotation)
filter_vector <- pData(featureData(GAPGOM::expset))[(
pData(featureData(GAPGOM::expset))$GeneType=="protein_coding"),]$GeneID
# set gid and run.
gid <- "ENSG00000228630"

result <- GAPGOM::expression_prediction(gid, 
                                        GAPGOM::expset, 
                                        "human", 
                                        "BP",
                                        id_translation_df = 
                                          GAPGOM::id_translation_df,
                                        id_select_vector = filter_vector,
                                        method = "combine", verbose = TRUE, 
                                        filter_pvals = TRUE
)

GAPGOM documentation built on Nov. 8, 2020, 8:08 p.m.