expression_prediction: GAPGOM - expression_prediction()
In GAPGOM: GAPGOM (novel Gene Annotation Prediction and other GO Metrics)


Predicts annotation of un-annotated genes based on existing Gene Ontology annotation data and correlated expression patterns.

expression_prediction(gene_id, expression_set, organism, ontology,
  enrichment_cutoff = 250, method = "combine", significance = 0.05,
  go_amount = 5, filter_pvals = FALSE, idtype = "ENTREZID",
  verbose = FALSE, id_select_vector = NULL, id_translation_df = NULL,
  go_data = NULL)

`gene_id`	row name of the query gene in `expression_set`; this is the gene for which GO terms are predicted.
`expression_set`	ExpressionSet class containing expression values and other useful information, see GAPGOM::f5_example_data documentation for further explanation of this type. If you want a custom ExpressionSet you have to define one yourself.
`organism`	organism that the genes to be scanned belong to. This option is necessary to select the correct GO DAG. Options are based on the org.db Bioconductor packages; http://www.bioconductor.org/packages/release/BiocViews.html#___OrgDb The following options are available: "fly", "mouse", "rat", "yeast", "zebrafish", "worm", "arabidopsis", "ecolik12", "bovine", "canine", "anopheles", "ecsakai", "chicken", "chimp", "malaria", "rhesus", "pig", "xenopus". Fantom5 data only has "human" and "mouse" available, depending on the dataset.
`ontology`	desired ontology to use for prediction. One of three; "BP" (Biological process), "MF" (Molecular function) or "CC" (Cellular Component). Cellular Component is not included with the package's standard data and will thus yield no results.
`enrichment_cutoff`	cutoff for the number of genes to be used in the enrichment analysis. (default is 250)
`method`	which statistical method to use for the prediction. Currently there are six available: "pearson", "spearman", "kendall", "fisher", "sobolev" and "combine".
`significance`	normalized p-values (FDR) that are below this number will be kept. Has to be a float/double between 0 and 1. Default is 0.05.
`go_amount`	minimal number of GO terms that a result needs to have to be considered similar enough.
`filter_pvals`	filters pvalues that are equal to 0 (Default=FALSE).
`idtype`	idtype of the expression_data. If not correctly specified, error will specify available IDs. default="ENTREZID"
`verbose`	set to true for more informative/elaborate output.
`id_select_vector`	gene rowname(s) that you want to keep in the dataset. For example, if you need to include only protein-coding genes, you make a vector containing only the IDs that are protein coding. Most importantly, this vector is used in the GO term enrichment, which means that it should only contain genes that are annotated in the GO databases.
`id_translation_df`	data frame with translations between ID and GOID. col1 = ID, col2 = GOID. (This may be generated with ".generate_translation_df()", but this is not officially supported. It might be useful when running multiple analyses on the same ExpressionSet because it improves performance.)
`go_data`	from the set_go_data function. A GOSemSim go_data object.
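
To make the roles of `id_translation_df` and `id_select_vector` in the enrichment step more concrete, here is a minimal base R sketch. It is not GAPGOM's internal code: the gene IDs, GO IDs, the `top_genes` set and the helper `enrichment_pvalue()` are all made up for illustration, and the one-sided hypergeometric test is an assumption about what a GO term enrichment of this kind typically looks like.

```r
# Toy translation table in the documented layout: col1 = ID, col2 = GOID.
# All gene IDs and GO IDs below are made up for illustration.
id_translation_df <- data.frame(
  ID   = c("g1", "g1", "g2", "g3", "g4", "g5", "g6"),
  GOID = c("GO:A", "GO:B", "GO:A", "GO:A", "GO:B", "GO:B", "GO:B"),
  stringsAsFactors = FALSE
)

# Background of annotated genes (the role id_select_vector plays here).
id_select_vector <- c("g1", "g2", "g3", "g4", "g5", "g6")

# Hypothetical set of genes most correlated with the query gene.
top_genes <- c("g1", "g2", "g3")

# Hypothetical helper: one-sided hypergeometric test for
# over-representation of a single GO term in the top set.
enrichment_pvalue <- function(goid, top, background, translation) {
  annotated <- unique(translation$ID[translation$GOID == goid])
  annotated <- intersect(annotated, background)
  k <- length(intersect(annotated, top))  # annotated genes in the top set
  m <- length(annotated)                  # annotated genes in the background
  n <- length(background) - m             # background genes without the term
  phyper(k - 1, m, n, length(top), lower.tail = FALSE)
}

go_ids <- unique(id_translation_df$GOID)
pvals <- vapply(go_ids, enrichment_pvalue, numeric(1),
                top = top_genes, background = id_select_vector,
                translation = id_translation_df)
print(pvals)
```

Because the background determines `m` and `n`, genes without any GO annotation in the background would dilute every term, which is consistent with the advice that `id_select_vector` should only contain annotated genes.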

This function is specifically made for predicting lncRNA annotation by assuming "guilt by association". For instance, the expression data in this package is based on mRNA expression data that is correlated with lncRNA expression. This expression data is then used in combination with mRNA GO annotation to calculate similarity scores between GO terms.
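
The "guilt by association" idea can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the expression matrix is synthetic, `rank_by_correlation()` is a hypothetical helper, and ranking by signed correlation (rather than, say, absolute correlation) is an assumption. The `method` values "pearson", "spearman" and "kendall" are the ones accepted by base R's `cor()`.

```r
set.seed(1)

# Toy expression matrix: rows are genes, columns are samples (made-up data).
expr <- matrix(rnorm(6 * 10), nrow = 6,
               dimnames = list(paste0("gene", 1:6), paste0("s", 1:10)))

# gene2 is a linear function of gene1, so their Pearson correlation is 1.
expr["gene2", ] <- 2 * expr["gene1", ] + 1

# Hypothetical helper: rank all other genes by correlation with the query
# gene and keep the `cutoff` best ones (compare `enrichment_cutoff`).
rank_by_correlation <- function(expr, gene_id, method = "pearson",
                                cutoff = 250) {
  others <- expr[rownames(expr) != gene_id, , drop = FALSE]
  # Correlate the query profile with each remaining row.
  scores <- apply(others, 1, cor, y = expr[gene_id, ], method = method)
  scores <- sort(scores, decreasing = TRUE)
  head(scores, cutoff)
}

top <- rank_by_correlation(expr, "gene1", method = "pearson", cutoff = 3)
print(names(top))
```

The genes returned here would then be the candidates whose GO annotation is transferred to the un-annotated query gene.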

A data frame with the predicted similar GO terms, ordered by FDR value. It contains the following columns: GOID - Gene Ontology ID, Ontology - ontology type (MF or BP), FDR - False Discovery Rate, Term - description of the GOID, used_method - the method used to determine the ontology term similarity.
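
As a rough illustration of how the `significance` threshold and the FDR ordering interact, the sketch below adjusts a handful of made-up p-values and keeps the significant ones. It assumes that "fdr" refers to the Benjamini-Hochberg correction as implemented by base R's `p.adjust(method = "fdr")`; whether GAPGOM uses exactly this call is not stated here.

```r
# Made-up raw p-values for five hypothetical GO terms.
raw <- data.frame(
  GOID = c("GO:1", "GO:2", "GO:3", "GO:4", "GO:5"),
  pval = c(0.001, 0.01, 0.02, 0.045, 0.5),
  stringsAsFactors = FALSE
)

# Benjamini-Hochberg adjustment ("fdr" is an alias for "BH" in p.adjust).
raw$FDR <- p.adjust(raw$pval, method = "fdr")

# Keep terms below the significance threshold and order them by FDR.
significance <- 0.05
kept <- raw[raw$FDR < significance, ]
kept <- kept[order(kept$FDR), ]
print(kept)
```

Note that a term with a raw p-value below 0.05 (here the fourth one) can still be dropped once the p-values are adjusted for multiple testing.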

# Example with default dataset, take a look at the data documentation
# to fully grasp what's going on with making of the filter etc. (Biobase 
# ExpressionSet)

library(Biobase)

# keep everything that is a protein coding gene (for annotation)
feature_data <- pData(featureData(GAPGOM::expset))
filter_vector <- feature_data[feature_data$GeneType == "protein_coding", ]$GeneID
# set gid and run.
gid <- "ENSG00000228630"

result <- GAPGOM::expression_prediction(gid, 
                                        GAPGOM::expset, 
                                        "human", 
                                        "BP",
                                        id_translation_df = 
                                          GAPGOM::id_translation_df,
                                        id_select_vector = filter_vector,
                                        method = "combine", verbose = TRUE, 
                                        filter_pvals = TRUE
)