predict.EvoWeaver: Make predictions with EvoWeaver objects

View source: R/EvoWeaver-class.R

Description

This S3 method predicts pairwise functional associations between gene groups encoded in an EvoWeaver object. By default it returns a data.frame of pairwise scores; if ReturnDataFrame=FALSE, it instead returns EvoWeb objects, which are essentially adjacency matrices with some extra S3 methods to make printing cleaner.

Usage

## S3 method for class 'EvoWeaver'
predict(object, Method='Ensemble',
         Subset=NULL, Processors=1L,
         MySpeciesTree=SpeciesTree(object, Verbose=Verbose),
         PretrainedModel="KEGG",
         NoPrediction=FALSE,
         ReturnDataFrame=TRUE,
         Verbose=interactive(),
         CombinePVal=TRUE, ...)

Arguments

object

An EvoWeaver object

Method

Method(s) to use for prediction. This can be a character vector with multiple entries for predicting using multiple methods. See 'Details' for more information.

Subset

Subset of data to predict on. This can either be a vector or an Nx2 matrix.

If a vector, prediction proceeds for all possible pairs of elements specified in the vector (either by name, for a character vector, or by index, for a numeric vector). For example, Subset=1:3 will predict for pairs (1,2), (1,3), (2,3).

If a matrix, Subset is interpreted as a matrix of pairs, where each row of the matrix specifies a pair to evaluate. These can also be specified by name (character) or by index (numeric).

Subset=rbind(c(1,2),c(1,3),c(2,3)) produces equivalent functionality to Subset=1:3.
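
To make the equivalence concrete, the following base R sketch enumerates the pairs implied by a vector and compares them with the explicit matrix form. It only illustrates which pairs are meant; it does not call EvoWeaver or reproduce its internals, and the variable names are illustrative.

```r
# Pairs implied by a vector such as 1:3: every unordered pair of its elements
pairs_from_vector <- t(combn(1:3, 2))

# The same pairs written out explicitly, one pair per row
pairs_matrix <- rbind(c(1, 2), c(1, 3), c(2, 3))

# Both describe the pairs (1,2), (1,3), (2,3)
same_pairs <- all(pairs_from_vector == pairs_matrix)
print(same_pairs)
```

The matrix form is mainly useful when only some of the possible pairs are of interest.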

Processors

Number of cores to use for methods that support multithreaded execution. Setting to NULL or a negative value will use the value of detectCores(), or one core if the number of available cores cannot be determined. See Note for more information.
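
Users who prefer to choose the core count explicitly, rather than rely on the NULL behaviour, can compute it themselves. The sketch below shows one conservative choice (one fewer core than detected, with a floor of one). It uses only parallel::detectCores(); the variable names are illustrative and this is not EvoWeaver's internal logic.

```r
library(parallel)

# detectCores() can return NA when the core count cannot be determined
n_detected <- detectCores()

# Leave one core free, but never go below one core
n_processors <- if (is.na(n_detected)) 1L else max(1L, as.integer(n_detected) - 1L)

# n_processors could then be supplied as Processors=n_processors
print(n_processors)
```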

MySpeciesTree

Phylogenetic tree of all genomes in the dataset. Required for Method=c('RPContextTree', 'GLDistance', 'CorrGL', 'MoransI', 'Behdenna'). 'Behdenna' requires a rooted, bifurcating tree (other values of Method can handle arbitrary trees). Note that EvoWeaver can automatically infer a species tree if initialized with dendrogram objects.

PretrainedModel

A pretrained model for use with ensemble predictions. The default value is "KEGG", corresponding to a built-in ensemble model trained on the KEGG MODULE database. Alternative values allowed are "CORUM", for a built-in ensemble model trained on the CORUM database, or any user-trained model. See the examples for how to train an ensemble method to pass to PretrainedModel.

Has no effect if Method != 'Ensemble'.

NoPrediction

For Method='Ensemble', should data be returned prior to making predictions?

If TRUE, this will instead return a data.frame object with predictions from each algorithm for each pair. This dataframe is typically used to train an ensemble model.

If FALSE, EvoWeaver will return predictions for each pair (using user model if provided or a built-in otherwise).

ReturnDataFrame

Logical indicating whether to return a data.frame object or a list of EvoWeb objects. Defaults to TRUE. Setting this parameter to FALSE is not recommended for typical users.

Verbose

Logical indicating whether to print progress bars and messages. Defaults to interactive(), i.e., TRUE in interactive sessions and FALSE otherwise.

CombinePVal

Logical indicating whether to combine scores and p-values or to return them as separate values. Defaults to TRUE.

...

Additional parameters for other predictors and consistency with generic.

Details

predict.EvoWeaver wraps several methods to create an easy interface for multiple prediction types. Method='Ensemble' is the default value, but each of the component analyses can also be accessed. Common arguments to Method include:

  • 'Ensemble': Ensemble prediction combining individual coevolutionary predictors. See Note below.

  • 'PhylogeneticProfiling': All Phylogenetic Profiling Algorithms used in the EvoWeaver manuscript.

  • 'PhylogeneticStructure': All EvoWeaver Phylogenetic Structure Methods

  • 'GeneOrganization': All EvoWeaver Gene Organization Methods

  • 'SequenceLevel': All EvoWeaver Sequence Level Methods used in the EvoWeaver manuscript.

Additional information and references for each prediction algorithm can be found at the following pages:

  • EvoWeaver Phylogenetic Profiling Methods

  • EvoWeaver Phylogenetic Structure Methods

  • EvoWeaver Gene Organization Methods

  • EvoWeaver Sequence-Level Methods

The standard return type is a data.frame object with one column per predictor and an additional two columns specifying the genes in each pair. If ReturnDataFrame=FALSE, this instead returns a list of EvoWeb objects. See EvoWeb for more information. Setting ReturnDataFrame=FALSE is discouraged.

By default, EvoWeaver weights scores by their p-value to correct for spurious correlations. The returned scores are raw_score*(1-p_value). If CombinePVal=FALSE, EvoWeaver will instead return the raw score and the p-value separately. The resulting data.frame will have one column for the raw score (denoted METHOD.score) and one column for the p-value (denoted METHOD.pval). **Note: p-values are recorded as (1-p)**. Not all methods support returning p-values separately from the score; in this case, only a METHOD.score column will be returned.
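
The weighting described above can be worked through with plain arithmetic. The numbers below are made up, and "METHOD" is the same placeholder used in the text rather than a real method name; this is a sketch of the formula, not output from EvoWeaver.

```r
# Made-up raw scores and p-values for three hypothetical gene pairs
raw_score <- c(0.80, 0.50, 0.30)
p_value   <- c(0.01, 0.20, 0.90)

# CombinePVal=TRUE analogue: scores are weighted by (1 - p)
combined <- raw_score * (1 - p_value)

# CombinePVal=FALSE analogue: score and (1 - p) kept as separate columns
separate <- data.frame(METHOD.score = raw_score, METHOD.pval = 1 - p_value)

# The combined score can be recovered from the separate columns
recombined <- separate$METHOD.score * separate$METHOD.pval
print(combined)
```

Because the p-value column already stores 1-p, multiplying the two columns directly reproduces the combined score.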

Different methods require different types of input. The constructor EvoWeaver will notify the user which methods are runnable with the given data. Method='Ensemble' automatically selects the methods that can be run with the given input data.

See EvoWeaver for more information on input data types.

Complete listing of all supported methods (asterisk denotes a method used in Ensemble, if possible):

  • 'ExtantJaccard': Jaccard Index of Presence/Absence (P/A) profiles at extant leaves

  • 'Hamming': Hamming similarity of P/A profiles

  • * 'GLMI': Mutual Information (MI) of Gain/Loss (G/L) profiles

  • 'PAPV': 1-p_value of P/A profiles

  • 'ProfDCA': Direct Coupling Analysis of P/A profiles

  • 'Behdenna': Analysis of Gain/Loss events following Behdenna et al. (2016)

  • 'CorrGL': Correlation of ancestral Gain/Loss events

  • * 'GLDistance': Score-based method based on distance between inferred ancestral Gain/Loss events

  • * 'PAJaccard': Centered Jaccard distance of P/A profiles with conserved clades collapsed

  • * 'PAOverlap': Conservation of ancestral states based on P/A profiles

  • * 'RPMirrorTree': MirrorTree using Random Projection for dimensionality reduction

  • * 'RPContextTree': MirrorTree with Random Projection correcting for species tree and P/A conservation

  • * 'GeneDistance': Co-localization analysis

  • * 'MoransI': Co-localization analysis using Moran's I for phylogenetic correction and significance

  • * 'OrientationMI': Mutual Information of Gene Relative Orientation

  • * 'GeneVector': Correlation of distribution of sequence level residues following Zhao et al. (2022)

  • * 'SequenceInfo': Mutual information of sites in multiple sequence alignment
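
For readers unfamiliar with presence/absence (P/A) profiles, the following base R sketch computes textbook versions of two of the simpler quantities named above on made-up profiles. It assumes the standard definition of the Jaccard index and takes Hamming similarity to be the fraction of matching positions; EvoWeaver's own implementations may differ in detail.

```r
# Presence/absence (P/A) of two hypothetical gene groups across five genomes
pa_a <- c(1, 1, 0, 1, 0)
pa_b <- c(1, 0, 0, 1, 1)

# Jaccard index: genomes with both present / genomes with at least one present
jaccard <- sum(pa_a & pa_b) / sum(pa_a | pa_b)

# Hamming similarity, taken here as the fraction of genomes where profiles agree
hamming_similarity <- mean(pa_a == pa_b)

print(c(jaccard = jaccard, hamming = hamming_similarity))
```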

Value

Returns a data.frame object where each row corresponds to a single prediction for a pair of gene groups. The first two columns contain the gene group identifiers for each pair, and the remaining columns contain each prediction.

If ReturnDataFrame=FALSE, the return type is a list of EvoWeb objects. See EvoWeb for more info.
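
Since the default output is a pairwise data.frame rather than an adjacency matrix, it can be convenient to reshape one score column into a symmetric matrix. The sketch below does this in base R on a mock data.frame; the column names (GeneA, GeneB, Score) are placeholders chosen for illustration and are not claimed to match EvoWeaver's actual column names.

```r
# Mock result: first two columns identify the pair, third holds one score
res <- data.frame(
  GeneA = c("g1", "g1", "g2"),
  GeneB = c("g2", "g3", "g3"),
  Score = c(0.9, 0.2, 0.5),
  stringsAsFactors = FALSE
)

# All gene groups that appear in either identifier column
genes <- sort(unique(c(res[[1]], res[[2]])))

# Empty square matrix, NA where no prediction exists (e.g. the diagonal)
adj <- matrix(NA_real_, length(genes), length(genes),
              dimnames = list(genes, genes))

# Fill both (i, j) and (j, i) so the matrix is symmetric
idx <- cbind(match(res[[1]], genes), match(res[[2]], genes))
adj[idx] <- res$Score
adj[idx[, 2:1, drop = FALSE]] <- res$Score
print(adj)
```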

Note

EvoWeaver's publication used a random forest model from the randomForest package for prediction. The next release of EvoWeaver will include multiple new built-in ensemble methods; planned algorithms are random forests and feed-forward neural networks. In the interim, users are advised to rely on randomForest or neuralnet. Feel free to contact me regarding other models you would like to see added.

If Processors is set to NULL, EvoWeaver will use one less core than is detected, or one core if detectCores() cannot detect the number of available cores. This is because of a recurring issue on my machine where the R session takes all available cores and is then locked out of forking processes, and the only solution is to restart the entire R session. This may be an issue specific to ARM Macs, but out of an abundance of caution I've made this setting slightly slower but guaranteed to complete rather than risk bricking a machine.

If ReturnDataFrame=FALSE and CombinePVal=FALSE, the resulting EvoWeb objects will contain values of type 'complex'. For each value, the real part denotes the raw score, and the imaginary part denotes 1-p, with p the p-value.
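
The complex encoding can be unpacked with base R's Re() and Im(). The value below is made up for illustration and is not taken from a real EvoWeb object.

```r
# A made-up entry: raw score 0.8 with p-value 0.05, stored as score + (1 - p)i
entry <- complex(real = 0.8, imaginary = 1 - 0.05)

raw_score <- Re(entry)       # real part: raw score
p_value   <- 1 - Im(entry)   # imaginary part holds 1 - p, so invert it

# Same weighting as the CombinePVal=TRUE formula, raw_score * (1 - p)
combined <- Re(entry) * Im(entry)
print(c(raw_score = raw_score, p_value = p_value, combined = combined))
```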

Author(s)

Aidan Lakshman ahl27@pitt.edu

See Also

EvoWeaver

EvoWeb

EvoWeaver Phylogenetic Profiling Predictors

EvoWeaver Phylogenetic Structure Predictors

EvoWeaver Gene Organization Predictors

EvoWeaver Sequence-Level Predictors

Examples

###############
## Prediction with built-in model and data
###############

set.seed(555L)
exData <- get(data("ExampleStreptomycesData"))
ew <- EvoWeaver(exData$Genes[1:50], MySpeciesTree=exData$Tree)

# Subset isn't necessary but is faster for a working example
evoweb1 <- predict(ew, Subset=1:2)

# Print the results (a data.frame of pairwise scores by default)
if(interactive()) print(evoweb1)

###############
## Training own ensemble model
###############

datavals <- evoweb1[,-c(1,2,10)]
actual_values <- sample(c(0,1), nrow(datavals), replace=TRUE)
# This example just picks random numbers
# ***Do not do this for your own models***

# Make sure the actual values correspond to the right pairs!
datavals[,'y'] <- actual_values
myModel <- glm(y~., datavals[,-c(1,2)], family='binomial')

testEvoWeaverObject <- EvoWeaver(exData$Genes[51:60], MySpeciesTree=exData$Tree)
evoweb2 <- predict(testEvoWeaverObject,
                     PretrainedModel=myModel)

# Print result as a data.frame of pairwise scores
if(interactive()) print(evoweb2)

npcooley/SynExtend documentation built on Jan. 16, 2025, 10:28 a.m.