overfitRR: Testing RRphylo methods overfit

View source: R/overfitRR.R

overfitRRR Documentation

Testing RRphylo methods overfit

Description

Testing the robustness of search.trend (Castiglione et al. 2019a), search.shift (Castiglione et al. 2018), search.conv (Castiglione et al. 2019b), and PGLS_fossil results to sampling effects and phylogenetic uncertainty.

Usage

overfitRR(RR,y,phylo.list=NULL,s=0.25,swap.args=NULL,trend.args=NULL,shift.args=NULL,
conv.args=NULL, pgls.args=NULL,aces=NULL,x1=NULL,aces.x1=NULL,cov=NULL,
rootV=NULL,nsim=100,clus=0.5)

Arguments

RR

an object produced by RRphylo.

y

a named vector of phenotypes.

phylo.list

a list (or multiPhylo) of alternative phylogenies to be tested.

s

the percentage of tips to be cut off. It is set at 25% by default. If phylo.list is provided, this argument is ignored.

swap.args

a list of arguments to be passed to the function swapONE, including list(si=NULL,si2=NULL, node=NULL). If swap.arg is unspecified, the function automatically sets both si and si2 to 0.1. If phylo.list is provided, swapping is not performed.

trend.args

a list of arguments specific to the function search.trend, including list(node=NULL,x1.residuals=FALSE). If a trend for the whole tree is to be tested, type trend.args = list(). No trend is tested if left unspecified.

shift.args

a list of arguments specific to the function search.shift, including list(node=NULL, state=NULL). Arguments node and state can be specified at the same time.

conv.args

a list of arguments specific to the function search.conv, including list(node=NULL, state=NULL, declust=FALSE). Arguments node and state can be specified at the same time.

pgls.args

a list of arguments specific to the function PGLS_fossil, including list(modform, data, tree=FALSE,RR=TRUE,...). If tree=TRUE, PGLS_fossil is performed by using the RRphylo output tree as tree argument. If RR=TRUE, PGLS_fossil is performed by using the RRphylo output as RR argument. Arguments tree and RR can be TRUE at the same time. ... are further argument passed to PGLS_fossil.

aces

if used to produce the RR object, the vector of those ancestral character values at nodes known in advance must be specified. Names correspond to the nodes in the tree.

x1

the additional predictor to be specified if the RR object has been created using an additional predictor (i.e. multiple version of RRphylo). 'x1' vector must be as long as the number of nodes plus the number of tips of the tree, which can be obtained by running RRphylo on the predictor as well, and taking the vector of ancestral states and tip values to form the x1.

aces.x1

a named vector of ancestral character values at nodes for x1. It must be indicated if the RR object has been created using both aces and x1. Names correspond to the nodes in the tree.

cov

if used to produce the RR object, the covariate must be specified. As in RRphylo, the covariate vector must be as long as the number of nodes plus the number of tips of the tree, which can be obtained by running RRphylo on the covariate as well, and taking the vector of ancestral states and tip values to form the covariate.

rootV

if used to produce the RR object, the phenotypic value at the tree root must be specified.

nsim

number of simulations to be performed. It is set at 100 by default.

clus

the proportion of clusters to be used in parallel computing. To run the single-threaded version of overfitRR set clus = 0.

Details

Methods using a large number of parameters risk being overfit. This usually translates in poor fitting with data and trees other than the those originally used. With RRphylo methods this risk is usually very low. However, the user can assess how robust the results got by applying search.shift, search.trend, search.conv or PGLS_fossil are by running overfitRR. With the latter, the original tree and data are subsampled by specifying a s parameter, that is the proportion of tips to be removed from the tree. In some cases, though, removing as many tips as imposed by s would delete too many tips right in clades and/or states under testing. In these cases, the function maintains no less than 5 species at least in each clade/state under testing (or all species if there is less), reducing the sampling parameter s if necessary. Internally, overfitRR further shuffles the tree by using the function swapONE. Thereby, both the potential for overfit and phylogenetic uncertainty are accounted for straight away.

Otherwise, a list of alternative phylogenies can be supplied to overfitRR. In this case subsampling and swapping arguments are ignored, and robustness testing is performed on the alternative topologies as they are. If a clade has to be tested either in search.shift, search.trend, or search.conv, the function scans each alternative topology searching for the corresponding clade. If the species within such clade on the alternative topology differ more than 10 species within the clade in the original tree, the identity of the clade is considered disrupted and the test is not performed.

Value

The function returns a 'RRphyloList' object containing:

$mean.sampling the mean proportion of species actually removed from the tree over the iterations.

$tree.list a 'multiPhylo' list including the trees generated within overfitRR

$RR.list a 'RRphyloList' including the results of each RRphylo performed within overfitRR

$rootCI the 95% confidence interval around the root value.

$ace.regressions a 'RRphyloList' including the results of linear regression between ancestral state estimates before and after the subsampling.

$conv.results a list including results for search.conv performed under clade and state conditions. If a node pair is specified within conv.args, the $clade object contains the percentage of simulations producing significant p-values for convergence between the clades, and the proportion of tested trees (i.e. where the clades identity was preserved; always 1 if no phylo.list is supplied). If a state vector is supplied within conv.args, the object $state contains the percentage of simulations producing significant p-values for convergence within (single state) or between states (multiple states).

$shift.results a list including results for search.shift performed under clade and sparse conditions. If one or more nodes are specified within shift.args, the $clade object contains for each node the percentage of simulations producing significant p-value separated by shift sign, and the same figures by considering all the specified nodes as evolving under a single rate (all.clades). For each node the proportion of tested trees (i.e. where the clade identity was preserved; always 1 if no phylo.list is supplied) is also indicated. If a state vector is supplied within shift.args, the object $sparse contains the percentage of simulations producing significant p-value separated by shift sign ($p.states).

$trend.results a list including the percentage of simulations showing significant p-values for phenotypes versus age and absolute rates versus age regressions for the entire tree separated by slope sign ($tree). If one or more nodes are specified within trend.args, the list also includes the same results at nodes ($node) and the results for comparison between nodes ($comparison). For each node the proportion of tested trees (i.e. where the clade identity was preserved; always 1 if no phylo.list is supplied) is also indicated.

$pgls.results two 'RRphyloList' objects including results of PGLS_fossil performed by using the phylogeny as it is ($tree) or rescaled according to the RRphylo rates ($RR).

Author(s)

Silvia Castiglione, Carmela Serio, Pasquale Raia

References

Castiglione, S., Tesone, G., Piccolo, M., Melchionna, M., Mondanaro, A., Serio, C., Di Febbraro, M., & Raia, P. (2018). A new method for testing evolutionary rate variation and shifts in phenotypic evolution. Methods in Ecology and Evolution, 9: 974-983.doi:10.1111/2041-210X.12954

Castiglione, S., Serio, C., Mondanaro, A., Di Febbraro, M., Profico, A., Girardi, G., & Raia, P. (2019a) Simultaneous detection of macroevolutionary patterns in phenotypic means and rate of change with and within phylogenetic trees including extinct species. PLoS ONE, 14: e0210101. https://doi.org/10.1371/journal.pone.0210101

Castiglione, S., Serio, C., Tamagnini, D., Melchionna, M., Mondanaro, A., Di Febbraro, M., Profico, A., Piras, P.,Barattolo, F., & Raia, P. (2019b). A new, fast method to search for morphological convergence with shape data. PLoS ONE, 14, e0226949. https://doi.org/10.1371/journal.pone.0226949

See Also

overfitRR vignette ; search.trend vignette ; search.shift vignette ; search.conv vignette ;

Examples

## Not run: 
data("DataOrnithodirans")
DataOrnithodirans$treedino->treedino
DataOrnithodirans$massdino->massdino
DataOrnithodirans$statedino->statedino
cc<- 2/parallel::detectCores()

# Extract Pterosaurs tree and data
library(ape)
extract.clade(treedino,746)->treeptero
massdino[match(treeptero$tip.label,names(massdino))]->massptero
massptero[match(treeptero$tip.label,names(massptero))]->massptero


RRphylo(tree=treedino,y=massdino,clus=cc)->dinoRates
RRphylo(tree=treeptero,y=log(massptero),clus=cc)->RRptero

# Case 1 search.shift under both "clade" and "sparse" condition
search.shift(RR=dinoRates, status.type= "clade")->SSnode
search.shift(RR=dinoRates, status.type= "sparse", state=statedino)->SSstate

overfitRR(RR=dinoRates,y=massdino,swap.args =list(si=0.2,si2=0.2),
          shift.args = list(node=rownames(SSnode$single.clades),state=statedino),
          nsim=10,clus=cc)->orr.ss

# Case 2 search.trend on the entire tree
search.trend(RR=RRptero, y=log(massptero),nsim=100,clus=cc,cov=NULL,node=NULL)->STtree

overfitRR(RR=RRptero,y=log(massptero),swap.args =list(si=0.2,si2=0.2),
          trend.args = list(),nsim=10,clus=cc)->orr.st1

# Case 3 search.trend at specified nodescov=NULL,
search.trend(RR=RRptero, y=log(massptero),node=143,clus=cc)->STnode

overfitRR(RR=RRptero,y=log(massptero),
          trend.args = list(node=143),nsim=10,clus=cc)->orr.st2

# Case 4 overfitRR on multiple RRphylo
data("DataCetaceans")
DataCetaceans$treecet->treecet
DataCetaceans$masscet->masscet
DataCetaceans$brainmasscet->brainmasscet
DataCetaceans$aceMyst->aceMyst

ape::drop.tip(treecet,treecet$tip.label[-match(names(brainmasscet),
                                               treecet$tip.label)])->treecet.multi
masscet[match(treecet.multi$tip.label,names(masscet))]->masscet.multi

RRphylo(tree=treecet.multi,y=masscet.multi,clus=cc)->RRmass.multi
RRmass.multi$aces[,1]->acemass.multi
c(acemass.multi,masscet.multi)->x1.mass

RRphylo(tree=treecet.multi,y=brainmasscet,x1=x1.mass,clus=cc)->RRmulti
search.trend(RR=RRmulti, y=brainmasscet,x1=x1.mass,clus=cc)->STcet
overfitRR(RR=RRmulti,y=brainmasscet,trend.args = list(),
          x1=x1.mass,nsim=10,clus=cc)->orr.st3

search.trend(RR=RRmulti, y=brainmasscet,x1=x1.mass,x1.residuals=TRUE,
             clus=cc)->STcet.resi
overfitRR(RR=RRmulti,y=brainmasscet,trend.args = list(x1.residuals=TRUE),
          x1=x1.mass,nsim=10,clus=cc)->orr.st4

# Case 5 searching convergence between clades and within a single state
data("DataFelids")
DataFelids$PCscoresfel->PCscoresfel
DataFelids$treefel->treefel
DataFelids$statefel->statefel

RRphylo(tree=treefel,y=PCscoresfel,clus=cc)->RRfel
search.conv(RR=RRfel, y=PCscoresfel, min.dim=5, min.dist="node9",clus=cc)->SC.clade
as.numeric(c(rownames(SC.clade[[1]])[1],as.numeric(as.character(SC.clade[[1]][1,1]))))->conv.nodes

overfitRR(RR=RRfel, y=PCscoresfel,conv.args =
list(node=conv.nodes,state=statefel,declust=TRUE),nsim=10,clus=cc)->orr.sc

# Case 6 overfitRR on PGLS_fossil
library(phytools)
rtree(100)->tree
fastBM(tree)->resp
fastBM(tree,nsim=3)->resp.multi
fastBM(tree)->pred1
fastBM(tree)->pred2

PGLS_fossil(modform=y1~x1+x2,data=list(y1=resp,x2=pred1,x1=pred2),tree=tree)->pgls_noRR

RRphylo(tree,resp,clus=cc)->RR
PGLS_fossil(modform=y1~x1+x2,data=list(y1=resp,x2=pred1,x1=pred2),tree=tree,RR=RR)->pgls_RR

overfitRR(RR=RR,y=resp,
          pgls.args=list(modform=y1~x1+x2,data=list(y1=resp,x2=pred1,x1=pred2),
                         tree=TRUE,RR=TRUE),nsim=10,clus=cc)->orr.pgls1

PGLS_fossil(modform=y1~x1+x2,data=list(y1=resp.multi,x2=pred1,x1=pred2),tree=tree)->pgls2_noRR

RRphylo(tree,resp.multi,clus=cc)->RR
PGLS_fossil(modform=y1~x1+x2,data=list(y1=resp.multi,x2=pred1,x1=pred2),tree=tree,RR=RR)->pgls2_RR

overfitRR(RR=RR,y=resp.multi,
          pgls.args=list(modform=y1~x1+x2,data=list(y1=resp.multi,x2=pred1,x1=pred2),
                         tree=TRUE,RR=TRUE),nsim=10,clus=cc)->orr.pgls2



## End(Not run)

RRphylo documentation built on June 7, 2023, 5:49 p.m.