GridSearch: Computes AIC for a given model or models on a fixed set of...


View source: R/phrapl-search.R

Description

This function takes one or more models and a grid of parameter values and returns the AIC at each grid point. In testing, the grid approach was found to work better than traditional optimization approaches, although the function can still run a heuristic search.

Usage

GridSearch(modelRange = c(1:length(migrationArray)), migrationArrayMap=NULL, 
migrationArray, popAssignments, badAIC = 1e+14, 
maxParameterValue = 100, nTrees = 1e5, msPath = "ms", 
comparePath = system.file("extdata","comparecladespipe.pl", package = "phrapl"),
observedTrees, subsampleWeights.df = NULL, doWeights = TRUE,
unresolvedTest = TRUE, print.ms.string = FALSE, print.results = FALSE, 
print.matches = FALSE, debug = FALSE, method = "nlminb", itnmax = NULL, 
ncores = 1, results.file = NULL, maxtime = 0, maxeval = 0, return.all = TRUE, 
numReps = 0, startGrid = NULL, 
collapseStarts = c(0.3, 0.58, 1.11, 2.12, 4.07, 7.81, 15), 
n0Starts = c(0.1, 0.5, 1, 2, 4), 
migrationStarts = c(0.1, 0.22, 0.46, 1, 2.15), 
subsamplesPerGene = 1, totalPopVector = NULL, summaryFn = "mean", 
saveNoExtrap = FALSE, doSNPs = FALSE, nEq=100, setCollapseZero=NULL,
dAIC.cutoff=2, rm.n0 = TRUE, popScaling = NULL, checkpointFile = NULL, ...)

Arguments

modelRange

Integer vector specifying which models to examine. Leave unspecified to use the default of all models.

migrationArrayMap

A data.frame containing information about all the models. Only required for heuristic search, not for grid search.

migrationArray

List containing all the models.

popAssignments

A list of vectors (typically only one vector will be specified) that define the number of individuals per population included in the observed tree file (usually these will be subsampled trees). Defining popAssignments as list(c(4,4,4)), for example, means that there are 12 tips per observed tree, with 4 tips per population.
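As a quick check of how popAssignments relates to tree size, the following base R sketch (not phrapl internals) computes the number of tips per tree implied by each assignment vector.

```r
# Each vector gives the number of sampled individuals per population.
popAssignments <- list(c(4, 4, 4))

# Total tips expected in each observed (subsampled) tree, one value per vector.
tipsPerTree <- vapply(popAssignments, sum, numeric(1))
tipsPerTree
```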

badAIC

In case of failure (such as trying a parameter value outside a bound), this value is returned so that the result is suboptimal but still finite. Mostly used for heuristic searches.

maxParameterValue

A bound for the maximum value for any parameter.

nTrees

Integer: the number of trees to simulate in ms.

msPath

Path to the local installation of ms; typing this string on the command line should result in ms running.

comparePath

Path to the local placement of the compareCladesPipe.pl perl script, including that script name.

observedTrees

A multiPhylo object containing the empirical trees.

subsampleWeights.df

A data frame of the weights for each subsample. If this is NULL, it is computed within GridSearch.

doWeights

If no subsampleWeights.df object is supplied and doWeights = TRUE, subsample weights will be calculated for each tree prior to AIC calculation.

unresolvedTest

Boolean: deal with unresolved gene trees by looking for partial matches and correcting for that.

print.ms.string

Mostly for debugging, Boolean on whether to verbosely print out the calls to ms.

print.results

Mostly for debugging, Boolean on whether to verbosely print out the results.

print.matches

Mostly for debugging, Boolean on whether to verbosely print out the matches.

debug

Whether to print out additional debugging information.

method

For heuristic searches, which optimization method to use. See ?optim for more information.

itnmax

For heuristic searches, how many steps.

ncores

Allows running on multiple cores. Not implemented yet.

results.file

File name for storage of results.

maxtime

Maximum run time for heuristic search.

maxeval

Maximum number of function evaluations to run for heuristic search.

return.all

Boolean: return just the AIC scores or additional information.

numReps

For heuristic searches, number of starting points to try.

startGrid

Starting grid of parameters to try. Leave NULL to let program create this.

collapseStarts

Vector of starting values for collapse parameters.

n0Starts

Vector of starting values for n0.

migrationStarts

Vector of starting values for migration rates.
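To give a feel for how the starting-value vectors determine grid size, here is a minimal sketch that assumes the grid is the full cross of the starting values for each free parameter. Whether GridSearch builds its grid exactly this way is an assumption, and the column names below are illustrative rather than taken from phrapl.

```r
# Default starting values copied from the Usage section.
collapseStarts  <- c(0.3, 0.58, 1.11, 2.12, 4.07, 7.81, 15)
migrationStarts <- c(0.1, 0.22, 0.46, 1, 2.15)

# Hypothetical model with one collapse and one migration parameter;
# the column names are illustrative, not taken from phrapl.
startGridSketch <- expand.grid(collapse_1  = collapseStarts,
                               migration_1 = migrationStarts)

# Number of candidate parameter combinations: 7 * 5.
nrow(startGridSketch)
```

Under this assumption, each additional free parameter multiplies the number of grid points, so shortening a starting-value vector is the most direct way to reduce run time.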

subsamplesPerGene

How many subsamples to take per gene.

totalPopVector

Overall number of samples in each population before subsampling.

summaryFn

Way to summarize results across subsamples.

saveNoExtrap

Boolean to tell whether to save extrapolated values. FALSE by default.

doSNPs

Boolean to tell whether to use the SNPs model: count a single matching edge on the simulated tree as a full match. FALSE by default.

nEq

If no simulated trees match an observed tree, the frequentist estimate of the matching proportion is exactly zero (flip a coin 5 times, see no heads, and the estimated probability of heads is zero). This would have an extreme effect: if any gene tree fails to match, the model has zero likelihood. A better approach is to recognize that a finite set of samples gives finite information. We assume that the probability of a match, absent data, is 1/(number of possible gene trees), and this is combined with the empirical estimate to give an estimate of the likelihood. How much weight to put on this pre-existing estimate is set by nEq. With its default of 100, very little weight is placed on it: it carries the same information as nTrees = 100, and since the actual nTrees is 10000 or more, it has very little impact.
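The weighting described for nEq can be illustrated with a small sketch. The formula below treats the prior as if it were nEq extra simulated trees; this is an assumed form that is consistent with the description, not necessarily the exact expression used inside phrapl. The helper name BlendMatchProbability and the figure of one million possible gene trees are hypothetical.

```r
# Hypothetical helper, not part of phrapl: blend the empirical match
# proportion with a prior guess, giving the prior the weight of nEq trees.
BlendMatchProbability <- function(nMatches, nTrees, priorProb, nEq = 100) {
  (nMatches + nEq * priorProb) / (nTrees + nEq)
}

# Assumed for illustration: one million possible gene trees.
priorProb <- 1 / 1e6

# No simulated tree matched, yet the estimate stays above zero.
pZero <- BlendMatchProbability(nMatches = 0, nTrees = 1e5, priorProb = priorProb)

# With many matches, the prior barely moves the empirical proportion.
pMany <- BlendMatchProbability(nMatches = 2000, nTrees = 1e5, priorProb = priorProb)
c(empirical = 2000 / 1e5, blended = pMany)
```

Setting nEq = 0 in this sketch recovers the plain empirical proportion, which is the case where a single unmatched gene tree drives the likelihood to zero.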

setCollapseZero

A vector of collapse parameters that will be set to zero (e.g., c(1,2) will set both the first and second collapse parameters to zero). K will be adjusted automatically to account for the specified fixed parameters.

dAIC.cutoff

A value specifying how optimized parameter values should be selected. Parameter estimates are calculated by taking the mean parameter value across all values within an AIC distance of dAIC.cutoff relative to the lowest AIC value. The default is 2 AIC points.
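The averaging rule described for dAIC.cutoff can be sketched in base R as follows. The grid values are invented toy numbers and the column names are illustrative; this is not the code phrapl runs internally.

```r
# Toy grid results; values are invented for illustration only.
gridResults <- data.frame(
  AIC         = c(105.2, 101.0, 102.5, 110.8),
  collapse_1  = c(0.30, 0.58, 1.11, 2.12),
  migration_1 = c(0.10, 0.22, 0.46, 1.00)
)
dAIC.cutoff <- 2

# Keep grid points whose AIC is within dAIC.cutoff of the best (lowest) AIC.
dAIC <- gridResults$AIC - min(gridResults$AIC)
withinCutoff <- dAIC <= dAIC.cutoff

# Average each parameter over the retained grid points.
paramCols <- setdiff(names(gridResults), "AIC")
estimates <- colMeans(gridResults[withinCutoff, paramCols, drop = FALSE])
estimates
```

Here only the second and third grid points fall within 2 AIC units of the minimum, so each estimate is the mean of those two rows.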

rm.n0

A boolean indicating whether n0multiplier parameter estimates should be outputted with the other estimates.

popScaling

A vector whose length is equal to the number of loci analyzed that gives the relative scaling of effective population size for each locus (e.g., diploid nuclear locus = 1, X-linked locus = 0.75, mtDNA locus = 0.25). Default is equal scaling for all loci.
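One way to assemble such a vector is to look up a scaling value for each locus by type. The sketch below uses the example values given above; the four-locus data set and the type labels are hypothetical.

```r
# Relative effective population size by locus type, as listed above.
scalingByType <- c(nuclear = 1, xlinked = 0.75, mtDNA = 0.25)

# Hypothetical data set of four loci; the type labels are illustrative.
locusTypes <- c("nuclear", "nuclear", "xlinked", "mtDNA")

# One scaling value per locus, in locus order.
popScaling <- unname(scalingByType[locusTypes])
popScaling
```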

checkpointFile

Results can be printed to a file which acts as a checkpoint. If a job is stopped during a GridSearch, rather than starting the analysis over, if a file was previously specified using this argument, the search will resume at the most recent iteration printed to the file. Currently, this argument can only be used when running a single model in GridSearch. However, if GridSearch is taking a long time when running multiple models at once, one can save results more often by reducing the number of models run at a time. Thus, the checkpointFile argument is only really necessary when you've got a single model running that can't be broken up any further.

...

Other items to pass to heuristic search functions.

Details

We recommend using the grid, not the heuristic search. If you are using the SNPs model, you should have uniform weights for the gene trees unless there is some reason you want to weight them differently.

Value

If return.all==FALSE, just a vector of AIC values. Otherwise, if using grid search, a list containing the parameter values used for the grid and the AIC for each.

Note

For more information, please see the user manual.

Author(s)

Brian O'Meara & Nathan Jackson


phrapl documentation built on May 2, 2019, 4:52 p.m.