project: Learn a projection method for statistics and apply it

View source: R/project.R

project.character {Infusion}    R Documentation

Learn a projection method for statistics and apply it

Description

project is a generic function with two methods. If the first argument is a parameter name, project.character (alias: get_projector) defines a projection function from several statistics to an output statistic predicting this parameter. project.default (alias: get_projection) produces a vector of projected statistics using such a projection. project is particularly useful to reduce a large number of summary statistics to a vector of projected summary statistics, with as many elements as there are parameters to infer. This dimension reduction can substantially speed up subsequent computations. The concept implemented in project is to fit a parameter to the various statistics available, using machine-learning or mixed-model prediction methods. All such methods can be seen as nonlinear projections onto a one-dimensional space. project.character is an interface that allows different projection methods to be used, provided they return an object of a class for which a predict method with a newdata argument is defined.

deforest_projectors is a utility to reduce the saved size of objects containing ranger objects (reproject can be used to reverse this).
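As a rough illustration of the two-step workflow, here is a hedged sketch (not the package's own example: the parameter names mu and s2, the statistic names mean and var, and the objects simuls and Sobs are made up here; in a real analysis the reference table would come from add_simulation or related functions):

  ## Hand-built toy reference table: parameters 'mu','s2', statistics 'mean','var'
  sim_one <- function(mu, s2) {
    y <- rnorm(30, mean=mu, sd=sqrt(s2))
    c(mean=mean(y), var=var(y))
  }
  set.seed(1)
  pars <- data.frame(mu=runif(300, 2, 5), s2=runif(300, 0.5, 3))
  simuls <- cbind(pars, t(mapply(sim_one, pars$mu, pars$s2)))
  Sobs <- sim_one(mu=4, s2=1)                       # 'observed' summary statistics
  ## Step 1: learn one projector per parameter (project.character):
  mu_proj <- project("mu", stats=c("mean","var"), data=simuls, verbose=FALSE)
  s2_proj <- project("s2", stats=c("mean","var"), data=simuls, verbose=FALSE)
  ## Step 2: apply the learned projections (project.default):
  projSimuls <- project(simuls, projectors=list(MEAN=mu_proj, VAR=s2_proj),
                        is_trainset=TRUE)
  projSobs   <- project(Sobs,   projectors=list(MEAN=mu_proj, VAR=s2_proj))

The projected objects could then be passed on to infer_SLik_joint (see the workflow examples linked in the Note).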

Usage

project(x,...)

## S3 method for building the projection 
## S3 method for class 'character'
project(x, stats, data, 
             trainingsize=  eval(Infusion.getOption("trainingsize")),
             train_cP_size= eval(Infusion.getOption("train_cP_size")), method, 
             methodArgs=eval(Infusion.getOption("proj_methodArgs")), 
             verbose=TRUE, keep_data=TRUE, ...)
get_projector(...) # alias for project.character

## S3 method for applying the projection
## Default S3 method:
project(x, projectors, use_oob=Infusion.getOption("use_oob"), 
                          is_trainset=FALSE, methodArgs=list(), ext_projdata, ...)
get_projection(...) # alias for project.default

##
deforest_projectors(object)

Arguments

x

The name of the parameter to be predicted, or a vector/matrix/list of matrices of summary statistics.

stats

Statistics from which the parameter is to be predicted.

use_oob

Boolean: whether to use out-of-bag predictions for data used in the training set, when such predictions are available (i.e., for random-forest methods). The default, controlled by the same-named package option, is TRUE. This by default involves a costly check, for each row of the input x, of whether it belongs to the training set, so it is better to set use_oob=FALSE if you are sure x does not belong to the training set (for actual data in particular). Alternatively, the check can be bypassed, by setting is_trainset=TRUE, if you are sure that x was used as the training set.

is_trainset

Boolean. In a project call, set it to TRUE if x was used as the training set, to bypass a costly check (see the use_oob argument). The same logic applies in a plot_proj call, except that it is not immediately obvious to users whether the full reference table in an object was used as the training set, so trying to save time by setting is_trainset=TRUE there requires more insight.

data

A list of simulated empirical distributions, as produced by add_simulation, or a data frame with all required variables.

trainingsize, train_cP_size

Integers. For most projection methods (excluding "REML" but including "ranger"), only trainingsize is taken into account: it gives the maximum size of the training set (infinite by default for the "ranger" method). If the data have more rows, the training set is randomly sampled from them. For the "REML" method, train_cP_size is the maximum size of the data used to estimate the smoothing parameters, and trainingsize is the maximum size of the data from which the predictor is built given the smoothing parameters.

method

character string: "REML", "GCV", or the name of a suitable projection function. The latter may be defined in another package, e.g. "ranger" or "randomForest", or predefined by Infusion, or defined by the user. See Details for predefined functions and for defining new ones. The default method is "ranger" if this package is installed, and "REML" otherwise. Defaults may change in later versions, so it is advised to provide an explicit method to improve reproducibility.

methodArgs

A list of arguments for the projection method. For project.character, the ranger method is run with some default arguments if no methodArgs are specified. Beware that a NULL methodArgs$splitrule is interpreted as the "extratrees" splitrule, whereas in a direct call to ranger this would be interpreted as the "variance" splitrule. For project.default, the only methodArgs element handled is num.threads, passed to predict.ranger (which can also be controlled globally by Infusion.options(nb_cores=.)); see the sketch at the end of this entry.

For other methods, project kindly tries to assign values to the required arguments if they are absent from methodArgs, according to the following rules:

If "REML" or "GCV" methods are used (in which case methodArgs is completely ignored); or

if the projection method uses formula and data arguments (in particular if the formula is of the form response ~ var1 + var2 + ...; otherwise the formula should be provided through methodArgs). This works for example for methods based on nnet; or

if the projection method uses x and y arguments. This works for example for the (somewhat obsolete) method randomForest (though not with the generic function method="randomForest", but only with the internal function method="randomForest:::randomForest.default").
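For illustration, a hedged sketch (argument values are arbitrary; 'simuls' and the variable names are the hypothetical ones from the sketch in the Description):

  ## Passing explicit arguments to the "ranger" method (num.trees, splitrule are ranger() arguments):
  mu_proj <- project("mu", stats=c("mean","var"), data=simuls, method="ranger",
                     methodArgs=list(num.trees=1000, splitrule="variance"),
                     verbose=FALSE)
  ## In project.default, only 'num.threads' (passed to predict.ranger) is handled:
  projS <- project(simuls, projectors=list(MEAN=mu_proj),
                   is_trainset=TRUE, methodArgs=list(num.threads=2))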

projectors

A list with elements of two possible forms: (1) <name>=<project result>, where the <name> must differ from any name of x and <project result> is the return object of a project call; or (2) <name>=NULL where <name> is the name of a variable (raw summary statistic) in x (such explicit NULLs are needed for any raw statistic to be retained in the projected data; see Value).
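For illustration, a hedged sketch of the two element forms (names as in the hypothetical sketch in the Description):

  projectors <- list(MEAN = mu_proj,  # (1) a 'project' result, stored under a new name
                     var  = NULL)     # (2) retain the raw statistic "var" from x unchanged
  projx <- project(simuls, projectors=projectors, is_trainset=TRUE)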

verbose

Whether to print some information or not. In particular, if TRUE, true-vs.-predicted diagnostic plots will be drawn for projection methods “known” by Infusion (notably "ranger", "fastai.tabular.learner.TabularLearner", "keras::keras.engine.training.Model", "randomForest", "GCV", and caret::train).

keep_data, ext_projdata

(experimental, and only when ranger is used). Setting keep_data=FALSE allows the input data to be removed from the return object of project.character (where they are otherwise part of its call element). This may be useful to save memory when multiple projections are based on the same data. However, as this data information is sometimes used, it must then be manually added as element projdata to the return value of infer_SLik_joint, and provided to project.default calls through the ext_projdata argument.
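A hedged sketch, continuing the hypothetical sketch from the Description (see the workflow examples for authoritative usage):

  ## Learn projectors without storing a copy of 'simuls' in each of them:
  mu_proj <- project("mu", stats=c("mean","var"), data=simuls, keep_data=FALSE, verbose=FALSE)
  s2_proj <- project("s2", stats=c("mean","var"), data=simuls, keep_data=FALSE, verbose=FALSE)
  ## Supply the training data explicitly when applying the projections:
  projSimuls <- project(simuls, projectors=list(MEAN=mu_proj, VAR=s2_proj),
                        is_trainset=TRUE, ext_projdata=simuls)
  ## As noted above, the return value of a later infer_SLik_joint() call would then
  ## need 'simuls' added as its 'projdata' element.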

object

An object of class SLik_j.

...

Further arguments passed to or from other functions. Currently, they are passed to plot.

Details

The preferred project method is non-parametric regression by (variants of) the random forest method as implemented in ranger. It is the default method, if that package is installed. Alternative methods have been interfaced as detailed below, but the functionality of most interfaces is infrequently tested.

By default, the ranger call through project will use the split rule "extratrees", with some other controls also differing from the ranger package defaults. If the split rule "variance" is used, the default value of mtry used in the call is also distinct from the ranger default, but consistent with Breiman 2001 for regression tasks.

Machine-learning methods such as random forests overfit, except if out-of-bag predictions are used. When they are not, the bias is manifest in the fact that using the same simulation table for learning the projectors and for the other steps of the analysis tends to lead to too-narrow confidence regions. This bias disappears over iterations of refine when the projectors are kept constant. Infusion avoids this bias by using out-of-bag predictions, when relevant, when ranger or randomForest is used. But it provides no code handling this problem for other machine-learning methods. Users should then cope with it themselves, and at a minimum should not update the projectors in every iteration (the “Gentle Introduction to Infusion” may contain further information about this problem).
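For instance, a hedged sketch (it assumes that 'slik' is a fitted SLik_j object and that refine accepts an update_projectors argument; check the refine documentation of your installed version):

  ## Keep the projectors fixed across refinement iterations:
  slik <- refine(slik, update_projectors=FALSE)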

Prediction can be based on a linear mixed model (LMM) with autocorrelated random effects, internally calling the fitme function with the formula <parameter> ~ 1 + Matern(1|<stat1>+...+<statn>). This approach can in principle produce arbitrarily complex predictors (given sufficient input) and avoids overfitting in the same way as restricted likelihood methods avoid overfitting in LMMs. REML methods are then used by default to estimate the smoothing parameters. However, faster methods are generally required.

To keep REML computation reasonably fast, the train_cP_size and trainingsize arguments determine respectively the size of the subset used to estimate the smoothing parameters and the size of the subset defining the predictor given the smoothing parameters. REML fitting is already slow for data sets of this size (particularly as the number of predictor variables increases).

If method="GCV", a generalized cross-validation procedure (Golub et al. 1979) is used to estimate the smoothing parameters. This is faster but still slow, so a random subset of size trainingsize is still used to estimate the smoothing parameters and generate the predictor.

Alternatively, various machine-learning methods can be used (see e.g. Hastie et al., 2009, for an introduction). A random subset of size trainingsize is again used, with a larger default value reflecting the assumption that these methods are faster. Predefined methods include

  • "ranger", the default, a computationally efficient implementation of random forest;

  • "randomForest", the older default, probably obsolete now;

  • "neuralNet", a neural network method, using the train function from the caret package (probably obsolete too);

  • "fastai" deep learning using the fastai package;

  • "keras" deep learning using the keras package.

The last two interfaces may yet offer limited or undocumented control: using deep learning seems attractive, but the benefits over "ranger" are not clear (notably, the latter provides out-of-bag predictions that avoid overfitting).

In principle, any object suitable for prediction could be used as one of the projectors, and Infusion implements their usage so that, in principle, unforeseen projectors could be used. That is, if predictions of a parameter can be performed using an object MyProjector of class MyProjectorClass, MyProjector could be used in place of a project result provided that predict.MyProjectorClass(object, newdata, ...) is defined. However, there is no guarantee that this will work for unforeseen projection methods, as each method tends to have some syntactic idiosyncrasies. For example, if the learning method that generated the projector used a formula-data syntax, then its predict method is likely to require column names for its newdata, which need to be provided through attr(MyProjector, "stats") (these names cannot be assumed to be present in the newdata when predict is called through optim).
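For illustration, a hedged and deliberately trivial sketch (the class, its predict method, and the coefficients are entirely made up):

  ## Any object with a suitable predict() method could in principle serve as a projector:
  predict.MyProjectorClass <- function(object, newdata, ...) {
    if (is.null(dim(newdata))) newdata <- t(as.matrix(newdata))  # a single vector of statistics
    drop(as.matrix(newdata)[, c("mean","var"), drop=FALSE] %*% object$coefs)
  }
  MyProjector <- structure(list(coefs=c(0.7, 0.3)), class="MyProjectorClass")
  attr(MyProjector, "stats") <- c("mean","var")   # statistic names expected in newdata
  ## 'MyProjector' could then be supplied as an element of the 'projectors' list,
  ## though, as stated above, this is not guaranteed to work for arbitrary classes.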

Value

project.character returns an object of the class returned by the called method (by default, a ranger object for the up-to-date workflow).

project.default returns an object of the same class and structure as the input x, containing the variables named in the projectors argument, each such variable being either a projected statistic inferred from the input summary statistics, or a summary statistic copied from the input x (if an explicit NULL projector was included for this statistic in the projectors argument).

deforest_projectors is used for its side effect (the contents of an environment within the input object being modified), and returns a character string emphasizing this.
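For illustration, a hedged sketch (here 'slik' stands for an SLik_j object obtained earlier from infer_SLik_joint and MSL; that reproject() returns the restored object is an assumption based on the Description):

  deforest_projectors(slik)             # strips the stored ranger forests, by side effect
  saveRDS(slik, file="slik_light.rds")  # much smaller file
  ## ... later, after readRDS():
  slik <- reproject(slik)               # rebuild the projections before further use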

Note

See workflow examples in example_reftable and example_raw_proj.

References

Breiman, L. (2001). Random forests. Mach Learn, 45:5-32. <doi:10.1023/A:1010933404324>

Golub, G. H., Heath, M. and Wahba, G. (1979) Generalized Cross-Validation as a method for choosing a good ridge parameter. Technometrics 21: 215-223.

Hastie, T., Tibshirani, R. and Friedman, J. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd edition. Springer, New York.

See Also

plot_proj and plot_importance for diagnostic plots of the projections.

Examples

  ## see Note for links to examples.
