LEGIT_cv: Cross-validation for the LEGIT model

Description Usage Arguments Value References Examples

View source: R/LEGIT.R

Description

Uses cross-validation on the LEGIT model. Note that this is not a very fast implementation since it was written in R.

Usage

1
2
3
4
5
LEGIT_cv(data, genes, env, formula, cv_iter = 5, cv_folds = 10,
  folds = NULL, Huber_p = 1.345, classification = FALSE,
  start_genes = NULL, start_env = NULL, eps = 0.001, maxiter = 100,
  family = gaussian, ylim = NULL, seed = NULL, id = NULL,
  crossover = NULL, crossover_fixed = FALSE)

Arguments

data

data.frame of the dataset to be used.

genes

data.frame of the variables inside the genetic score G (can be any sort of variable, doesn't even have to be genetic).

env

data.frame of the variables inside the environmental score E (can be any sort of variable, doesn't even have to be environmental).

formula

Model formula. Use E for the environmental score and G for the genetic score. Do not manually code interactions, write them in the formula instead (ex: G*E*z or G:E:z).

cv_iter

Number of cross-validation iterations (Default = 5).

cv_folds

Number of cross-validation folds (Default = 10). Using cv_folds=NROW(data) will lead to leave-one-out cross-validation.

folds

Optional list of vectors containing the fold number for each observation. Bypass cv_iter and cv_folds. Setting your own folds could be important for certain data types like time series or longitudinal data.

Huber_p

Parameter controlling the Huber cross-validation error (Default = 1.345).

classification

Set to TRUE if you are doing classification (binary outcome).

start_genes

Optional starting points for genetic score (must be the same length as the number of columns of genes).

start_env

Optional starting points for environmental score (must be the same length as the number of columns of env).

eps

Threshold for convergence (.01 for quick batch simulations, .0001 for accurate results).

maxiter

Maximum number of iterations.

family

Outcome distribution and link function (Default = gaussian).

ylim

Optional vector containing the known min and max of the outcome variable. Even if your outcome is known to be in [a,b], if you assume a Gaussian distribution, predict() could return values outside this range. This parameter ensures that this never happens. This is not necessary with a distribution that already assumes the proper range (ex: [0,1] with binomial distribution).

seed

Seed for cross-validation folds.

id

Optional id of observations, can be a vector or data.frame (only used when returning list of possible outliers).

crossover

If not NULL, estimates the crossover point of E using the provided value as starting point (To test for diathesis-stress vs differential susceptibility).

crossover_fixed

If TRUE, instead of estimating the crossover point of E, we force/fix it to the value of "crossover". (Used when creating a diathes-stress model) (Default = FALSE).

Value

If classification = FALSE, returns a list containing, in the following order: a vector of the cross-validated R^2 at each iteration, a vector of the Huber cross-validation error at each iteration, a vector of the L1-norm cross-validation error at each iteration, a matrix of the possible outliers (standardized residuals > 2.5 or < -2.5) and their corresponding standardized residuals and standardized pearson residuals. If classification = TRUE, returns a list containing, in the following order: a vector of the cross-validated R^2 at each iteration, a vector of the Huber cross-validation error at each iteration, a vector of the L1-norm cross-validation error at each iteration, a vector of the AUC at each iteration, a matrix of the best choice of threshold (based on Youden index) and the corresponding specificity and sensitivity at each iteration, and a list of objects of class "roc" (to be able to make roc curve plots) at each iteration. The Huber and L1-norm cross-validation errors are alternatives to the usual cross-validation L2-norm error (which the R^2 is based on) that are more resistant to outliers, the lower the values the better.

References

Denis Heng-Yan Leung. Cross-validation in nonparametric regression with outliers. Annals of Statistics (2005): 2291-2310.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
## Not run: 
train = example_3way(250, 2.5, seed=777)
# Cross-validation 4 times with 5 Folds
cv_5folds = LEGIT_cv(train$data, train$G, train$E, y ~ G*E*z, cv_iter=4, cv_folds=5)
cv_5folds
# Leave-one-out cross-validation (Note: very slow)
cv_loo = LEGIT_cv(train$data, train$G, train$E, y ~ G*E*z, cv_iter=1, cv_folds=250)
cv_loo
# Cross-validation 4 times with 5 Folds (binary outcome)
train_bin = example_2way(500, 2.5, logit=TRUE, seed=777)
cv_5folds_bin = LEGIT_cv(train_bin$data, train_bin$G, train_bin$E, y ~ G*E, 
cv_iter=4, cv_folds=5, classification=TRUE, family=binomial)
cv_5folds_bin
par(mfrow=c(2,2))
pROC::plot.roc(cv_5folds_bin$roc_curve[[1]])
pROC::plot.roc(cv_5folds_bin$roc_curve[[2]])
pROC::plot.roc(cv_5folds_bin$roc_curve[[3]])
pROC::plot.roc(cv_5folds_bin$roc_curve[[4]])

## End(Not run)

LEGIT documentation built on June 24, 2018, 5:01 p.m.