lm_filter: Linear model filter

View source: R/filters.R

lm_filterR Documentation

Linear model filter

Description

Linear models are fitted on each predictor, with inclusion of variable names listed in force_vars in the model. Predictors are ranked by Akaike information criteria (AIC) value, or can be filtered by the p-value on the estimate of the coefficient for that predictor in its model.

Usage

lm_filter(
  y,
  x,
  force_vars = NULL,
  nfilter = NULL,
  p_cutoff = 0.05,
  rsq_cutoff = NULL,
  rsq_method = "pearson",
  type = c("index", "names", "full"),
  keep_factors = TRUE,
  method = 0L,
  mc.cores = 1
)

Arguments

y

Numeric or integer response vector

x

Matrix of predictors. If x is a data.frame it will be turned into a matrix. But note that factors will be reduced to numeric values, but a full design matrix is not generated, so if factors have 3 or more levels, it is recommended to convert x into a design (model) matrix first.

force_vars

Vector of column names x which are incorporated into the linear model.

nfilter

Number of predictors to return. If NULL all predictors with p-values < p_cutoff are returned.

p_cutoff

p-value cut-off. P-values are calculated by t-statistic on the estimated coefficient for the predictor being tested.

rsq_cutoff

r^2 cutoff for removing predictors due to collinearity. Default NULL means no collinearity filtering. Predictors are ranked based on AIC from a linear model. If 2 or more predictors are collinear, the first ranked predictor by AIC is retained, while the other collinear predictors are removed. See collinear().

rsq_method

character string indicating which correlation coefficient is to be computed. One of "pearson" (default), "kendall", or "spearman". See collinear().

type

Type of vector returned. Default "index" returns indices, "names" returns predictor names, "full" returns a matrix of p values.

keep_factors

Logical affecting factors with 3 or more levels. Dataframes are coerced to a matrix using data.matrix. Binary factors are converted to numeric values 0/1 and analysed as such. If keep_factors is TRUE (the default), factors with 3 or more levels are not filtered and are retained. If keep_factors is FALSE, they are removed.

method

Integer determining linear model method. See RcppEigen::fastLmPure()

mc.cores

Number of cores for parallelisation using parallel::mclapply().

Details

This filter is based on the model y ~ xvar + force_vars where y is the response vector, xvar are variables in columns taken sequentially from x and force_vars are optional covariates extracted from x. It uses RcppEigen::fastLmPure() with method = 0 as default since it is rank-revealing. method = 3 is significantly faster but can give errors in estimation of p-value with variables of zero variance. The algorithm attempts to detect these and set their stats to NA. NA in x are not tolerated.

Parallelisation is available via mclapply(). This is provided mainly for the use case of the filter being used as standalone. Nesting parallelisation inside of parallelised nestcv.glmnet() or nestcv.train() loops is not recommended.

Value

Integer vector of indices of filtered parameters (type = "index") or character vector of names (type = "names") of filtered parameters in order of linear model AIC. Any variables in force_vars which are incorporated into all models are listed first. If type = "full" a matrix of AIC value, sigma (residual standard error, see summary.lm), coefficient, t-statistic and p-value for each tested predictor is returned.


nestedcv documentation built on Oct. 26, 2023, 5:08 p.m.