In nestedcv: Nested Cross-Validation with 'glmnet' and 'caret'

stat_filter

R Documentation

Univariate filter for binary classification with mixed predictor datatypes

Description

Univariate statistic filter for dataframes of predictors with mixed numeric and categorical datatypes. Different statistical tests are used depending on the data types of the response vector and the predictors:

Binary class response: bin_stat_filter(): t-test for continuous data, chi-squared test for categorical data
Multiclass response: class_stat_filter(): one-way ANOVA for continuous data, chi-squared test for categorical data
Continuous response: cor_stat_filter(): correlation (or linear regression) for continuous data and binary data, one-way ANOVA for categorical data
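
To make the binary-response case concrete, the following is a minimal base R sketch of running a t-test on a continuous predictor and a chi-squared test on a categorical predictor, then ranking by p-value. It is an illustration only and is not nestedcv's internal code; the simulated data and the variable names (`age`, `smoker`, `pvals`) are assumptions made for the example.

```r
# Illustrative only: base R versions of the per-predictor tests named above.
# This is not nestedcv's internal code; data and names are made up.
set.seed(1)
n <- 100
y <- factor(rep(c("case", "control"), each = n / 2))   # binary response
x <- data.frame(
  age    = rnorm(n, mean = ifelse(y == "case", 55, 50), sd = 8),  # continuous
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE))      # categorical
)

# One p-value per predictor: t-test if numeric, chi-squared test otherwise
pvals <- vapply(x, function(col) {
  if (is.numeric(col)) {
    t.test(col ~ y)$p.value
  } else {
    chisq.test(table(col, y))$p.value
  }
}, numeric(1))

# Rank predictors by p-value, smallest first
sort(pvals)
```

The filter functions documented here perform this kind of per-column testing and return the result in the form selected by `type`.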

Usage

stat_filter(y, x, ...)

bin_stat_filter(
  y,
  x,
  force_vars = NULL,
  nfilter = NULL,
  p_cutoff = 0.05,
  rsq_cutoff = NULL,
  type = c("index", "names", "full", "list"),
  ...
)

class_stat_filter(
  y,
  x,
  force_vars = NULL,
  nfilter = NULL,
  p_cutoff = 0.05,
  rsq_cutoff = NULL,
  type = c("index", "names", "full", "list"),
  ...
)

cor_stat_filter(
  y,
  x,
  cor_method = c("pearson", "spearman", "lm"),
  force_vars = NULL,
  nfilter = NULL,
  p_cutoff = 0.05,
  rsq_cutoff = NULL,
  rsq_method = "pearson",
  type = c("index", "names", "full", "list"),
  ...
)

Arguments

`y`	Response vector
`x`	Matrix or dataframe of predictors
`...`	Optional arguments, e.g. `rsq_method`: see `collinear()`.
`force_vars`	Vector of column names within `x` which are always retained in the model (i.e. not filtered). Default `NULL` means no predictors are forced in, so all predictors are subject to filtering.
`nfilter`	Number of predictors to return. If `NULL` all predictors with p-values < `p_cutoff` are returned.
`p_cutoff`	p-value cut-off. When `nfilter` is `NULL`, all predictors with p-values below this threshold are returned.
`rsq_cutoff`	r^2 cutoff for removing predictors due to collinearity. Default `NULL` means no collinearity filtering. Predictors are ranked based on t-test. If 2 or more predictors are collinear, the first ranked predictor by t-test is retained, while the other collinear predictors are removed. See `collinear()`.
`type`	Type of output returned. Default "index" returns indices, "names" returns predictor names, "full" returns a dataframe of statistics, "list" returns a list of 2 matrices of statistics, one for continuous predictors, one for categorical predictors.
`cor_method`	For `cor_stat_filter()` only, either `"pearson"`, `"spearman"` or `"lm"` controlling whether continuous predictors are filtered by correlation (faster) or regression (slower but allows inclusion of covariates via `force_vars`).
`rsq_method`	Character string indicating which correlation coefficient is to be computed. One of "pearson" (default), "kendall", or "spearman". See `collinear()`.
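
The collinearity rule described for `rsq_cutoff` (keep the best-ranked predictor, drop lower-ranked predictors that are collinear with it) can be sketched in base R as below. `drop_collinear()` is a hypothetical helper written for this illustration. It is not the package's `collinear()` function, and it assumes the columns of `x` are numeric and already ordered from best to worst rank.

```r
# Hypothetical helper, not the package's collinear() implementation.
# Assumes `x` is a numeric matrix whose columns are already ordered best-first.
drop_collinear <- function(x, rsq_cutoff = 0.9, rsq_method = "pearson") {
  rsq <- cor(x, method = rsq_method)^2   # pairwise r^2 between predictors
  keep <- integer(0)
  for (j in seq_len(ncol(x))) {
    # Retain column j only if it is not collinear with any
    # higher-ranked column that has already been kept
    if (all(rsq[j, keep] <= rsq_cutoff)) keep <- c(keep, j)
  }
  keep   # indices of retained columns
}

set.seed(42)
a <- rnorm(50)
x <- cbind(a = a, b = a + rnorm(50, sd = 0.01), c = rnorm(50))
kept <- drop_collinear(x, rsq_cutoff = 0.9)
colnames(x)[kept]   # "b" is a near-copy of "a", so it is dropped
```

Note that this sketch returns the indices of the columns to keep; check the `collinear()` help page for what that function actually returns.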

Details

stat_filter() is a wrapper which calls bin_stat_filter(), class_stat_filter() or cor_stat_filter() depending on whether y is binary, multiclass or continuous respectively. Ordered factors are converted to numeric (integer) levels and analysed as if continuous.
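
The dispatch described above might look roughly like the following sketch. `response_type()` is a hypothetical function, and the rules it uses to tell binary, multiclass and continuous responses apart are assumptions made for illustration; the checks that stat_filter() actually performs may differ.

```r
# Hypothetical illustration of dispatching on the response type.
# The rules below are assumptions, not nestedcv's actual checks.
response_type <- function(y) {
  # Ordered factors are converted to integer levels and treated as continuous
  if (is.ordered(y)) y <- as.integer(y)
  if (is.factor(y) || is.character(y) || is.logical(y)) {
    n_classes <- length(unique(y[!is.na(y)]))
    if (n_classes == 2) "binary" else "multiclass"
  } else if (is.numeric(y)) {
    "continuous"
  } else {
    stop("unsupported response type")
  }
}

response_type(factor(c("a", "b", "a")))                     # two classes
response_type(iris$Species)                                 # three classes
response_type(c(1.2, 3.4, 5.6))                             # numeric
response_type(factor(c("lo", "mid", "hi"), ordered = TRUE)) # ordered factor
```

Under these assumptions the four calls map to bin_stat_filter(), class_stat_filter(), cor_stat_filter() and cor_stat_filter() respectively.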

Value

Integer vector of indices (type = "index") or character vector of names (type = "names") of the filtered predictors, in order of test p-value. If type is "full", a dataframe of statistical results is returned. If type is "list", the results are returned as a list of 2 matrices, one for continuous predictors and one for categorical predictors.

Examples

library(nestedcv)
library(mlbench)
data(BostonHousing2)
dat <- BostonHousing2
y <- dat$cmedv  ## continuous outcome
x <- subset(dat, select = -c(cmedv, medv, town))

stat_filter(y, x, type = "full")                ## dataframe of statistics
stat_filter(y, x, nfilter = 5, type = "names")  ## names of the top 5 predictors
stat_filter(y, x)                               ## indices of predictors with p < 0.05

data(iris)
y <- iris$Species  ## 3 class outcome
x <- subset(iris, select = -Species)
stat_filter(y, x, type = "full")
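
The examples above cover continuous and 3-class outcomes. A binary outcome can be tried in the same way; the sketch below builds a two-class subset of `iris`, which is a choice made here for illustration and is not part of the original examples. It assumes the nestedcv package is installed.

```r
library(nestedcv)

# Two-class subset of iris: versicolor vs virginica
data(iris)
dat <- droplevels(subset(iris, Species != "setosa"))
y <- dat$Species  ## binary outcome
x <- subset(dat, select = -Species)

# Call the binary filter directly and inspect the statistics
bin_stat_filter(y, x, type = "full")

# The wrapper should dispatch to the same filter for a 2-level factor;
# ask for the names of up to 2 top-ranked predictors
top <- stat_filter(y, x, nfilter = 2, type = "names")
top
```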

nestedcv documentation built on April 4, 2025, 2:21 a.m.