nzv: Near-zero variance

View source: R/24_NZV.R

nzv    R Documentation

Near-zero variance

Description

The nzv procedure identifies risk factors with low variability (near constants). These risk factors are usually reviewed by an expert, who then decides whether to exclude them from the further modeling process.
The nzv output report includes the following metrics:

  • rf: Risk factor name.

  • rf.type: Risk factor class. This metric is always one of: numeric or categorical.

  • sc.num: Number of special cases.

  • sc.pct: Percentage of special cases in total number of observations.

  • cc.num: Number of complete cases.

  • cc.pct: Percentage of complete cases in total number of observations. The sum of this value and sc.pct equals 1.

  • cc.unv: Number of unique values in complete cases.

  • cc.unv.pct: Percentage of unique values in total number of complete cases.

  • cc.lbl.1: The most frequent value in complete cases.

  • cc.frq.1: Number of occurrences of the most frequent value in complete cases.

  • cc.lbl.2: The second most frequent value in complete cases.

  • cc.frq.2: Number of occurrences of the second most frequent value in complete cases.

  • cc.fqr: Frequency ratio, i.e. the number of occurrences of the most frequent value divided by the number of occurrences of the second most frequent value in complete cases.

  • ind: Indicator that takes the value 1 if the percentage of complete cases is less than 10% and the frequency ratio (cc.fqr) is greater than 19. This value can be used to filter risk factors that need further expert investigation, but users are also encouraged to derive their own indicators from the reported metrics.
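
To make the metric definitions above concrete, the following base R sketch recomputes several of them for a single risk factor. It is an illustration only and not the implementation used by nzv: the helper name nzv_sketch is hypothetical, special cases are assumed to be matched by value with %in%, and the thresholds in the user-defined indicator at the end are chosen purely for illustration.

```r
# Hypothetical helper (not part of PDtoolkit): recomputes some of the
# reported metrics for a single risk factor x.
nzv_sketch <- function(x, sc = c(NA, NaN, Inf, -Inf)) {
  is.sc <- x %in% sc                         # assumed: special cases matched by value
  cc <- x[!is.sc]                            # complete cases
  frq <- sort(table(cc), decreasing = TRUE)  # value frequencies, most frequent first
  cc.frq.1 <- as.numeric(frq[1])
  cc.frq.2 <- if (length(frq) > 1) as.numeric(frq[2]) else NA_real_
  data.frame(
    sc.num = sum(is.sc),
    sc.pct = mean(is.sc),
    cc.num = length(cc),
    cc.pct = mean(!is.sc),
    cc.unv = length(frq),
    cc.unv.pct = length(frq) / length(cc),
    cc.lbl.1 = names(frq)[1],
    cc.frq.1 = cc.frq.1,
    cc.lbl.2 = if (length(frq) > 1) names(frq)[2] else NA_character_,
    cc.frq.2 = cc.frq.2,
    cc.fqr = cc.frq.1 / cc.frq.2,
    stringsAsFactors = FALSE
  )
}

# Toy risk factor: 96 zeros, two ones, one two and one missing value.
x <- c(rep(0, 96), 1, 1, 2, NA)
m <- nzv_sketch(x)
m

# A user-defined indicator derived from the metrics; thresholds are illustrative.
my.ind <- as.integer(m$cc.unv.pct < 0.10 & m$cc.fqr > 19)
my.ind
```

For the toy vector, the most frequent value occurs 96 times and the second most frequent twice, so the frequency ratio is 96 / 2 = 48.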

Usage

nzv(db, sc = c(NA, NaN, Inf, -Inf))

Arguments

db

Data frame of risk factors supplied for near-zero variance analysis.

sc

Numeric or character vector with special case elements. Default values are c(NA, NaN, Inf, -Inf).

Value

The command nzv returns a data frame with the metrics needed to identify near-zero variance risk factors. For details, see the Description section.

Examples

suppressMessages(library(PDtoolkit))
data(loans)
# artificially add some special cases (NA) to one risk factor
loans$"Account Balance"[1:10] <- NA
rf.s <- nzv(db = loans, sc = c(NA, NaN, Inf, -Inf))
rf.s
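
Once the report is available, the ind column can be used to select the risk factors that need expert review. The sketch below shows one way to do this. So that it runs without the package, it uses a toy report with made-up values (only the column names follow the nzv output); the same subsetting would apply to a report such as rf.s.

```r
# Toy report with made-up values; only the column names follow the nzv output.
report <- data.frame(
  rf = c("rf.a", "rf.b", "rf.c"),
  cc.unv.pct = c(0.02, 0.45, 0.01),
  cc.fqr = c(48.0, 1.3, 25.5),
  ind = c(1, 0, 1),
  stringsAsFactors = FALSE
)

# Keep only the flagged risk factors, largest frequency ratio first.
flagged <- report[report$ind == 1, ]
flagged <- flagged[order(flagged$cc.fqr, decreasing = TRUE), ]
flagged$rf
```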

PDtoolkit documentation built on Sept. 20, 2023, 9:06 a.m.