univariate: Univariate analysis

View source: R/01_UNIVARIATE_ANALYSIS.R

univariate    R Documentation

Univariate analysis

Description

univariate returns univariate statistics for the risk factors supplied in the data frame db.
For numeric risk factors, the univariate report includes:

  • rf: Risk factor name.

  • rf.type: Risk factor class. This metric is always equal to numeric.

  • bin.type: Bin type - special or complete cases.

  • bin: Bin. If the sc.method argument is equal to "together", then bin and bin.type have the same value. If the sc.method argument is equal to "separately", then bin identifies each special case that exists for the analyzed risk factor (e.g. NA, NaN, Inf).

  • pct: Percentage of observations in each bin.

  • cnt.unique: Number of unique values per bin.

  • min: Minimum value.

  • p1, p5, p25, p50, p75, p95, p99: Percentile values.

  • avg: Mean value.

  • avg.se: Standard error of the mean.

  • max: Maximum value.

  • neg: Number of negative values.

  • pos: Number of positive values.

  • cnt.outliers: Number of outliers. Records above or below Q75 ± 1.5 * IQR, where IQR = Q75 - Q25.

  • sc.ind: Special case indicator. It takes the value 1 if the share of special cases exceeds sc.threshold, and 0 otherwise.
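Several of the numeric metrics listed above can be approximated in base R. The following is a minimal sketch and not the PDtoolkit implementation: the helper name num_metrics is hypothetical, and the outlier count assumes the conventional Tukey fences (below Q25 - 1.5 * IQR or above Q75 + 1.5 * IQR), which may differ from what the package computes.

```r
# Hypothetical helper (not part of PDtoolkit): a few numeric metrics
# computed on the complete cases of a single risk factor.
num_metrics <- function(x) {
  x <- x[is.finite(x)]                       # drop NA, NaN, Inf and -Inf
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]                         # IQR = Q75 - Q25
  data.frame(
    cnt.unique = length(unique(x)),
    min = min(x),
    avg = mean(x),
    avg.se = sd(x) / sqrt(length(x)),        # standard error of the mean
    max = max(x),
    neg = sum(x < 0),
    pos = sum(x > 0),
    # Assumption: Tukey fences, Q25 - 1.5 * IQR and Q75 + 1.5 * IQR
    cnt.outliers = sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
  )
}

x <- c(-2, 1, 2, 3, 4, 5, 100, NA, Inf)
res <- num_metrics(x)
print(res)
```

In this toy vector only the value 100 falls outside the assumed fences, so cnt.outliers is 1.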

For categorical risk factors, the univariate report includes:

  • rf: Risk factor name.

  • rf.type: Risk factor class. This metric is equal to one of: character, factor or logical.

  • bin.type: Bin type - special or complete cases.

  • bin: Bin. If the sc.method argument is equal to "together", then bin and bin.type have the same value. If the sc.method argument is equal to "separately", then bin identifies each special case that exists for the analyzed risk factor (e.g. NA, NaN, Inf).

  • pct: Percentage of observations in each bin.

  • cnt.unique: Number of unique values per bin.

  • sc.ind: Special case indicator. It takes the value 1 if the share of special cases exceeds sc.threshold, and 0 otherwise.
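For a categorical risk factor, the bin-level metrics above reduce to shares and distinct counts. The sketch below is illustrative only and is not taken from the package: the bin labels "special cases" and "complete cases" are assumed names, NA is treated as the only special case, and special cases are grouped into one bin as with sc.method = "together".

```r
# Illustrative sketch (not PDtoolkit code): pct and cnt.unique per bin
# for a character risk factor with missing values.
x <- c("a", "b", "a", NA, "c", NA, "a", "b", NA, "a")
is.sc <- is.na(x)                            # assumption: NA is the only special case

report <- data.frame(
  rf = "x",
  rf.type = class(x),
  bin.type = c("special cases", "complete cases"),   # assumed labels
  pct = c(mean(is.sc), mean(!is.sc)),                # share of observations per bin
  cnt.unique = c(length(unique(x[is.sc])), length(unique(x[!is.sc])))
)
print(report)
```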

Usage

univariate(
  db,
  sc = c(NA, NaN, Inf, -Inf),
  sc.method = "together",
  sc.threshold = 0.2
)

Arguments

db

Data frame of risk factors supplied for univariate analysis.

sc

Vector of special case elements. Default values are c(NA, NaN, Inf, -Inf).

sc.method

Defines how special cases are treated: all together in one bin or in separate bins. Possible values are "together" and "separately".

sc.threshold

Threshold for special cases, expressed as a share of the total number of observations (the default 0.2 corresponds to 20%). If sc.method is set to "separately", the shares of the individual special cases are summed before being compared with the threshold.
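The summation described for sc.method = "separately" can be sketched in base R as follows. This is an assumption-laden illustration, not the package code: the variable names and the "complete cases" label are hypothetical, and the comparison is taken to be strict (share greater than sc.threshold).

```r
# Illustrative sketch (not PDtoolkit code): share of each special case,
# summed and compared with the threshold.
x <- c(1.5, NA, 3.2, Inf, NaN, 4.8, 2.1, NA, 7.0, 0.3)
sc.threshold <- 0.2

# Label each observation; is.nan() is checked first because is.na(NaN) is TRUE.
label <- ifelse(is.nan(x), "NaN",
         ifelse(is.na(x), "NA",
         ifelse(is.infinite(x), as.character(x), "complete cases")))

pct <- table(label) / length(x)              # share of observations per bin
pct.sc <- pct[names(pct) != "complete cases"]

sc.share <- sum(pct.sc)                      # shares of NA, NaN and Inf summed
sc.ind <- as.integer(sc.share > sc.threshold)
print(pct.sc)
print(sc.ind)
```

Here NA accounts for 20% of the observations and NaN and Inf for 10% each, so the summed share of 40% exceeds the 20% threshold and sc.ind is 1.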

Value

The command univariate returns a data frame with the univariate metrics described above for risk factors of class numeric, character, factor and logical.

Examples

suppressMessages(library(PDtoolkit))
data(gcd)
# introduce missing values so that special cases are present
gcd$age[100:120] <- NA
# bin numeric risk factors and keep the second element of the ndr.bin output
gcd$age.bin <- ndr.bin(x = gcd$age, y = gcd$qual, y.type = "bina")[[2]]
gcd$age.bin <- as.factor(gcd$age.bin)
gcd$maturity.bin <- ndr.bin(x = gcd$maturity, y = gcd$qual, y.type = "bina")[[2]]
gcd$amount.bin <- ndr.bin(x = gcd$amount, y = gcd$qual, y.type = "bina")[[2]]
# risk factors that consist of special cases only
gcd$all.miss1 <- NaN
gcd$all.miss2 <- NA
# logical risk factor
gcd$tf <- sample(c(TRUE, FALSE), nrow(gcd), replace = TRUE)
# create a date variable to confirm that it will not be processed by the function
gcd$dates <- Sys.Date()
str(gcd)
univariate(db = gcd)

PDtoolkit documentation built on Sept. 20, 2023, 9:06 a.m.