Discretize.CausataData: Discretizes a continuous variable in a CausataData object.

Description

Discretize a continuous variable in a CausataData object, and record the process so that it can be reapplied during scoring.

Usage

## S3 method for class 'CausataData'
Discretize(this, variableName, breaks, discrete.values, verbose=FALSE, ...)

Arguments

this

An object from CausataData.

variableName

The name of the numeric CausataVariable to discretize.

breaks

A numeric vector of two or more cut points. This is used by cut to discretize the variable. See Details below for more information.

discrete.values

A numeric vector of discrete values that will replace the continuous values. See Details below for more information.

verbose

If TRUE then binning information is printed to the console.

...

Unused arguments for other methods.

Details

This function uses cut to discretize the variable; it is called with include.lowest=TRUE and right=TRUE. If N discrete bins are desired then breaks should have N+1 values for cut points.
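
As a rough illustration in base R (not the package's internal code), the sketch below calls cut with the same two arguments on a small hypothetical vector. The names x, breaks and bins are made up for this example. It shows that four break values yield three bins, and that a value equal to the lowest break lands in the first bin rather than becoming NA.

```r
# Hypothetical data; the variable names are illustrative only.
x <- c(0, 1, 2, 3, 4, 5, 6)
breaks <- c(0, 2, 4, 6)  # N + 1 = 4 cut points for N = 3 bins

# include.lowest=TRUE keeps x == 0 in the first bin;
# right=TRUE makes each bin closed on its right edge, e.g. (2,4].
bins <- cut(x, breaks=breaks, include.lowest=TRUE, right=TRUE)

nlevels(bins)      # number of bins produced
as.integer(bins)   # bin index assigned to each element of x
```

With right=TRUE a value that sits exactly on an interior break, such as 2 here, belongs to the lower of the two adjacent bins.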

Missing values are permitted; they are mapped to a separate bin during discretization. This arrangement comes with three important conditions:

First, missing values must not be replaced beforehand (as CleanNaFromContinuous does). Executing Discretize on a variable that was treated with CleanNaFromContinuous will generate an error.

Second, ReplaceOutliers must be executed before Discretize, and its upper limit must be less than or equal to the last value of breaks. Missing values are mapped to an artificial bin that lies above the last value of breaks, so using ReplaceOutliers ensures that outliers are mapped to the existing bins and not to the missing-value bin.

Third, if missing values are present in the variable and there are N bins, then N+1 discrete.values are required. By convention missing values are mapped to the last value of discrete.values.
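
The three conditions above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: pmin and pmax stand in for ReplaceOutliers, the missing-value bin is represented simply as position N+1 in discrete.values, and all data and variable names are hypothetical.

```r
# Hypothetical data with one low outlier, one missing value and one high outlier.
x <- c(-5, 0.5, 1.5, NA, 2.5, 99)
breaks <- c(0, 1, 2, 3)              # N = 3 bins
discrete.values <- c(10, 20, 30, 0)  # N + 1 values; the last one is for missing data

# Clamp outliers into the range of breaks (a stand-in for ReplaceOutliers).
# NA values pass through unchanged because they were not replaced beforehand.
x.clamped <- pmin(pmax(x, min(breaks)), max(breaks))

# Bin the clamped values; cut leaves NA as NA.
idx <- as.integer(cut(x.clamped, breaks=breaks, include.lowest=TRUE, right=TRUE))

# Send missing values to the extra, last position of discrete.values.
idx[is.na(idx)] <- length(discrete.values)

x.discrete <- discrete.values[idx]
x.discrete
```

Without the clamping step, -5 and 99 would fall outside breaks, come back from cut as NA, and be indistinguishable from the genuinely missing value.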

Value

Returns a CausataData object.

Author(s)

Justin Hemann <support@causata.com>.

See Also

CausataData, CausataVariable, cut, CleanNaFromContinuous, ReplaceOutliers.

Examples

library(Causata)  # load the package that provides CausataData, ReplaceOutliers and Discretize

# create a random variable and a dependent variable
set.seed(1234)
ivn <- rnorm(1e5) # random data, normally distributed, no missing values
ivm <- ivn  # create a copy, but replace the first 100 values with NA (missing)
ivm[1:100] <- NA
dvn <- rep(0, 1e5)
dvn[(ivn + rnorm(1e5, sd=0.5))>0] <- 1
causataData <- CausataData(data.frame(ivn__AP=ivn, ivm__AP=ivm), dependent.variable=dvn)

# plot data before discretization
hist(causataData$df$ivn__AP, main="Before discretization.", col="gray")

# the replace outliers step is required
causataData <- ReplaceOutliers(causataData, 'ivn__AP', 
  lowerLimit=min(causataData$df$ivn__AP), 
  upperLimit=max(causataData$df$ivn__AP))

# discretize with deciles, 1st decile is mapped to 1, 2nd to 2, etc.
breaks <- quantile(ivn, probs=seq(0,1,0.1))
causataData <- Discretize(causataData, 'ivn__AP', breaks, 1:10, verbose=TRUE)

# plot data after discretization
hist(causataData$df$ivn__AP, main="After discretization.", col="gray", breaks=seq(0.5,10.5,1))

# Discretize data where missing values are present.  
# One extra value is required for discrete.values, map missing to 0.
# By convention missing values are mapped to the last element in discrete.values
causataData <- ReplaceOutliers(causataData, 'ivm__AP', 
  lowerLimit=min(causataData$df$ivm__AP, na.rm=TRUE),
  upperLimit=max(causataData$df$ivm__AP, na.rm=TRUE))
causataData <- Discretize(causataData, 'ivm__AP', breaks, c(1:10,0), verbose=TRUE)

Causata documentation built on May 2, 2019, 3:26 a.m.