Discretize a continuous variable in a CausataData
object, and record the process so that it can be reapplied during scoring.
## S3 method for class 'CausataData'
Discretize(this, variableName, breaks, discrete.values, verbose=FALSE, ...)
this |
An object from |
variableName |
The name of the numeric |
breaks |
A numeric vector of two or more cut points. This is used by |
discrete.values |
A numeric vector of discrete values that the continuous values will be replaced with. See Details below for more information. |
verbose |
If |
... |
Unused arguments for other methods. |
This function uses cut to discretize the variable; it is called with include.lowest=TRUE and right=TRUE.
If N discrete bins are desired then breaks should have N+1 values for cut points.
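The relationship between cut points and bins can be checked with base R alone. The sketch below uses only cut with the same include.lowest and right settings named above; the sample vector and the discrete.values chosen here are illustrative, and the code is not taken from the internals of Discretize.

```r
# Illustrative data; not from the package.
x <- c(0, 2.5, 5, 7.5, 10)

# Three cut points define two bins: [0,5] and (5,10].
breaks <- c(0, 5, 10)
bins <- cut(x, breaks = breaks, include.lowest = TRUE, right = TRUE)

# Bin index for each value; the lowest break is included in the first bin.
idx <- as.integer(bins)

# Replace each bin index with a chosen discrete value (one per bin).
discrete.values <- c(10, 20)
mapped <- discrete.values[idx]
print(mapped)
```

With N+1 = 3 cut points the factor has N = 2 levels, so discrete.values needs two entries when no missing values are present.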
Missing values are permitted; they will be mapped to a separate bin during discretization. This arrangement has three important conditions:
First, missing values must not be replaced (as in CleanNaFromContinuous). Executing Discretize on a variable that was treated with CleanNaFromContinuous will generate an error.
Second, ReplaceOutliers must be executed before Discretize, and the upper limit must be less than or equal to the last breaks value. Missing values are mapped to an artificial bin that is greater than the last value of breaks. Using ReplaceOutliers ensures that outliers are mapped to existing bins and not to the missing-value bin.
Third, if missing values are present in the variable and there are N bins, then N+1 discrete.values are required. By convention missing values are mapped to the last value of discrete.values.
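The convention for missing values can be sketched in base R. This is a minimal illustration of the mapping described above, assuming the extra bin is simply the last position of discrete.values; it does not reproduce how Discretize records or applies the transformation internally, and the variable names are hypothetical.

```r
# Illustrative data with missing values; not from the package.
x <- c(0.5, NA, 7, 3, NA)
breaks <- c(0, 5, 10)              # N = 2 bins
discrete.values <- c(1, 2, 0)      # N + 1 values; the last one is for NA

idx <- as.integer(cut(x, breaks = breaks, include.lowest = TRUE, right = TRUE))

# Send missing values to the extra, last position of discrete.values.
idx[is.na(x)] <- length(discrete.values)
result <- discrete.values[idx]
print(result)

# A value beyond the last break is also NA after cut(), so it would be
# indistinguishable from a missing value unless it is clamped beforehand.
out_of_range <- cut(12, breaks = breaks, include.lowest = TRUE, right = TRUE)
print(is.na(out_of_range))
```

The last two lines suggest why the outlier step comes first: cut returns NA for values outside the range of breaks, so unclamped outliers would land in the bin reserved for missing values.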
Returns a CausataData object.
Justin Hemann <support@causata.com>.
CausataData, CausataVariable, cut, CleanNaFromContinuous, ReplaceOutliers.
# create a random variable and a dependent variable
set.seed(1234)
ivn <- rnorm(1e5) # random data, normally distributed, no missing values
ivm <- ivn # create a copy, but replace the first 100 values with NA (missing)
ivm[1:100] <- NA
dvn <- rep(0, 1e5)
dvn[(ivn + rnorm(1e5, sd=0.5))>0] <- 1
causataData <- CausataData(data.frame(ivn__AP=ivn, ivm__AP=ivm), dependent.variable=dvn)
# plot data before discretization
hist(causataData$df$ivn__AP, main="Before discretization.", col="gray")
# the replace outliers step is required
causataData <- ReplaceOutliers(causataData, 'ivn__AP',
  lowerLimit=min(causataData$df$ivn__AP),
  upperLimit=max(causataData$df$ivn__AP))
# discretize with deciles, 1st decile is mapped to 1, 2nd to 2, etc.
breaks <- quantile(ivn, probs=seq(0,1,0.1))
causataData <- Discretize(causataData, 'ivn__AP', breaks, 1:10, verbose=TRUE)
# plot data after discretization
hist(causataData$df$ivn__AP, main="After discretization.", col="gray", breaks=seq(0.5,10.5,1))
# Discretize data where missing values are present.
# One extra value is required for discrete.values, map missing to 0.
# By convention missing values are mapped to the last element in discrete.values
causataData <- ReplaceOutliers(causataData, 'ivm__AP',
  lowerLimit=min(causataData$df$ivm__AP, na.rm=TRUE),
  upperLimit=max(causataData$df$ivm__AP, na.rm=TRUE))
causataData <- Discretize(causataData, 'ivm__AP', breaks, c(1:10,0), verbose=TRUE)