library(pmut)
library(data.table)
library(ggplot2)
knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE, fig.width=10, fig.height=7)

This package is a collection of utility functions that facilitate general predictive modeling work. Its functions cover, but are not limited to, diagnostic visualization, model metrics, and data quality checks. If you have feedback, or would like a function added to the package, please reach out to chengjun.hou@gmail.com or connect via GitHub.

To install the package, use the following command in R:

devtools::install_github("chengjunhou/pmut")


Diagnostic Visualization {#dgvis}

pmut.edap.disc

This function draws a line plot of one discrete feature against the response, together with a distribution histogram for that feature. In the line plot, the discrete feature is on the x-axis and the response, labeled Actual, is on the y-axis. NA is treated as its own level. Additional Prediction lines can be drawn by passing a data.frame of predictions through pred.df.

pmut.edap.disc(datatable, varstring, targetstring, pred.df=NULL)

We use the diamonds dataset from ggplot2 for the demo:

df = data.frame(ggplot2::diamonds)
pmut.edap.disc(df, "color", "price", pred.df=data.frame(GLM=rnorm(dim(df)[1],4000,5000)))
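
To make the two panels concrete, here is a minimal base-R sketch of the numbers such a plot could be built from. It assumes the Actual line shows the mean response per level, which the text above does not spell out, and the toy data and variable names are illustrative only, not part of pmut.

```r
# Toy data: a discrete feature with one missing value, plus a numeric response.
toy <- data.frame(
  grade = c("a", "b", "a", NA, "b", "a"),
  y     = c(10, 20, 30, 40, 60, 50)
)

# addNA() turns NA into an explicit factor level, so it forms its own group.
grade_f <- addNA(factor(toy$grade))

# Assumed content of the "Actual" line: mean response per level.
actual_mean <- tapply(toy$y, grade_f, mean)

# Assumed content of the histogram: row count per level.
level_count <- table(grade_f)

actual_mean
level_count
```

A Prediction line would be computed the same way, with a column of pred.df in place of y.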

pmut.edap.cont

This function draws a line plot of one continuous feature against the response, together with a distribution histogram for that feature. In the line plot, the continuous feature is cut into bins that are placed on the x-axis, and the response, labeled Actual, is on the y-axis. The binning is controlled by meta and qbin. NA is treated as its own bin. Additional Prediction lines can be drawn by passing a data.frame of predictions through pred.df.

pmut.edap.cont(datatable, varstring, targetstring, meta=c(50,4,0.01,0.99), qbin=FALSE, pred.df=NULL)

Note that in the following view the first bin ranges from the minimum of "carat" to its 1st percentile (meta[3]=0.01), while the last bin ranges from the 99th percentile to the maximum of "carat" (meta[4]=0.99).

pmut.edap.cont(df, "carat", "price", pred.df=data.frame(GLM1=rnorm(dim(df)[1],4000,5000),
                                                        GLM2=rnorm(dim(df)[1],2000,5000)))

Note that in the following quantile view, because the outlier percentiles are set to 0% (meta[3]=0) and 100% (meta[4]=1), meta[1] must be set to 12 to get 10 bins in the view. The counts within each bin are not perfectly equal because of rounding and the nature of the data.

pmut.edap.cont(df, "carat", "price", meta=c(12,2,0,1), qbin=TRUE)
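
One way to read the binning behavior described above is sketched below. This is an assumption about the logic, not pmut's actual code: sketch_breaks() is a hypothetical helper that reserves one bin below the lower percentile and one above the upper percentile, then fills the range in between with either equal-width or quantile bins. The role of meta[2] is not described in the text, so it is left out.

```r
# Hypothetical helper: n_bins, lo and hi stand in for meta[1], meta[3], meta[4].
sketch_breaks <- function(x, n_bins, lo, hi, qbin = FALSE) {
  if (qbin) {
    # Quantile bins: inner break points at evenly spaced probabilities.
    inner <- quantile(x, seq(lo, hi, length.out = n_bins - 1), names = FALSE)
  } else {
    # Equal-width bins between the two outlier percentiles.
    q <- quantile(x, c(lo, hi), names = FALSE)
    inner <- seq(q[1], q[2], length.out = n_bins - 1)
  }
  # Outer bins run from min to the lower percentile and from the upper
  # percentile to max; unique() drops them when they have zero width.
  unique(c(min(x), inner, max(x)))
}

set.seed(1)
x <- rexp(1000)  # skewed toy feature

b_default  <- sketch_breaks(x, 12, 0.01, 0.99)
b_quantile <- sketch_breaks(x, 12, 0, 1, qbin = TRUE)

length(b_default) - 1    # number of bins with outlier bins present
length(b_quantile) - 1   # number of bins when the outlier bins collapse
table(cut(x, b_quantile, include.lowest = TRUE))
```

Under this reading, setting the percentiles to 0 and 1 makes the two outer bins zero-width, which is consistent with 12 requested bins showing up as 10.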

pmut.edap

This function creates visualizations for a vector of features, using either pmut.edap.disc() or pmut.edap.cont() depending on the class of each feature. Columns of class factor, character, and logical use pmut.edap.disc(); columns of class numeric use pmut.edap.cont(); columns of class integer use pmut.edap.disc() when the number of unique values is smaller than the number of bins specified by meta, and pmut.edap.cont() otherwise. Progress information is printed to the console.

The arguments are the same as for pmut.edap.cont(), except that varvec takes the place of varstring.

pmut.edap(datatable, varvec, targetstring, meta=c(50,4,0.01,0.99), qbin=FALSE, pred.df=NULL)

# output the plots into a pdf file
pdf("EDA_Diamonds.pdf", width=12, height=10)
pmut.edap(df, names(df)[-7], "price")
dev.off()
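
The class-based dispatch rule described above can be sketched as a small function. choose_plot() is a hypothetical name used for illustration, and n_bins stands in for the number of bins in meta; the real function may differ in details such as how it treats other classes.

```r
# Hypothetical sketch of the dispatch rule; returns which plot type applies.
choose_plot <- function(x, n_bins = 50) {
  if (is.factor(x) || is.character(x) || is.logical(x)) {
    return("disc")
  }
  # Check integer before numeric, because is.numeric() is TRUE for integers.
  if (is.integer(x)) {
    return(if (length(unique(x)) < n_bins) "disc" else "cont")
  }
  if (is.numeric(x)) {
    return("cont")
  }
  stop("unsupported column class: ", class(x)[1])
}

toy <- data.frame(
  flag  = c(TRUE, FALSE, TRUE),
  count = c(1L, 2L, 2L),
  size  = c(0.3, 1.2, 0.7)
)
sapply(toy, choose_plot)
```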

Model Metric {#metrc}

pmut.auc

This function calculates the area under the ROC curve for predictions against actuals, without any package dependency.

pmut.auc(aa, pp, plot=FALSE)

actuals = c(1,1,1,1,0,1,1,0,1,0,1,0,1,0,0,1,0,0,0,0)
predicts = rev(seq_along(actuals)); predicts[9:10] = mean(predicts[9:10])
pmut.auc(actuals, predicts, plot=TRUE)
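
For readers who want to see how an AUC can be computed in base R alone, here is a sketch that uses the rank-sum (Mann-Whitney) identity. auc_rank() is a hypothetical helper and not necessarily how pmut.auc() is implemented; it assumes binary actuals coded as 0 and 1.

```r
# Hypothetical helper: AUC via the rank-sum identity.
auc_rank <- function(actual, pred) {
  r <- rank(pred)              # average ranks, so tied predictions count as 0.5
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Same toy data as in the demo above.
actuals  <- c(1,1,1,1,0,1,1,0,1,0,1,0,1,0,0,1,0,0,0,0)
predicts <- rev(seq_along(actuals)); predicts[9:10] <- mean(predicts[9:10])
auc_rank(actuals, predicts)   # 0.825
```

The value equals the share of (positive, negative) pairs in which the positive case receives the higher prediction, with ties counted as half.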

pmut.gini

This function calculates the standardized Gini coefficient for predictions against actuals.

pmut.gini(aa, pp, print=FALSE)

pmut.gini(actuals, predicts, print=TRUE)
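
A common definition of a standardized (normalized) Gini coefficient divides the Gini of the model's ordering by the Gini of a perfect ordering. The sketch below follows that definition as an assumption; gini_raw() and gini_norm() are hypothetical helpers, and pmut.gini() may handle ties or scaling differently.

```r
# Hypothetical helper: area between the Lorenz-style curve and the diagonal.
gini_raw <- function(actual, pred) {
  n <- length(actual)
  ord <- order(pred, decreasing = TRUE)         # highest predictions first
  lorenz <- cumsum(actual[ord]) / sum(actual)   # cumulative share of actual captured
  sum(lorenz - seq_len(n) / n) / n
}

# Hypothetical helper: scale by the value a perfect ordering would achieve.
gini_norm <- function(actual, pred) {
  gini_raw(actual, pred) / gini_raw(actual, actual)
}

a <- c(0, 0, 1, 1)
gini_norm(a, c(0.1, 0.2, 0.8, 0.9))   # perfect ordering
gini_norm(a, c(0.9, 0.8, 0.2, 0.1))   # fully reversed ordering
```

With this scaling, a perfect ordering scores 1 and a fully reversed ordering scores -1, which makes values comparable across datasets with different response distributions.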

Data Preparation for Scoring {#score}

pmut.base.find

This function collects meta information for each column of the training data. The meta information is later used to process new data so that it can be scored without error; see pmut.base.prep() for the preparation step. Meta information for columns of class factor, character, and logical forms one list, where each element has three slots: $VarString is the column name, $LvlVec is the vector of unique levels, and $LvlBase is the base level, that is, the level with the highest count. Meta information for columns of class integer and numeric forms a second list, where each element has two slots: $VarString is the column name and $ValueMean is the column mean.

pmut.base.find(DATA)
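
To illustrate the shape of the two meta lists, here is a small base-R sketch. cat_meta() and num_meta() are hypothetical helpers; the slot names follow the description above, while details such as ignoring NA when counting levels or averaging are assumptions.

```r
toy <- data.frame(
  cut   = c("Good", "Ideal", "Ideal", NA),
  carat = c(0.3, 0.5, NA, 0.7),
  stringsAsFactors = FALSE
)

# Hypothetical helper: meta for one categorical column.
cat_meta <- function(df, var) {
  counts <- table(df[[var]])   # table() drops NA by default
  list(
    VarString = var,
    LvlVec    = names(counts),
    LvlBase   = names(counts)[which.max(counts)]   # most frequent level
  )
}

# Hypothetical helper: meta for one numeric column.
num_meta <- function(df, var) {
  list(VarString = var, ValueMean = mean(df[[var]], na.rm = TRUE))
}

str(cat_meta(toy, "cut"))
str(num_meta(toy, "carat"))
```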

pmut.base.prep

This function takes the meta information generated by pmut.base.find() and prepares new data so that it can be scored without error. It does the following: it imputes missing values with the base level (categorical) or the mean value (numeric); it assigns levels that are observed in the new data but not found in the meta to the base level; it handles levels that are found in the meta but not observed in the new data by converting the column to a factor with the levels specified explicitly; it handles a column that is found in the meta but absent from the new data by filling the entire column with the base level or the mean value; it attaches the symbol "!" to every base level; lastly, it orders the columns alphabetically. Note that data processed by this function only has two classes: factor for categorical columns and numeric for numeric columns. model.matrix() will then produce a data matrix with exactly the same format, ready to be scored by a glmnet or xgboost model.

pmut.base.prep(DATA, CatMeta, NumMeta)

temp = pmut.base.find(data.frame(ggplot2::diamonds))
# remove two columns
newdata = data.frame(ggplot2::diamonds)[,-c(2,6)]
# assign new color
newdata$color = "NEW"
# temp[[1]] categorical meta, temp[[2]] numeric meta 
newdata = pmut.base.prep(newdata, temp[[1]], temp[[2]])
head(newdata)
sapply(newdata, class)

Note that the symbol "!" is attached to make sure that model.matrix() removes the same level every time it dummy-encodes a categorical feature. For this reason, after obtaining the meta lists from the training data with pmut.base.find(), the training data also needs to be processed by pmut.base.prep() before model fitting.
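
The effect on dummy encoding can be seen in a small base-R sketch. It assumes the "!" is added as a prefix and that the marked base level ends up first among the factor levels; here the levels are given explicitly so the result does not depend on locale-specific sorting. The variable names are illustrative only.

```r
color <- c("G", "E", "D", "G", "G", "E")
base  <- "G"                                # most frequent level

# Mark the base level (assumed here to be a prefix).
color[color == base] <- paste0("!", base)

# Put the marked base level first; treatment contrasts drop the first level.
lvls <- c(paste0("!", base), "D", "E")
f    <- factor(color, levels = lvls)

mm <- model.matrix(~ f)
colnames(mm)   # no dummy column for the base level "!G"
```

Because the same level is dropped for training and scoring data, the resulting matrices have matching columns.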

Simple Quality Check {#check}

pmut.data.pmis

This function checks the percentage of NA values (including empty strings for character columns) in each column of the data.

pmut.data.pmis(DATA)

pmut.data.pmis(data.frame(ggplot2::diamonds))
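
A base-R sketch of this kind of check is shown below. pct_missing() is a hypothetical helper, and it returns a fraction between 0 and 1; whether pmut.data.pmis() reports fractions or percentages is not stated above.

```r
# Hypothetical helper: share of missing entries per column.
pct_missing <- function(df) {
  sapply(df, function(x) {
    # Empty strings only count as missing for character columns.
    empty <- if (is.character(x)) !is.na(x) & x == "" else rep(FALSE, length(x))
    mean(is.na(x) | empty)
  })
}

toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "", NA, "y"),
  stringsAsFactors = FALSE
)
pct_missing(toy)
```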

pmut.data.same

This function checks whether the data contains any duplicated columns.

pmut.data.same(DATA)

pmut.data.same(data.frame(ggplot2::diamonds))
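
As a rough illustration of the idea, duplicated columns can be found in base R by treating the data.frame as a list of columns. This is a sketch with made-up data, not pmut's implementation, and it only flags columns whose values and types match exactly.

```r
toy <- data.frame(a = 1:3, b = c(2, 5, 7), c = 1:3)

# duplicated() on a list marks every element equal to an earlier one.
dup <- duplicated(as.list(toy))
names(toy)[dup]   # "c" repeats column "a"
```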

pmut.data.scal

This function standardizes the numeric columns in the data.

pmut.data.scal(DATA)

head(pmut.data.scal(data.frame(ggplot2::diamonds)))
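
A minimal sketch of standardization in base R follows, assuming it means centering each numeric column to mean 0 and scaling it to standard deviation 1. standardize_numeric() is a hypothetical helper; pmut.data.scal() may use a different scaling.

```r
# Hypothetical helper: z-score every numeric column, leave the rest untouched.
standardize_numeric <- function(df) {
  num <- vapply(df, is.numeric, logical(1))
  df[num] <- lapply(df[num], function(x) {
    (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  })
  df
}

toy <- data.frame(
  size  = c(1, 2, 3, 4),
  label = c("a", "b", "a", "b"),
  stringsAsFactors = FALSE
)
scaled <- standardize_numeric(toy)
scaled
```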



chengjunhou/pmut documentation built on May 23, 2019, 4:24 p.m.