```r
knitr::opts_chunk$set(echo = TRUE)
```
This file demonstrates a typical process of using the R package "cleandata" to prepare data for machine learning.
The package is a collection of functions that work with data frames to inspect, impute, encode, and partition data. The functions for imputation, encoding, and partitioning can produce log files to help you keep track of the data-manipulation process.
Available on CRAN: https://cran.r-project.org/package=cleandata
Source Codes on GitHub: https://github.com/sherrisherry/cleandata
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this dataset for predicting housing prices is a typical example of what a business analyst encounters every day.
According to the description of this dataset, the "NA"s in some columns aren't missing values. To prevent R from confusing them with true missing values, read in the data files without converting any value to NA.
The train set should have only one more column, SalePrice, than the test set.
```r
# import the 'cleandata' package
library('cleandata')
# read in the training and test datasets without converting 'NA's to missing values
train <- read.csv('data/train.csv', na.strings = "", strip.white = TRUE)
test <- read.csv('data/test.csv', na.strings = "", strip.white = TRUE)
# summarize the training set and the test set
cat(paste('train: ', nrow(train), 'obs. ', ncol(train), 'cols\ncolumn names:\n',
          toString(colnames(train)), '\n\ntest: ', nrow(test), 'obs. ', ncol(test),
          'cols\ncolumn names:\n', toString(colnames(test)), '\n'))
```
To keep the following imputation and encoding steps consistent across the train set and the test set, I appended the test set to the train set. The SalePrice values of the test-set rows were set to NA to distinguish them from the train-set rows. The resulting data frame was called df.
```r
# fill the target column of the test set with NA, then combine the test and training sets
test$SalePrice <- NA
df <- rbind(train, test)
rm(train, test)
```
Function inspect_na

inspect_na() counts the number of NAs in each column and sorts the columns in descending order. In the following operation, inspect_na() returned the top 5 columns with missing values. To see the number of missing values in every column, leave the parameter top at its default. As expected, only SalePrice contained missing values, and their count equaled the number of rows in the test set.
```r
inspect_na(df, top = 5)
```
The NAs in the columns listed in NAisNoA were what I referred to as 'none'-but-not-'NA' values. In these columns, NA has only one possible meaning: "not applicable". I replaced these NAs with "NoA" to prevent imputing them later.
```r
# in the 'NAisNoA' columns, NA means the attribute doesn't apply, not that the value is missing
NAisNoA <- c('Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
             'BsmtFinType2', 'FireplaceQu', 'GarageType', 'GarageFinish',
             'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature')
for(i in NAisNoA){levels(df[, i])[levels(df[, i]) == "NA"] <- "NoA"}
```
At this stage, I reconstructed the data frame df to inspect the true missing values. We can see that only LotFrontage had about 20% missing values; the other columns had few to no missing values.
```r
# write the dataset to a csv file, then read it back into df to reconstruct df
write.csv(df, file = 'data/data.csv', row.names = FALSE)
df <- read.csv('data/data.csv', na.strings = "NA", strip.white = TRUE)
# see which predictors have the most NAs
inspect_na(df[, -ncol(df)], top = 25)
```
Function inspect_map

inspect_map() classifies the columns of a data frame. Before explaining this function further, I'd like to introduce 'scheme'. In package "cleandata", a scheme refers to the set of all possible values of an enumerator; factor objects in R are enumerators.

inspect_map() returns a list of factor_cols (list), factor_levels (list), num_cols (vector), char_cols (vector), ordered_cols (vector), and other_cols (vector). See the documentation of the parameter common for more information about schemes. In the following code, I specified that two factor columns share the same scheme if their levels have more than 2 values in common, by setting the common parameter to 2. By default, common is 0, which means every level of the two factor columns must be the same for them to share a scheme.
```r
# create a map for imputation and encoding
data_map <- inspect_map(df[, -ncol(df)], common = 2)
summary(data_map)
```
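To make the common rule concrete, here is a base-R sketch of when two factor columns would be treated as sharing a scheme. This is my own illustration of the rule described above, not cleandata's implementation:

```r
# Illustrative sketch: with common = 0 the level sets must match exactly;
# with common > 0, the level sets must overlap in more than `common` values.
shares_scheme <- function(a, b, common = 0) {
  if (common == 0) return(setequal(levels(a), levels(b)))
  length(intersect(levels(a), levels(b))) > common
}

x <- factor(c('Po', 'Fa', 'TA'), levels = c('Po', 'Fa', 'TA', 'Gd', 'Ex'))
y <- factor(c('TA', 'Gd', 'Ex'), levels = c('Po', 'Fa', 'TA', 'Gd', 'Ex'))
z <- factor(c('Grvl', 'Pave'))
shares_scheme(x, y, common = 2)  # five shared levels: TRUE
shares_scheme(x, z, common = 2)  # no shared levels: FALSE
```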
This dataset had only factor and numeric columns. I unpacked data_map before moving on to imputation and encoding.
```r
factor_cols <- data_map$factor_cols
factor_levels <- data_map$factor_levels
num_cols <- data_map$num_cols
rm(data_map)
```
The functions for imputation and encoding keep track of your process by producing log files. This feature is disabled by default. To enable logging, I created a variable log_arg to store the list of arguments for sink(). In old versions, log_arg had to be assigned to the log parameter in every imputation or encoding function; that style is still supported by this version.
```r
# create a list of arguments for producing a log file
log_arg <- list(file = 'log.txt', append = TRUE, split = FALSE)
```
log_arg can be a list of any arguments for sink(). In this example, the log file was named "log.txt", new information was appended to the file, and the contents of the log file weren't echoed to the standard output.
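Since log_arg is just a list of sink() arguments, applying it boils down to something like the following base-R sketch. This is an illustration of the mechanism; cleandata's internals may differ:

```r
# plain base R: redirect output using a list of sink() arguments
log_arg <- list(file = 'log.txt', append = TRUE, split = FALSE)
do.call(sink, log_arg)          # open the log file for output
cat('imputation step finished\n')
sink()                          # close it, restoring normal output
cat(readLines('log.txt'), sep = '\n')
```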
In this version, the parameter log by default searches for a list called log_arg in the dynamically scoped parent environment and takes its value. If log is assigned a list directly, it uses the assigned value. If there is no list log_arg in the parent environment and no list is assigned to log, no log file is produced.
To prevent leakage, I instructed the imputation functions to use only the rows of the train set to calculate the imputation values, by passing an index to the parameter idx.
Functions impute_mode, impute_median, impute_mean

impute_mode() works with numerical, string, and factor columns. It imputes NAs with the mode of the corresponding column. impute_median() and impute_mean() work only with numerical columns; they impute NAs with the median and mean respectively.
```r
# impute NAs in factor columns with the mode of the corresponding column
lst <- unlist(factor_cols)
df <- impute_mode(df, cols = lst, idx = !is.na(df$SalePrice))
# impute NAs in numerical columns with the median of the corresponding column
lst <- num_cols
df <- impute_median(df, cols = lst, idx = !is.na(df$SalePrice))
# check the result
inspect_na(df[, -ncol(df)], top = 5)
```
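The idea of restricting imputation to train-set rows can be sketched in base R as follows. This illustrates the principle only; it is not impute_mode()'s actual code:

```r
# Mode imputation sketch: compute the mode on training rows only (idx),
# then fill NAs everywhere, so the test set never influences the statistic.
impute_mode_sketch <- function(x, idx) {
  tab <- table(x[idx])                     # frequencies on training rows only
  mode_val <- names(tab)[which.max(tab)]   # most frequent value (ties: first level)
  x[is.na(x)] <- mode_val
  x
}

x <- factor(c('a', 'b', 'b', NA, 'a', 'b', NA))
idx <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)  # first 5 rows are 'train'
impute_mode_sketch(x, idx)  # NAs become 'a', the mode on the train rows
```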
Every encoding function prints a summary of the columns before and after encoding by default. The output of encode_ordinal() and encode_binary() is factor by default. If you want numerical output, set the parameter out.int to TRUE after making sure the input has no missing values.
In this demo, I kept the encoded columns as factors because I intended to save the dataset to a csv file, which doesn't distinguish between factor and numerical columns.
In business datasets, we often find ratings, which are ordinal and tend to use similar schemes. In my experience, if many columns share the same scheme, they are likely to be ratings.
```r
summary(factor_cols)
```
In our dataset, ExterQual and 9 other columns share the same scheme. After checking their scheme and the description file, I was sure they were ordinal.
```r
factor_levels$ExterQual
```
"Po": poor, "Fa": fair, "TA": typical/average, "Gd": good, "Ex": excellent
Function encode_ordinal

encode_ordinal() encodes ordinal data into sequential integers in a given order. The value passed to none is always encoded as 0. The 1st member of the vector passed to order is encoded as 1.
```r
# encode ordinal columns
i <- 'ExterQual'
lst <- c('Po', 'Fa', 'TA', 'Gd', 'Ex')
df[, factor_cols[[i]]] <- encode_ordinal(df[, factor_cols[[i]]], order = lst, none = 'NoA')
# remove the encoded columns from the map
factor_levels[[i]] <- NULL
factor_cols[[i]] <- NULL
```
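The encoding rule above can be sketched in a few lines of base R. This is an illustration of the none-to-0, order-to-1..n rule, not encode_ordinal()'s implementation:

```r
# Ordinal encoding sketch: 'none' maps to 0, the first member of 'order'
# maps to 1, the second to 2, and so on.
encode_ordinal_sketch <- function(x, order, none = NULL) {
  codes <- match(as.character(x), order)           # 1-based position in 'order'
  if (!is.null(none)) codes[as.character(x) == none] <- 0
  codes
}

x <- c('Po', 'TA', 'Ex', 'NoA', 'Gd')
encode_ordinal_sketch(x, order = c('Po', 'Fa', 'TA', 'Gd', 'Ex'), none = 'NoA')
# 1 3 5 0 4
```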
The Utilities column was binary according to the dataset.
```r
factor_levels$Utilities
levels(df$Utilities)
```
However, the description file indicates that it has 4 possible values: 'ELO', 'NoSeWa', 'NoSewr', 'AllPub'. Therefore, I encoded it as having 4 levels.
```r
# the dataset contains only "AllPub" and "NoSeWa", plus 2 NAs
i <- 'Utilities'
lst <- c('ELO', 'NoSeWa', 'NoSewr', 'AllPub')
df[, factor_cols[[i]]] <- encode_ordinal(df[, factor_cols[[i]], drop = FALSE], order = lst)
factor_levels[[i]] <- NULL
factor_cols[[i]] <- NULL
```
```r
# find all the 2-level columns
lst <- lapply(factor_levels, length)
lst <- as.data.frame(lst)
lst <- colnames(lst[, lst == 2])
cat(lst)
```
Function encode_binary

encode_binary() encodes binary data into 0 and 1, regardless of order.
```r
# encode all the 2-level columns
i <- unlist(factor_cols[lst])
df[, i] <- encode_binary(df[, i, drop = FALSE])
factor_levels[lst] <- NULL
factor_cols[lst] <- NULL
```
Although we might have found more ordinal columns, I wanted to speed up the process, so I assumed that all the remaining categorical columns were unordered.
Function encode_onehot

encode_onehot() encodes categorical data with one-hot encoding.
```r
# encode all the other categorical columns
i <- unlist(factor_cols)
df0 <- encode_onehot(df[, i, drop = FALSE])
df[, i] <- NULL
df <- cbind(df, df0)
```
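One-hot encoding itself can be sketched in base R with model.matrix(). This is an illustration of the technique, not encode_onehot()'s actual implementation:

```r
# One-hot sketch: one 0/1 indicator column per factor level;
# `- 1` drops the intercept so every level gets its own column.
x <- data.frame(Street = factor(c('Grvl', 'Pave', 'Pave')))
onehot <- model.matrix(~ Street - 1, data = x)
onehot  # columns StreetGrvl and StreetPave
```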
Function partition_random

partition_random() partitions a dataset randomly.
```r
# partition the dataset
df0 <- partition_random(df[!is.na(df$SalePrice), ], train = 8, test = FALSE)
```
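A random partition like this can be sketched in base R as follows. I am assuming here that train = 8 means roughly 8 of every 10 rows go to the train split; check the package documentation for the exact semantics:

```r
# Random 80/20 split sketch (illustration only, not partition_random's code):
# a random permutation of row indices, thresholded at 80% of n.
set.seed(42)
n <- 100
idx <- sample(n) <= 0.8 * n   # TRUE for exactly 80 of the 100 rows
train_rows <- which(idx)
valid_rows <- which(!idx)
length(train_rows); length(valid_rows)
```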
Let's check the log file at the end.
```r
cat(paste(readLines('log.txt'), collapse = '\n'))
```
=== end ===