knitr::opts_chunk$set(echo = TRUE)

About This Demo

This file demonstrates a typical process of using the R package "cleandata" to prepare data for machine learning.

R Package "cleandata"

A collection of functions that work with data frames to inspect, impute, encode, and partition data. The functions for imputation, encoding, and partitioning can produce log files to help you keep track of the data manipulation process.

Available on CRAN: https://cran.r-project.org/package=cleandata

Source Code on GitHub: https://github.com/sherrisherry/cleandata

Dataset Used in This Demo

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, for predicting housing prices, this dataset is a typical example of what a business analyst encounters every day.


Inspection

According to the description of this dataset, the "NA"s in some columns aren't missing values. To prevent R from confusing them with true missing values, read in the data files without converting any value to NA in R.

The train set should have only one more column, SalePrice, than the test set.

# load the 'cleandata' package
library('cleandata')

# read in the training and test datasets without converting 'NA's to missing values.
train <- read.csv('data/train.csv', na.strings = "", strip.white = TRUE)
test <- read.csv('data/test.csv', na.strings = "", strip.white = TRUE)

# summarize the training set and test set
cat(paste('train: ', nrow(train), 'obs. ', ncol(train), 'cols\ncolumn names:\n', toString(colnames(train)), 
          '\n\ntest: ', nrow(test), 'obs. ', ncol(test), 'cols\ncolumn names:\n', toString(colnames(test)), '\n'))

To ensure consistency in the following imputation and encoding process across the train set and the test set, I appended the test set to the train set. The SalePrice values of the test-set rows were set to NA to distinguish them from the train-set rows. The resulting data frame was called df.

# filling the target columns of the test set with NA then combining test and training sets
test$SalePrice <- NA
df <- rbind(train, test)
rm(train, test)

Function inspect_na

inspect_na() counts the number of NAs in each column and sorts the columns in descending order. In the following operation, inspect_na() returned the top 5 columns with missing values. If you want to see the number of missing values in every column, leave the parameter top at its default. As expected, only SalePrice contained missing values, and their count equaled the number of rows in the test set.

inspect_na(df, top = 5)
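For readers without the package at hand, the counting that inspect_na() performs can be approximated in base R. This is a sketch of the idea on a toy data frame, not the package's implementation:

```r
# count NAs per column, then sort descending and take the top few
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 3), c = 1:3)
na_counts <- sort(colSums(is.na(toy)), decreasing = TRUE)
head(na_counts, 5)  # b has 2 NAs, a has 1, c has 0
```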

The NAs in the columns listed in NAisNoA were the not-missing "NA"s referred to above. In these columns, "NA" had only one possible meaning: "not applicable". I replaced these NAs with NoA to prevent imputing them later.

# in the 'NAisNoA' columns, NA means this attribute doesn't apply to them, not missing.
NAisNoA <- c('Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 
             'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 
             'PoolQC', 'Fence', 'MiscFeature')
for(i in NAisNoA){levels(df[, i])[levels(df[, i]) == "NA"] <- "NoA"}

At this stage, I reconstructed the data frame df to inspect the true missing values.

We can see that only LotFrontage had about 20% missing values. The other columns had few to no missing values.

# write the dataset into a csv file then read this file back to df to reconstruct df
write.csv(df, file = 'data/data.csv', row.names = FALSE)
df <- read.csv('data/data.csv', na.strings = "NA", strip.white = TRUE)

# see which predictors have most NAs
inspect_na(df[, -ncol(df)], top = 25)

Function inspect_map

inspect_map() classifies the columns of a data frame. Before I further explain this function, I'd like to introduce the term 'scheme'. In package "cleandata", a scheme refers to the set of all possible values of an enumerator. The factor objects in R are enumerators.

Function inspect_map returns a list of factor_cols (list), factor_levels (list), num_cols (vector), char_cols (vector), ordered_cols (vector), and other_cols (vector).

In the following code, I specified that two factorial columns share the same scheme if their levels have at least 2 values in common, by setting the common parameter to 2. By default, common is 0, which means the levels of two factorial columns must be identical for them to share the same scheme.
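The common-level test behind the common parameter can be illustrated in base R. This is my sketch of the idea (assuming "at least common levels in common" semantics), not inspect_map()'s actual implementation:

```r
# two factor columns are treated as sharing a scheme if their level
# sets have at least `common` values in common
shares_scheme <- function(f1, f2, common = 2) {
  length(intersect(levels(f1), levels(f2))) >= common
}

qual1 <- factor(c("Gd", "TA"), levels = c("Po", "Fa", "TA", "Gd", "Ex"))
qual2 <- factor(c("Ex", "Fa"), levels = c("Fa", "TA", "Gd", "Ex"))
shares_scheme(qual1, qual2)  # TRUE: 4 levels in common
```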

# create a map for imputation and encoding
data_map <- inspect_map(df[, -ncol(df)], common = 2)
summary(data_map)

This dataset only had factorial and numeric columns. I unpacked data_map before heading to imputation and encoding.

factor_cols <- data_map$factor_cols
factor_levels <- data_map$factor_levels
num_cols <- data_map$num_cols
rm(data_map)

Imputation and Encoding

The functions for imputation and encoding keep track of your process by producing log files. This feature is disabled by default. To enable log files, I created a variable log_arg to store the list of arguments for sink(). In old versions, log_arg had to be assigned to the log parameter in every imputation or encoding function, which is still supported by this version.

# create a list of arguments for producing a log file
log_arg <- list(file = 'log.txt', append = TRUE, split = FALSE)

log_arg can be a list of any arguments for sink(). In this example, the log file was named "log.txt", new information was appended to the file, and the contents written to the log file weren't also printed to the standard output.

In this version, the log parameter by default searches the dynamic-scope parent environment for a list named log_arg and takes its value. If log is explicitly assigned a list, it takes the assigned value. If there is no list log_arg in the parent environment and no list is assigned to log, no log file is produced.
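A minimal sketch of how a function can pick up log_arg from its caller's environment and feed it to sink() — an illustration of the idea only, not cleandata's exact internals (log_demo is a hypothetical function):

```r
log_demo <- function(msg, log = NULL) {
  # if no list was passed, look for `log_arg` starting from the caller's frame
  if (is.null(log)) log <- get0("log_arg", envir = parent.frame())
  if (is.list(log)) {
    do.call(sink, log)  # redirect printed output per the sink() arguments
    on.exit(sink())     # restore normal output when the function exits
  }
  cat(msg, "\n")
}

log_arg <- list(file = "log.txt", append = TRUE, split = FALSE)
log_demo("imputed 3 columns")  # the message goes to log.txt, not the console
```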

Imputation Functions

To prevent leakage, I instructed the imputation functions to use only rows of the train set to calculate the imputation values by passing an index to parameter idx.

Function impute_mode, impute_median, impute_mean

impute_mode() works with numerical, string, and factorial columns. It imputes NAs with the modes of their corresponding columns.

impute_median() and impute_mean() only work with numerical columns. They impute NAs with the medians and means of their corresponding columns, respectively.
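The train-rows-only idea behind idx can be sketched in base R for a single string column. This is an illustration of the technique (impute_mode_sketch is my own helper), not the package's code:

```r
# impute NAs with the mode computed from the training rows only
impute_mode_sketch <- function(x, idx) {
  tab <- table(x[idx])                   # value frequencies in the training rows
  mode_val <- names(tab)[which.max(tab)] # most frequent value
  x[is.na(x)] <- mode_val
  x
}

x   <- c("a", "b", "a", NA, "b", "a")
idx <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)  # first 3 rows are "train"
impute_mode_sketch(x, idx)  # the NA becomes "a", the training-set mode
```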

# impute NAs in factorial columns by the mode of corresponding columns
lst <- unlist(factor_cols)
df <- impute_mode(df, cols = lst, idx = !is.na(df$SalePrice))

# impute NAs in numerical columns by the median of corresponding columns
lst <- num_cols
df <- impute_median(df, cols = lst, idx = !is.na(df$SalePrice))

# check the result
inspect_na(df[, -ncol(df)], top = 5)

Encoding Functions

Every encoding function prints a summary of the columns before and after encoding by default. The output of encode_ordinal() and encode_binary() is factorial by default. If you want numerical output, set the parameter out.int to TRUE after making sure there are no missing values in the input.

In this demo, I kept the encoded columns factorial because I intended to save the dataset into a csv file, which doesn't distinguish between factorial and numerical columns.

Encoding Ordinal Columns

In business datasets, we can often find ratings, which are ordinal and use similar schemes. Based on my experience, if many columns share the same scheme, they are likely to be ratings.

summary(factor_cols)

In our dataset, ExterQual and 9 other columns share the same scheme. After I checked their scheme and the description file, I was sure that they were ordinal.

factor_levels$ExterQual

"Po": poor, "Fa": fair, "TA": typical/average, "Gd": good, "Ex": excellent

Function encode_ordinal

encode_ordinal() encodes ordinal data into sequential integers by a given order. The argument passed to none is always encoded as 0. The 1st member of the vector passed to order is encoded as 1.
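The mapping just described can be sketched in base R. This is my illustration of the stated rule (encode_ordinal_sketch is a hypothetical helper), not the package's implementation:

```r
# encode an ordered set of levels as 1, 2, ..., with `none` mapped to 0
encode_ordinal_sketch <- function(x, order, none = NULL) {
  codes <- match(as.character(x), order)        # position in `order` -> 1, 2, ...
  if (!is.null(none)) codes[as.character(x) == none] <- 0L
  codes
}

x <- c("TA", "NoA", "Ex", "Po")
encode_ordinal_sketch(x, order = c("Po", "Fa", "TA", "Gd", "Ex"), none = "NoA")
# -> 3 0 5 1
```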

# encoding ordinal columns
i <- 'ExterQual'; lst <- c('Po', 'Fa', 'TA', 'Gd', 'Ex')
df[, factor_cols[[i]]] <- encode_ordinal(df[, factor_cols[[i]]], order = lst, none = 'NoA')

# removing encoded columns from the map
factor_levels[[i]] <- NULL
factor_cols[[i]] <- NULL

The Utilities column was binary according to the dataset.

factor_levels$Utilities
levels(df$Utilities)

However, the description file indicates that it has 4 possible values: 'ELO', 'NoSeWa', 'NoSewr', 'AllPub'. Therefore, I encoded it as having 4 levels.

# in dataset only "AllPub" "NoSeWa", with 2 NAs
i <- 'Utilities'; lst <- c('ELO', 'NoSeWa', 'NoSewr', 'AllPub')
df[, factor_cols[[i]]] <- encode_ordinal(df[, factor_cols[[i]], drop=FALSE], order = lst)
factor_levels[[i]] <- NULL
factor_cols[[i]] <- NULL

Encoding Binary Columns

# find all the 2-level columns
lst <- names(factor_levels)[sapply(factor_levels, length) == 2]
cat(lst)

Function encode_binary

encode_binary() encodes binary data into 0 and 1, regardless of order.
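For a single two-level factor, the 0/1 encoding amounts to the following base-R one-liner (a sketch of the technique, not encode_binary()'s code):

```r
f <- factor(c("N", "Y", "Y", "N"))
as.integer(f) - 1L  # first level "N" -> 0, second level "Y" -> 1, so: 0 1 1 0
```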

# encode all the 2-level columns
i <- unlist(factor_cols[lst])
df[, i] <- encode_binary(df[, i, drop=FALSE])
factor_levels[lst] <- NULL
factor_cols[lst] <- NULL

Encoding Other Categorical Columns

Although we might have found more ordinal columns, I wanted to speed up the process, so I assumed that all the remaining categorical columns were unordered.

Function encode_onehot

encode_onehot() encodes categorical data by One-hot encoding.
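One-hot encoding itself can be reproduced in base R with model.matrix() — a sketch of the technique, not necessarily encode_onehot()'s output format:

```r
d <- data.frame(Street = factor(c("Grvl", "Pave", "Pave")))
# `~ Street - 1` drops the intercept so every level gets its own 0/1 column
model.matrix(~ Street - 1, data = d)  # columns StreetGrvl, StreetPave
```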

# encode all the other categorical columns
i <- unlist(factor_cols)
df0 <- encode_onehot(df[, i, drop=FALSE])
df[, i] <- NULL
df <- cbind(df, df0)

Partitioning

Function partition_random

partition_random() partitions a dataset randomly.
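The idea of a random partition can be sketched in base R with sample(). This illustrates random splitting only; the train and test parameters of partition_random() have the package's own semantics, which this sketch does not reproduce:

```r
set.seed(1)
n <- 100
train_idx <- sample(n, 0.8 * n)           # e.g. an 80/20 random split
test_idx  <- setdiff(seq_len(n), train_idx)
c(train = length(train_idx), test = length(test_idx))  # 80 and 20
```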

# partition the dataset
df0 <- partition_random(df[!is.na(df$SalePrice),], train = 8, test = FALSE)

The Log File

Let's check the log file at the end.

cat(paste(readLines('log.txt'), collapse = '\n'))

=== end ===



sherrisherry/cleandata documentation built on May 7, 2019, 5:02 a.m.