knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

After creating an object with as.MLinput(), the next phase of the peppuR pipeline applies common preprocessing steps:

  1. Handling missing values
  2. Correlation filtering
  3. Near-zero variance filtering
  4. Univariate feature selection

Since we have no missing data, we proceed to correlation filtering, which relies on Max Kuhn's caret package. In general, we use a correlation-matrix-based approach with the peppuR function univariate_feature_selection().
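To make the idea of correlation-matrix-based filtering concrete, here is a minimal base-R sketch. It is not peppuR's or caret's implementation: the choice of columns, the `cutoff` value, and the "drop the later feature of a correlated pair" rule are all assumptions made for illustration. caret's findCorrelation() applies a more careful version of the same idea.

```r
library(MASS)  # provides the birthwt data set

# A few numeric columns chosen purely for illustration (assumption)
numeric_features <- birthwt[, c("age", "lwt", "ptl", "ftv")]

# Pairwise absolute correlations between features
cor_matrix <- abs(cor(numeric_features))

# Keep only the upper triangle so each pair is considered once
cor_matrix[lower.tri(cor_matrix, diag = TRUE)] <- 0

# Hypothetical cutoff: flag a feature if its correlation with any
# earlier feature exceeds this value
cutoff <- 0.75
flagged <- colnames(cor_matrix)[apply(cor_matrix, 2, max) > cutoff]

# Drop the flagged features
filtered_features <- numeric_features[, setdiff(colnames(numeric_features), flagged),
                                      drop = FALSE]
```

Only features that are highly correlated with an earlier feature are dropped, so `filtered_features` always retains at least one member of each correlated group.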

library(peppuR)
library(MASS)
birthweight_data <- birthwt
birthweight_data$ID <- paste("ID", seq_len(nrow(birthweight_data)), sep = "_")
birthweight_data$low <- as.factor(birthweight_data$low)


# Make categorical columns factors
categorical_cols <- c("race", "smoke", "ht", "ui")
birthweight_data[categorical_cols] <- lapply(birthweight_data[categorical_cols], as.factor)

# Create an organized data object
single_source_peppuRobj <- as.MLinput(X = birthweight_data, Y = NULL, 
                                      meta_colnames = c("ID", "low"),
                                      categorical_features = TRUE,
                                      sample_cname = "ID", outcome_cname = "low")
single_source_peppuRobj <- univariate_feature_selection(single_source_peppuRobj)
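The internals of univariate_feature_selection() are not shown here, so the following is only a generic sketch of what univariate feature selection means: each feature is tested against the outcome on its own, and features are then ranked by the result. The Wilcoxon rank-sum test and the column subset are assumptions chosen for illustration, not necessarily what peppuR does.

```r
library(MASS)  # provides the birthwt data set

# Outcome: indicator of low birth weight
outcome <- factor(birthwt$low)

# A few numeric columns chosen purely for illustration (assumption)
numeric_cols <- c("age", "lwt", "ptl", "ftv")

# Test each feature separately against the outcome
p_values <- vapply(numeric_cols, function(col) {
  wilcox.test(
    x = birthwt[[col]][outcome == "0"],
    y = birthwt[[col]][outcome == "1"],
    exact = FALSE  # normal approximation; avoids warnings about ties
  )$p.value
}, numeric(1))

# Features ordered from strongest to weakest univariate association
ranked_features <- names(sort(p_values))
```

A selection step would then keep the features whose p-value falls below some chosen threshold, or simply the top few entries of `ranked_features`.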


pmartR/peppuR documentation built on Jan. 17, 2020, 12:54 p.m.