knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(projectPackage)

Introduction to projectPackage

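The goal of projectPackage is to provide simple functions for preprocessing data ahead of exploratory data analysis. The package makes the following steps quick and easy: dropping unwanted columns with data_cleaning(), plotting a ggpairs correlation matrix with correlation_graph(), and creating a model recipe that scales and centers the predictors with recipe_scale_center().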

This document introduces you to all of the tools mentioned above, with examples of how to apply them to data frames in different real-world scenarios.

Data: mtcars

For demonstration, we will use a basic data set called mtcars. It has 32 rows and 11 columns and is documented in ?mtcars. Note that mtcars is a data frame; its first 6 rows are shown below.

head(mtcars)
data <- mtcars
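
As a quick sanity check, the dimensions quoted above can be confirmed with base R. This sketch uses only functions that ship with R; nothing in it comes from projectPackage.

```r
# mtcars ships with R (datasets package), so no extra packages are needed.
data <- mtcars

dim(data)            # number of rows, then number of columns
is.data.frame(data)  # confirms the object is a data frame
names(data)          # column names used in the examples that follow
```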

Demonstration of basic functions in projectPackage

1. Dropping columns with data_cleaning()

Often when you work with large data sets, there will be columns that are redundant or not of interest. In that case, you can use our function data_cleaning() to drop those columns. For details on how to use the function, check out help(data_cleaning).

head(data_cleaning(data, "mpg"))

As you can see above, the column mpg has been dropped from the data frame.

head(data_cleaning(data, c("mpg","disp","qsec")))

Again, columns mpg, disp, and qsec have been dropped.
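
For readers curious about what such a helper does, the same result can be obtained with base R subsetting. The function below, drop_columns_sketch(), is a hypothetical stand-in written for this illustration. It is not the package's implementation of data_cleaning(), which may differ.

```r
# Hypothetical stand-in for illustration only; not the source of data_cleaning().
drop_columns_sketch <- function(df, cols) {
  # Keep every column whose name is not listed in `cols`.
  # drop = FALSE keeps the result a data frame even if one column remains.
  df[, !(names(df) %in% cols), drop = FALSE]
}

reduced <- drop_columns_sketch(mtcars, c("mpg", "disp", "qsec"))
names(reduced)  # the remaining column names
ncol(reduced)   # 11 original columns minus the 3 dropped
```

Selecting by name with %in% rather than by position keeps the call readable and avoids mistakes when the column order changes.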

If you accidentally pass a column name that does not exist, such as mpf instead of mpg, the function stops with an error.

data_cleaning(data, "mpf") will produce the following message: "Error in data_cleaning(data, "mpf") : column name does not exist in the data frame"
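
A check like this is commonly written by comparing the requested names against names(df) and calling stop(). The sketch below is an assumption about how such validation could look, using a hypothetical function checked_drop_sketch(). The error text is copied from the message quoted above, but the real data_cleaning() source is not shown in this document. The tryCatch() call shows how a caller can capture the error instead of halting a script.

```r
# Hypothetical sketch; the real data_cleaning() may validate differently.
checked_drop_sketch <- function(df, cols) {
  # Find any requested names that are not columns of df.
  missing_cols <- setdiff(cols, names(df))
  if (length(missing_cols) > 0) {
    stop("column name does not exist in the data frame")
  }
  df[, !(names(df) %in% cols), drop = FALSE]
}

# Capture the error message rather than stopping the script.
result <- tryCatch(
  checked_drop_sketch(mtcars, "mpf"),
  error = function(e) conditionMessage(e)
)
result
```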

2. Creating a ggpairs correlation plot using correlation_graph()

Visualizations are often needed to better understand raw or modified data. Data visualization helps break down complex problems by transforming data into a more understandable format and showing trends and outliers. A good visualization tells a story by reducing noise in the data and emphasizing the most important facts.

GGally's ggpairs() function can create a correlation matrix for us. However, the plot can be hard to read when there is a lot of data, and it can look a bit plain because it lacks colour. The correlation_graph() function addresses both problems.

For details on how to use the function, check out help(correlation_graph).

correlation_graph(data[1:3])
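
The numbers behind a plot like this can be checked with base R's cor(), and a plain ggpairs() call gives a baseline for comparison. This is only a sketch: it does not reproduce the styling of correlation_graph(), and the GGally call is guarded so that the block still runs when GGally is not installed.

```r
# Pairwise Pearson correlations for the same three columns (mpg, cyl, disp).
cor_matrix <- cor(mtcars[1:3])
round(cor_matrix, 2)

# Baseline ggpairs plot, built only if GGally is available.
if (requireNamespace("GGally", quietly = TRUE)) {
  baseline_plot <- GGally::ggpairs(mtcars, columns = 1:3)
  print(baseline_plot)
}
```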

3. Creating a recipe for a KNN model using recipe_scale_center()

To fit a model, we first create a recipe that specifies a formula and any additional preprocessing steps. More often than not, we also want to scale and center the data. The recipe_scale_center() function creates a recipe and scales and centers all the predictors in the given formula.

For example, we can create a recipe using mpg as the target and hp and cyl as the predictors.

recipe_scale_center(data, mpg ~ hp + cyl)
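
To make the effect of scaling and centering concrete, the sketch below standardizes the two predictors with base R's scale() and then, if the recipes package is installed, builds a comparable recipe by hand. The recipes calls are an assumption about what recipe_scale_center() wraps; its actual source is not shown here.

```r
# Base R: subtract each column's mean and divide by its standard deviation.
predictors <- mtcars[, c("hp", "cyl")]
standardized <- scale(predictors, center = TRUE, scale = TRUE)
round(colMeans(standardized), 10)  # each predictor now has mean 0
apply(standardized, 2, sd)         # and standard deviation 1

# Assumed equivalent built directly with recipes, if it is installed.
if (requireNamespace("recipes", quietly = TRUE)) {
  rec <- recipes::recipe(mpg ~ hp + cyl, data = mtcars)
  rec <- recipes::step_scale(rec, recipes::all_predictors())
  rec <- recipes::step_center(rec, recipes::all_predictors())
  print(rec)
}
```

Standardizing matters for distance-based models such as KNN, because a predictor measured on a large scale (hp) would otherwise dominate one measured on a small scale (cyl).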


DSCI-310/DSCI-310-Group-11-package documentation built on April 9, 2022, 12:32 a.m.