knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(eaziReda)

Introduction to eaziReda

Exploratory data analysis (EDA) and data preprocessing are two crucial steps in the data analysis workflow. To be specific, the users usually need to do the following tasks:

The eaziReda that wraps all of those tasks into several convenient functions that allow the users to quickly and easily carry out EDA along with some simple preprocessing:

Data: mtcars

To illustrate how different functions of the eaziReda package work, we will use the mtcars dataset, which is included in the base R datasets. Also, in order to show the full picture of functions, we choose to modify the mtcars dataset and create a new dataset called mtcars_new by making the data more complicated and unclean. The modifications are shown as below:

Change the types of two columns to categorical:

mtcars_new <- mtcars %>% 
  dplyr::mutate(cyl = as.factor(cyl), gear = as.factor(gear))

Create some missing values:

  mtcars_new$mpg[2:5] <- NA
  mtcars_new$cyl[2:5] <- NA

Create some outliers:

mtcars_new$hp[2:5] <- 500

After simple modification, the first 10 rows of mtcars_new are shown as follows:

head(mtcars_new, 10)

Identifying Missing Values with missing_detect()

missing_detect() can be used to detect the missing values of a given dataset. There is only one argument for this function, which can be either base-R dataframe or tibble. The output is a table that lists the number of missing values (n_missing) and the percentage of missing values for each column (percent).

For example, we can assess the missing values in the mtcars_new.

missing_detect(mtcars_new)

Imputing Missing Values with missing_impute()

missing_detect() can be used to impute the missing values all together for a given dataset. The first argument is the dataset, which can be either base-R dataframe or tibble. The second argument is the method used for imputing numerical missing values (one of "mean", "median", "drop"). The third argument is the method used for imputing non-numerical missing values (one of "most_frequent", "drop"). The output is an imputed version of the input dataset.

For example, we can impute the mtcars_new by filling the NAs by column mean if NA appears in numerical column, and filling the NAs by most frequent values otherwise. The table below shows the first 10 rows of the imputed mtcars_new.

mtcars_new_imputed <- missing_impute(mtcars_new, method_num = "mean", method_non_num = "most_frequent")
head(mtcars_new_imputed, 10)

Detecting the Outliers with outliers_detect()

outlier_detect() can be used to detect the outliers for a given vector/column. The first argument is the vector that the user would like to detect outliers for. The second argument is the method to detect outliers.(one of "zscore", "iqr", "iforest"). The output is a boolean vector with indices marked TRUE for the outliers.

For example, we can assess the outliers for the column hp based on the rule of interquartile range (IQR).

outliers_detect(mtcars_new$hp, method = "iqr")

Removing the Outliers with remove_outliers()

remove_outliers() can be used to remove the outliers for a given vector/column. The first argument is the vector that the user would like to delete the outliers for. The second argument is a boolean vector with outliers marked true. The user can directly pass the results of outliers_detect() as the second argument and remove the corresponding outliers easily. The output is a vector with outliers deleted from the input vector.

For example, we can remove the outliers that we identified before for the column hp.

remove_outliers(mtcars_new$hp, outliers_detect(mtcars_new$hp, method = "iqr"))

Plotting Different Features with histograms()

histograms can be used to create plots given a list of different kinds of features (columns). The first argument is the dataset, which can be either base-R dataframe or tibble. The second argument is a character vector that contains all the column names that the user would like to plot for. The third argument is an integer that indicates the number of columns in faceted chart produced. The output is a faceted chart that includes one plot each column the user specifies in the second argument. For numerical columns, the corresponding output plots will be the histograms. For categorical columns, the corresponding output plot will be the bar charts.

For example, we can plot and explore the columns: mpg, disp, cyl, and gear together.

histograms(mtcars_new, features=c("mpg", "disp", "cyl", "gear"), num_cols = 2)

Plotting Correlations with corr_plot()

corr_plot() can be used to create a pairwise correlation plot given a list of numerical features (columns). The first argument is the dataset, which can be either base-R dataframe or tibble. The second argument is a character vector that contains all the names of the numerical columns that the user would like to plot for. The third argument is the method to calculate the correlation (one of "pearson", "spearman" or "kendall"). The output is a correlation heatmap showing the correlation between the specified columns.

For example, we can explore the pearson correlation coefficients between the columns: mpg, disp, and hp.

corr_plot(mtcars_new, features=c("mpg", "disp","hp"), method = 'pearson')

Key Advantages



UBC-MDS/eaziReda documentation built on March 24, 2021, 2:22 a.m.