In UBC-MDS/eaziReda: A Quick And Easy Way To Do EDA And Preprocessing

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(eaziReda)

Introduction to eaziReda

Exploratory data analysis (EDA) and data preprocessing are two crucial steps in the data analysis workflow. In practice, users usually need to carry out the following tasks:

checking missing values
detecting outliers
plotting correlation plots between features
plotting histograms/bar plots for each individual feature
imputation
dealing with outliers

The eaziReda package wraps all of these tasks into several convenient functions that let users carry out EDA and some simple preprocessing quickly and easily:

It can report the number and percentage of missing values for every column in a given data frame, and it gives the user the option of imputing all missing values at once using the imputation methods the user specifies.
It can identify all outliers in a given column/vector using the method the user specifies, and it gives the user the option of removing all of those outliers at once.
It can automatically generate a correlation plot based on the user's choice of features.
It can automatically generate histograms and bar plots based on the user's choice of features.

Data: mtcars

To illustrate how the functions of the eaziReda package work, we will use the mtcars dataset, which ships with base R in the datasets package. To show the full range of the functions, we modify mtcars to make it messier and store the result in a new dataset called mtcars_new. The modifications are shown below:

Change the types of two columns to categorical:

# Call mutate() directly so the chunk does not depend on the pipe operator being attached
mtcars_new <- dplyr::mutate(mtcars, cyl = as.factor(cyl), gear = as.factor(gear))

Create some missing values:

mtcars_new$mpg[2:5] <- NA
mtcars_new$cyl[2:5] <- NA

Create some outliers:

mtcars_new$hp[2:5] <- 500

After these modifications, the first 10 rows of mtcars_new are as follows:

head(mtcars_new, 10)

Identifying Missing Values with `missing_detect()`

missing_detect() detects the missing values in a given dataset. It takes a single argument, which can be either a base R data frame or a tibble. The output is a table that lists, for each column, the number of missing values (n_missing) and the percentage of missing values (percent).

For example, we can assess the missing values in mtcars_new.

missing_detect(mtcars_new)
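To make the idea behind this report concrete, here is a minimal base R sketch of the same kind of per-column summary. It is not the package's implementation: the helper object missing_summary is a hypothetical name, and reporting percent on a 0 to 100 scale is an assumption, since the scale used by missing_detect() is not stated here.

```r
# Rebuild the modified dataset used in this vignette (base R only).
mtcars_new <- transform(mtcars, cyl = as.factor(cyl), gear = as.factor(gear))
mtcars_new$mpg[2:5] <- NA
mtcars_new$cyl[2:5] <- NA

# Count NAs per column and express them as a share of all rows.
# Assumption: percent is on a 0-100 scale.
n_missing <- colSums(is.na(mtcars_new))
missing_summary <- data.frame(
  column = names(n_missing),
  n_missing = as.integer(n_missing),
  percent = 100 * as.numeric(n_missing) / nrow(mtcars_new)
)

# Show only the columns that actually contain missing values.
missing_summary[missing_summary$n_missing > 0, ]
```

In this sketch only mpg and cyl are listed, each with 4 missing values out of 32 rows.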

Imputing Missing Values with `missing_impute()`

missing_impute() imputes all missing values in a given dataset at once. The first argument is the dataset, which can be either a base R data frame or a tibble. The second argument is the method used for imputing missing values in numerical columns (one of "mean", "median", "drop"). The third argument is the method used for imputing missing values in non-numerical columns (one of "most_frequent", "drop"). The output is an imputed version of the input dataset.

For example, we can impute mtcars_new by filling NAs with the column mean when they appear in a numerical column, and with the most frequent value otherwise. The table below shows the first 10 rows of the imputed mtcars_new.

mtcars_new_imputed <- missing_impute(mtcars_new, method_num = "mean", method_non_num = "most_frequent")
head(mtcars_new_imputed, 10)
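For readers who want to see what "mean" and "most_frequent" imputation amount to, the following base R sketch applies both rules by hand. The function impute_sketch() is a hypothetical helper written for this illustration, not part of eaziReda, and the package may handle ties or edge cases differently.

```r
# Rebuild the modified dataset used in this vignette (base R only).
mtcars_new <- transform(mtcars, cyl = as.factor(cyl), gear = as.factor(gear))
mtcars_new$mpg[2:5] <- NA
mtcars_new$cyl[2:5] <- NA

# Hypothetical helper: mean for numeric columns, most frequent level otherwise.
impute_sketch <- function(df) {
  for (col in names(df)) {
    x <- df[[col]]
    if (!anyNA(x)) next
    if (is.numeric(x)) {
      x[is.na(x)] <- mean(x, na.rm = TRUE)
    } else {
      counts <- table(x)  # table() ignores NA by default
      x[is.na(x)] <- names(counts)[which.max(counts)]
    }
    df[[col]] <- x
  }
  df
}

imputed <- impute_sketch(mtcars_new)
head(imputed[, c("mpg", "cyl")], 5)
```

After this step neither column contains NAs: the missing mpg entries take the mean of the observed mpg values, and the missing cyl entries take the most common observed level.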

Detecting the Outliers with `outliers_detect()`

outliers_detect() detects the outliers in a given vector/column. The first argument is the vector in which the user would like to detect outliers. The second argument is the detection method (one of "zscore", "iqr", "iforest"). The output is a logical vector in which outliers are marked TRUE.

For example, we can assess the outliers in the column hp based on the interquartile range (IQR) rule.

outliers_detect(mtcars_new$hp, method = "iqr")
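As a rough illustration of an IQR rule, the sketch below flags values that lie more than 1.5 times the IQR beyond the first or third quartile. The 1.5 multiplier is the conventional choice and is an assumption here; the threshold that outliers_detect() actually uses is not stated in this vignette.

```r
# Recreate the hp column with the four artificial outliers from this vignette.
hp <- mtcars$hp
hp[2:5] <- 500

# Conventional IQR fences (assumption: multiplier of 1.5).
q <- unname(quantile(hp, c(0.25, 0.75), na.rm = TRUE))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- hp < q[1] - fence | hp > q[2] + fence

# Positions of the flagged values.
which(is_outlier)
```

Under these assumptions the flagged positions are exactly the four entries that were set to 500.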

Removing the Outliers with `remove_outliers()`

remove_outliers() removes the outliers from a given vector/column. The first argument is the vector from which the user would like to remove outliers. The second argument is a logical vector in which outliers are marked TRUE. The user can pass the result of outliers_detect() directly as the second argument to remove the corresponding outliers. The output is the input vector with the outliers removed.

For example, we can remove the outliers that we identified before for the column hp.

remove_outliers(mtcars_new$hp, outliers_detect(mtcars_new$hp, method = "iqr"))
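Conceptually, removing flagged values is plain logical indexing. The sketch below shows that step in base R. The flag vector is built with a simple threshold (hp > 400) that only stands in for the output of outliers_detect(); that threshold is a hypothetical choice for this illustration.

```r
# Recreate the hp column with the four artificial outliers from this vignette.
hp <- mtcars$hp
hp[2:5] <- 500

# Stand-in for a detector's output: TRUE marks an outlier (hypothetical rule).
flags <- hp > 400

# Keep only the values that were not flagged.
hp_clean <- hp[!flags]
length(hp_clean)
```

The cleaned vector is shorter than the input, so it can no longer be assigned back into the original data frame column without also dropping the matching rows.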

Plotting Different Features with `histograms()`

histograms() creates plots for a list of features (columns) of different types. The first argument is the dataset, which can be either a base R data frame or a tibble. The second argument is a character vector containing the names of all the columns the user would like to plot. The third argument is an integer giving the number of columns in the faceted chart. The output is a faceted chart with one plot for each column specified in the second argument. For numerical columns, the corresponding plots are histograms. For categorical columns, the corresponding plots are bar charts.

For example, we can plot and explore the columns: mpg, disp, cyl, and gear together.

histograms(mtcars_new, features = c("mpg", "disp", "cyl", "gear"), num_cols = 2)
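The dispatch described above (histogram for numerical columns, bar chart for categorical ones) can be sketched with base graphics. plot_features_sketch() is a hypothetical helper for illustration only; the package's own plots may be built with a different plotting system and will look different.

```r
# Rebuild the modified dataset used in this vignette (base R only).
mtcars_new <- transform(mtcars, cyl = as.factor(cyl), gear = as.factor(gear))
mtcars_new$mpg[2:5] <- NA
mtcars_new$cyl[2:5] <- NA

# Hypothetical helper: one panel per feature, laid out in num_cols columns.
plot_features_sketch <- function(df, features, num_cols = 2) {
  num_rows <- ceiling(length(features) / num_cols)
  old_par <- par(mfrow = c(num_rows, num_cols))
  on.exit(par(old_par))
  kinds <- character(0)
  for (feature in features) {
    x <- df[[feature]]
    if (is.numeric(x)) {
      hist(x, main = feature, xlab = feature)            # numerical column
      kinds[feature] <- "histogram"
    } else {
      barplot(table(x), main = feature, xlab = feature)  # categorical column
      kinds[feature] <- "bar chart"
    }
  }
  invisible(kinds)  # record which plot type each feature received
}

kinds <- plot_features_sketch(mtcars_new, c("mpg", "disp", "cyl", "gear"), num_cols = 2)
kinds
```

Returning the plot type chosen for each feature makes it easy to check that factor columns such as cyl and gear were routed to bar charts.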

Plotting Correlations with `corr_plot()`

corr_plot() creates a pairwise correlation plot for a list of numerical features (columns). The first argument is the dataset, which can be either a base R data frame or a tibble. The second argument is a character vector containing the names of the numerical columns the user would like to plot. The third argument is the method used to calculate the correlation (one of "pearson", "spearman" or "kendall"). The output is a correlation heatmap showing the correlation between the specified columns.

For example, we can explore the Pearson correlation coefficients between the columns mpg, disp, and hp.

corr_plot(mtcars_new, features = c("mpg", "disp", "hp"), method = "pearson")
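The numbers behind such a heatmap can be inspected directly with base R's cor(). This is only a sketch of the underlying calculation: using pairwise complete observations to cope with the NAs in mpg is an assumption, and corr_plot() may treat missing values differently.

```r
# Rebuild the numeric part of the modified dataset used in this vignette.
mtcars_new <- mtcars
mtcars_new$mpg[2:5] <- NA
mtcars_new$hp[2:5] <- 500

# Pairwise Pearson correlations between the selected columns.
# Assumption: rows with NA are dropped pair by pair.
features <- c("mpg", "disp", "hp")
corr_matrix <- cor(
  mtcars_new[, features],
  method = "pearson",
  use = "pairwise.complete.obs"
)
round(corr_matrix, 2)
```

Swapping method for "spearman" or "kendall" gives the rank-based alternatives accepted by cor().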

Key Advantages

This package provides functions for both EDA and data preprocessing, so after EDA the user may not need to turn to other packages for these preprocessing tasks.
The plotting functions accept a list of features, so the user does not need to call the same plotting function repeatedly, and the resulting plots are already arranged in a single chart.
The functions offer flexibility through their method and layout arguments.

UBC-MDS/eaziReda documentation built on March 24, 2021, 2:22 a.m.
