knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(eaziReda)
Exploratory data analysis (EDA) and data preprocessing are two crucial steps in the data analysis workflow. To be specific, the users usually need to do the following tasks:
checking missing values - detecting outliers
plotting correlation plots between features
plotting histograms/bar plots for each individual features
imputation
dealing with outliers
The eaziReda that wraps all of those tasks into several convenient functions that allow the users to quickly and easily carry out EDA along with some simple preprocessing:
It can report the number/percentage of missing values for all columns in a given dataframe as well as giving the user an option of imputing the missing values all together based on the imputation methods the user specifies.
It can identify all outliers for a given column/vector based on the methods the user specifies as well as giving the user an option of removing the outliers all together.
It can automatically generate a correlation plot based on user's choice of features.
It can automatically generate histograms and bar plots based on user's choice of features.
To illustrate how different functions of the eaziReda
package work, we
will use the mtcars
dataset, which is included in the base R datasets.
Also, in order to show the full picture of functions, we choose to
modify the mtcars
dataset and create a new dataset called mtcars_new
by making the data more complicated and unclean. The modifications are
shown as below:
mtcars_new <- mtcars %>% dplyr::mutate(cyl = as.factor(cyl), gear = as.factor(gear))
mtcars_new$mpg[2:5] <- NA mtcars_new$cyl[2:5] <- NA
mtcars_new$hp[2:5] <- 500
After simple modification, the first 10 rows of mtcars_new
are shown
as follows:
head(mtcars_new, 10)
missing_detect()
missing_detect()
can be used to detect the missing values of a given
dataset. There is only one argument for this function, which can be
either base-R dataframe or tibble. The output is a table that lists
the number of missing values (n_missing
) and the percentage of missing
values for each column (percent
).
For example, we can assess the missing values in the mtcars_new
.
missing_detect(mtcars_new)
missing_impute()
missing_detect()
can be used to impute the missing values all together
for a given dataset. The first argument is the dataset, which can be
either base-R dataframe or tibble. The second argument is the method
used for imputing numerical missing values (one of "mean", "median",
"drop"). The third argument is the method used for imputing
non-numerical missing values (one of "most_frequent", "drop"). The
output is an imputed version of the input dataset.
For example, we can impute the mtcars_new
by filling the NAs by column
mean if NA appears in numerical column, and filling the NAs by most
frequent values otherwise. The table below shows the first 10 rows of
the imputed mtcars_new
.
mtcars_new_imputed <- missing_impute(mtcars_new, method_num = "mean", method_non_num = "most_frequent") head(mtcars_new_imputed, 10)
outliers_detect()
outlier_detect()
can be used to detect the outliers for a given
vector/column. The first argument is the vector that the user would like
to detect outliers for. The second argument is the method to detect
outliers.(one of "zscore", "iqr", "iforest"). The output is a boolean
vector with indices marked TRUE for the outliers.
For example, we can assess the outliers for the column hp
based on the
rule of interquartile range (IQR).
outliers_detect(mtcars_new$hp, method = "iqr")
remove_outliers()
remove_outliers()
can be used to remove the outliers for a given
vector/column. The first argument is the vector that the user would like
to delete the outliers for. The second argument is a boolean vector with
outliers marked true. The user can directly pass the results of
outliers_detect() as the second argument and remove the corresponding
outliers easily. The output is a vector with outliers deleted from the
input vector.
For example, we can remove the outliers that we identified before for
the column hp
.
remove_outliers(mtcars_new$hp, outliers_detect(mtcars_new$hp, method = "iqr"))
histograms()
histograms
can be used to create plots given a list of different kinds
of features (columns). The first argument is the dataset, which can be
either base-R dataframe or tibble. The second argument is a character
vector that contains all the column names that the user would like to
plot for. The third argument is an integer that indicates the number of
columns in faceted chart produced. The output is a faceted chart that
includes one plot each column the user specifies in the second argument.
For numerical columns, the corresponding output plots will be the
histograms. For categorical columns, the corresponding output plot will
be the bar charts.
For example, we can plot and explore the columns: mpg
, disp
, cyl
,
and gear
together.
histograms(mtcars_new, features=c("mpg", "disp", "cyl", "gear"), num_cols = 2)
corr_plot()
corr_plot()
can be used to create a pairwise correlation plot given a
list of numerical features (columns). The first argument is the dataset,
which can be either base-R dataframe or tibble. The second argument is a
character vector that contains all the names of the numerical columns
that the user would like to plot for. The third argument is the method
to calculate the correlation (one of "pearson", "spearman" or
"kendall"). The output is a correlation heatmap showing the
correlation between the specified columns.
For example, we can explore the pearson correlation coefficients between
the columns: mpg
, disp
, and hp
.
corr_plot(mtcars_new, features=c("mpg", "disp","hp"), method = 'pearson')
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.