In UBC-MDS/slimreda: Exploratory Data Analysis

slimreda

Exploratory Data Analysis (EDA) is an important preparatory step that helps data scientists understand and clean up data sets before machine learning begins. However, this step also involves a lot of repetitive tasks. slimreda helps data scientists quickly complete the initial work of EDA and gain a preliminary understanding of the data.

slimreda focuses on unique value counts and missing value counts, as well as on graphs such as histograms and correlation maps. The results are returned as tables or chart objects, so users can flexibly reuse them when reporting their EDA.

Function Specification

The package is under development and includes the following functions:

histogram : This function accepts a dataframe and builds histograms for all numeric columns which are returned as an array of chart objects.

corr_map : This function accepts a dataframe and builds a correlation heat map for all numeric columns, which is returned as a chart object.

cat_unique_count : This function accepts a dataframe and returns a table of unique value counts for all categorical columns.

miss_count : This function accepts a dataframe and returns a table of counts of missing values in all columns.

Limitations: We only consider numeric and categorical columns in our package.

Installation

You can install the released version of slimreda (after Milestone 4 is done) from CRAN with:

install.packages("slimreda")

And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("UBC-MDS/slimreda")

Usage

To import the package:

library(slimreda)

For each function:

histogram:
- Suppose you would like to plot the distribution of certain columns in your data frame as histograms. Instead of writing multiple code chunks with duplicate ggplot code, you can use the histogram function to plot histograms for as many columns as you like.
- In the example below, we generate two histograms for two columns in the penguins data frame, namely body_mass_g and flipper_length_mm. We use plot_grid to render these plots on the same row, but you can plot them directly:

library(palmerpenguins)
library(cowplot)

hist_plots <- slimreda::histogram(penguins, c('body_mass_g', 'flipper_length_mm'))

cowplot::plot_grid(plotlist = hist_plots, nrow = 1)
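
For comparison, the repetitive code that histogram is meant to replace might look roughly like the sketch below. This is only an illustration, not the package's implementation: the helper name plot_hists and its bins argument are hypothetical, it assumes ggplot2 is installed, and it uses the built-in iris data instead of penguins.

```r
library(ggplot2)

# Hypothetical helper (not part of slimreda): build one histogram per
# requested column and return the plots as a named list.
plot_hists <- function(df, columns, bins = 30) {
  stopifnot(is.data.frame(df), all(columns %in% names(df)))
  plots <- lapply(columns, function(col) {
    ggplot(df, aes(x = .data[[col]])) +
      geom_histogram(bins = bins) +
      labs(x = col, y = "count")
  })
  names(plots) <- columns
  plots
}

# Two histograms from the built-in iris data set
hists <- plot_hists(iris, c("Sepal.Length", "Petal.Width"))
```

A list of plots like this can be passed to cowplot::plot_grid(plotlist = ...) in the same way as in the example above.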

miss_count:
- This function reports the number of missing values in each column of a data frame, along with the corresponding percentage. It takes two parameters: df is the data frame you want to analyze, and ascending is a boolean that decides whether the result is sorted in ascending or descending order.
- Below is an example for this function:

example_miss_count <- data.frame(
  name = c(NA, NA, "Jessica"),
  age = c(NA, 21, 30),
  hobby = c("lab", "quiz", "swim")
)

output <- slimreda::miss_count(example_miss_count,
                               ascending = TRUE)

output
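
To make the description concrete, here is a rough base-R sketch of the kind of computation described above: count the NA values per column, convert to a percentage, and sort. The helper name count_missing and the output column names (column, count, percent) are assumptions made for illustration; the actual output format of miss_count may differ.

```r
# Hypothetical stand-in for illustration only; not slimreda's implementation.
count_missing <- function(df, ascending = TRUE) {
  stopifnot(is.data.frame(df))
  # is.na() on a data frame gives a logical matrix; sum it column by column
  n_missing <- colSums(is.na(df))
  result <- data.frame(
    column = names(n_missing),
    count = as.integer(n_missing),
    percent = round(100 * as.numeric(n_missing) / nrow(df), 2),
    stringsAsFactors = FALSE
  )
  # Sort by the number of missing values in the requested direction
  result <- result[order(result$count, decreasing = !ascending), ]
  rownames(result) <- NULL
  result
}

example <- data.frame(
  name = c(NA, NA, "Jessica"),
  age = c(NA, 21, 30),
  hobby = c("lab", "quiz", "swim")
)

missing_summary <- count_missing(example, ascending = TRUE)
missing_summary
```

With this example data, hobby has no missing values, age has one, and name has two, so with ascending = TRUE the rows come out in that order.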

cat_unique_count:
- The cat_unique_count function comes in handy when you want to know how many unique values each categorical column in your data frame has. Instead of repeating the same line of code and editing only the column name, you get all the categorical features and their unique value counts returned as a single data frame.
- In the example below, we generate the unique value counts for all categorical features in the penguins data frame, namely species, island and sex. We use knitr::kable to render the data frame into a table:

unique_cat_df <- slimreda::cat_unique_count(penguins)

knitr::kable(unique_cat_df, "simple")
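
As a rough idea of what such a count involves, the base-R sketch below selects the factor and character columns and counts their distinct values. The helper name count_unique_categories, the output column names, and the choice to ignore NA when counting are all assumptions for illustration; cat_unique_count itself may behave differently.

```r
# Hypothetical helper for illustration only; not slimreda's implementation.
count_unique_categories <- function(df) {
  stopifnot(is.data.frame(df))
  # Treat factor and character columns as categorical
  is_categorical <- vapply(
    df,
    function(x) is.factor(x) || is.character(x),
    logical(1)
  )
  cat_cols <- names(df)[is_categorical]
  data.frame(
    column = cat_cols,
    # Assumption: NA is not counted as a category
    unique_values = vapply(
      cat_cols,
      function(col) length(unique(stats::na.omit(df[[col]]))),
      integer(1),
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}

toy <- data.frame(
  species = c("Adelie", "Gentoo", "Adelie", NA),
  island = factor(c("Biscoe", "Dream", "Dream", "Torgersen")),
  mass = c(3750, 5000, 3800, 4100)
)

unique_summary <- count_unique_categories(toy)
unique_summary
```

Here the numeric mass column is skipped, species has two distinct non-missing values, and island has three.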

corr_map:
- Now suppose you would like to see the pairwise correlation strength between columns in your data frame as a correlation map. Instead of writing multiple code chunks with duplicate ggplot code, you can use the corr_map function.
- In the example below, we generate a simple correlation map for all the numeric columns in the penguins data frame. The color indicates the correlation, which ranges from -1 to 1, and the output is a ggplot object that can be modified later:

corr_map_plot <- slimreda::corr_map(penguins, colnames(penguins))

corr_map_plot
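
The numbers behind a map like this can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper name numeric_correlations is hypothetical, Pearson correlation with pairwise deletion of missing values is an assumed choice, and the built-in iris data stands in for penguins. It is not how corr_map is necessarily implemented.

```r
# Hypothetical helper for illustration only; not slimreda's implementation.
numeric_correlations <- function(df) {
  stopifnot(is.data.frame(df))
  # Keep only the numeric columns
  numeric_df <- df[vapply(df, is.numeric, logical(1))]
  stopifnot(ncol(numeric_df) >= 2)
  # Pearson correlation, using all complete pairs for each column pair
  corr <- stats::cor(numeric_df, use = "pairwise.complete.obs")
  # Reshape to long format: one row per pair of variables,
  # which is the shape a tile-based heat map usually expects
  long <- as.data.frame(as.table(corr), stringsAsFactors = FALSE)
  names(long) <- c("var1", "var2", "correlation")
  long
}

corr_long <- numeric_correlations(iris)
head(corr_long)
```

A long table like corr_long could then be drawn with, for example, ggplot2's geom_tile, mapping correlation to the fill color.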

Fitting in R Ecosystem

A package with similar functions is DataExplorer (https://cran.r-project.org/web/packages/DataExplorer/vignettes/dataexplorer-intro.html).
What slimreda adds:
- We gather the functions needed for EDA, which otherwise require several packages, into one package and simplify the code. For example, for missing values we return not only the counts but also the corresponding percentages.
- We format the output to be clearer.
- Compared with DataExplorer, we generate the most commonly used graphs in an easy and flexible way.

Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.