An R package that simplifies the main EDA procedures: outlier identification, data visualization, correlation analysis, and missing data imputation.
| Ofer Mansour | Suvarna Moharir | Subing Cao | Manuel Maldonado |
| :----------: | :-------------: | :--------: | :--------------: |
Data understanding and cleaning account for roughly 60% of the time a data scientist spends on a project. The goal of this package is to simplify that process and make more efficient use of time on some of the main EDA procedures (outlier identification, data visualization, correlation, and missing data imputation).
To start using our package, please follow these instructions:
1. Make sure `devtools` is installed on your computer. If not, open the console and input the following: `install.packages('devtools')`
2. Load `devtools` by inputting this command into the console: `library(devtools)`
3. Install the `redahelper` package by inputting this command into the console: `devtools::install_github("UBC-MDS/redahelper")`
| Function Name | Input | Output | Description |
| ------------- | ----- | ------ | ----------- |
| fast_outlier_id | 3 parameters: a dataframe, a list of columns to include in the analysis, and the method used to identify outliers ("Z-score algorithm" or "Interquartile Range") | Dataframe with the included columns, the outlier values identified, and the % of counts considered outliers for each analyzed column | Given a dataframe, the listed columns are searched for outlier values. Returns a dataframe summarizing the outlier values found and the % of counts affected by them |
| fast_plot | 4 parameters: dataframe, name of the x column, name of the y column, plot name | Plot object | Given a dataframe, the columns to use as x and y respectively, and the desired plot type, the function computes and returns the specified plot |
| fast_corr | 2 parameters: dataframe, list of columns to be analyzed | Plot object | Calculates the Pearson correlation of all specified columns and generates a plot visualizing the correlation coefficients |
| fast_missing_impute | 3 parameters: dataframe, a string specifying the missing data treatment method, list of columns to be treated | New dataframe without missing values in the specified columns | Given a dataframe and a list of columns in that dataframe, missing values are identified and treated as specified by the missing data treatment method |
The package can analyze the values of a given list of columns and identify outliers using either the Z-score algorithm or the interquartile range algorithm. Below is an example of a column to analyze.
sample_data = tibble::tibble("col_a" = c(5000, 50, 6, 8, NaN, 10, 5, 2, 3))
sample_data
## # A tibble: 9 x 1
## col_a
## <dbl>
## 1 5000
## 2 50
## 3 6
## 4 8
## 5 NaN
## 6 10
## 7 5
## 8 2
## 9 3
Calling the `fast_outlier_id` function on this data returns the following summary:
library(redahelper)
fast_outlier_id(data=sample_data,cols="All",method = "z-score",threshold_low_freq = 0.05)
## # A tibble: 1 x 8
## column_name type no_nans perc_nans outlier_method no_outliers perc_outliers
## <chr> <chr> <int> <dbl> <list> <int> <list>
## 1 col_a nume… 1 0.11 <chr [1]> 1 <dbl [1]>
## # … with 1 more variable: outlier_values <list>
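To make the two rules concrete, here is a minimal base R sketch of a Z-score rule and an interquartile range rule applied to the same column. The helper names `zscore_outliers` and `iqr_outliers`, the Z-score cutoff of 2, and the 1.5 × IQR fences are assumptions chosen for illustration; they are not taken from the redahelper source, which may use different thresholds.

```r
# Hypothetical helpers (not part of redahelper) illustrating the two rules.

# Z-score rule: flag values whose distance from the mean exceeds
# `threshold` sample standard deviations. The cutoff of 2 is an assumption.
zscore_outliers <- function(x, threshold = 2) {
  x <- x[!is.na(x)]                 # is.na() is TRUE for both NA and NaN
  z <- (x - mean(x)) / sd(x)
  x[abs(z) > threshold]
}

# IQR rule: flag values outside [Q1 - k * IQR, Q3 + k * IQR].
# k = 1.5 is the conventional choice, assumed here.
iqr_outliers <- function(x, k = 1.5) {
  x <- x[!is.na(x)]
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  fence <- k * (q[2] - q[1])
  x[x < q[1] - fence | x > q[2] + fence]
}

col_a <- c(5000, 50, 6, 8, NaN, 10, 5, 2, 3)
z_out <- zscore_outliers(col_a)
iqr_out <- iqr_outliers(col_a)
```

On this column the Z-score sketch flags only 5000, while the IQR sketch flags both 5000 and 50. With eight non-missing values, no sample Z-score can exceed 7 / sqrt(8) ≈ 2.47, so the common cutoff of 3 would flag nothing here; that is why the sketch uses 2.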
`redahelper` can also quickly create scatter, line, or bar plots from a data frame using the ggplot2 library. As an example, using the iris dataset:
library(redahelper)
fast_plot(df=iris, x="Sepal.Length", y="Sepal.Width", plot_type="scatter")
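The idea of switching plot type through a single argument, with a check on column types, can be sketched as follows. This is an illustrative sketch only: `plot_by_type` is a hypothetical name, the specific validation rule (a numeric y column for scatter and line plots) is an assumption, and none of it is taken from the redahelper source. It assumes ggplot2 is installed.

```r
# Hypothetical sketch (not the redahelper implementation).
plot_by_type <- function(df, x, y, plot_type = c("scatter", "line", "bar")) {
  plot_type <- match.arg(plot_type)          # rejects unknown plot types
  stopifnot(is.data.frame(df), x %in% names(df), y %in% names(df))

  # Assumed rule: scatter and line plots need a numeric y column.
  if (plot_type %in% c("scatter", "line") && !is.numeric(df[[y]])) {
    stop("y must be numeric for scatter and line plots")
  }

  # Pick the geom that matches the requested plot type.
  geom <- switch(plot_type,
    scatter = ggplot2::geom_point(),
    line    = ggplot2::geom_line(),
    bar     = ggplot2::geom_col()
  )

  # .data[[...]] looks up columns by their string names.
  ggplot2::ggplot(df, ggplot2::aes(x = .data[[x]], y = .data[[y]])) + geom
}
```

Calling `plot_by_type(iris, "Sepal.Length", "Sepal.Width", "scatter")` returns a ggplot object, and changing the last argument to `"line"` or `"bar"` swaps the geom without touching the rest of the call.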
The package can also easily create a Pearson correlation matrix from a data frame and the desired columns. As an example, using the iris dataset:
library(redahelper)
fast_corr(iris, c('Sepal.Length','Sepal.Width', 'Petal.Length','Petal.Width'))
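The numbers behind such a plot can be reproduced in base R. The sketch below keeps only the numeric columns among those requested and computes their Pearson correlation matrix. The helper name `numeric_pearson` and the `use = "pairwise.complete.obs"` handling of missing values are assumptions for illustration, not the package's implementation.

```r
# Hypothetical helper (not the redahelper implementation).
numeric_pearson <- function(df, cols) {
  stopifnot(is.data.frame(df), all(cols %in% names(df)))
  selected <- df[, cols, drop = FALSE]
  # Drop non-numeric (e.g. categorical) columns before computing correlations.
  is_num <- vapply(selected, is.numeric, logical(1))
  numeric_only <- selected[, is_num, drop = FALSE]
  # Assumed missing-value handling: use all complete pairs per column pair.
  cor(numeric_only, method = "pearson", use = "pairwise.complete.obs")
}

corr <- numeric_pearson(
  iris,
  c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
)
round(corr, 2)
```

Because `Species` is a factor, it is dropped and the result is a 4 x 4 matrix over the numeric measurements.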
Finally, `redahelper` can impute missing values, with a choice of methods: remove (removes all rows with missing data), mean, median, or mode imputation.
Below is a toy dataframe with missing values:
sample_data = tibble::tibble("col_1"= c(1L, NA, 3L, 3L, 5L, NaN),
"col_2"= c("a", NA, "d", "d", "f","e"))
sample_data
## # A tibble: 6 x 2
## col_1 col_2
## <dbl> <chr>
## 1 1 a
## 2 NA <NA>
## 3 3 d
## 4 3 d
## 5 5 f
## 6 NaN e
Using the `fast_missing_impute` function with the "mode" imputation method, the function returns:
library(redahelper)
fast_missing_impute(df = sample_data, method = "mode", cols = c("col_1", "col_2"))
## # A tibble: 6 x 2
## col_1 col_2
## <dbl> <chr>
## 1 1 a
## 2 3 d
## 3 3 d
## 4 3 d
## 5 5 f
## 6 3 e
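For readers curious what mode imputation involves, here is a minimal base R sketch that fills missing entries with each column's most frequent observed value. The helper names `most_frequent` and `impute_mode` are hypothetical, the tie-breaking rule (first value seen wins) is an assumption, and a plain `data.frame` stands in for the tibble to avoid extra dependencies.

```r
# Hypothetical helpers (not the redahelper implementation).

# Most frequent non-missing value; on ties the first value seen wins (assumed).
most_frequent <- function(x) {
  observed <- x[!is.na(x)]          # is.na() covers both NA and NaN
  values <- unique(observed)
  values[which.max(tabulate(match(observed, values)))]
}

# Replace missing entries in each listed column with that column's mode.
impute_mode <- function(df, cols) {
  for (col in cols) {
    missing <- is.na(df[[col]])
    df[[col]][missing] <- most_frequent(df[[col]])
  }
  df
}

sample_data <- data.frame(
  col_1 = c(1L, NA, 3L, 3L, 5L, NaN),
  col_2 = c("a", NA, "d", "d", "f", "e"),
  stringsAsFactors = FALSE
)
imputed <- impute_mode(sample_data, c("col_1", "col_2"))
```

On the toy data, the modes are 3 for `col_1` and "d" for `col_2`, so the sketch fills the same cells with the same values as the output shown above.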
At this time, multiple packages with similar functionality are used during EDA in both R and Python. Nevertheless, most of these existing packages require multiple steps or provide results that could be simplified.
In the `redahelper` package, the focus is to minimize the code a user needs to reach meaningful conclusions about outliers, missing data treatment, data visualization, and correlation computation and visualization.
The following table summarizes existing packages related to the procedures that the `redahelper` package simplifies.
| Related EDA Procedure | Language | Existing Packages/Functions |
| --------------------- | -------- | --------------------------- |
| Outlier identification | R | Test for Outliers |
| Outlier identification | R | Outlier Detection |
| Missing Value Treatment | R | Mice Package |
| Missing Value Treatment | R | Amelia Package |
| Data Visualization | R | ggplot2 |
| Correlation Visualization | R | corrplot |
How will the `redahelper` package compare to the existing packages/functions?
The `redahelper` package aims to provide a user-friendly experience by reducing the code needed to conduct an exploratory data analysis, specifically for identifying outliers, imputing missing data, and generating visualizations of relationships and correlations.
The fast_plot function builds on the ggplot2 package in R, but lets the user change the plot type by changing a single argument, and it includes error handling to ensure the column types are appropriate for the chosen plot.

While the R packages GGally, ggplot2, and corrplot offer similar functions for creating a Pearson correlation matrix, the fast_corr function provides a more user-friendly experience (less coding) and makes it easier to select the columns (features) for the analysis. It filters out the categorical columns and performs the analysis only on the numeric columns.

The R packages MICE, Amelia, and Hmisc likewise offer functions for imputing missing data. However, the fast_missing_impute function is likely more convenient because it involves less coding: the user simply selects the imputation method and the columns with missing data.

Finally, for outlier identification, the fast_outlier_id function gives users another option by combining existing methods into a single function. It automates the use of the Z-score and interquartile range methods to identify outliers.
This package was created with Cookiecutter and the UBC-MDS/cookiecutter-ubc-mds project template, modified from the pyOpenSci/cookiecutter-pyopensci project template and the audreyr/cookiecutter-pypackage.