An R package that simplifies the main EDA procedures: outlier identification, data visualization, correlation analysis, and missing data imputation.
| Ofer Mansour | Suvarna Moharir | Subing Cao | Manuel Maldonado |
|:------------:|:--------------:|:--------------:|:--------------:|
We are aware that understanding and cleaning data takes up roughly 60% of a data scientist's time on any project. Our goal with this package is to simplify that process and make more efficient use of time on the main EDA procedures (outlier identification, data visualization, correlation, missing data imputation).
To start using our package, please follow these instructions:
1. Make sure `devtools` is installed on your computer. If not, open the console and input the following: `install.packages('devtools')`
2. Load `devtools` by inputting this command into the console: `library(devtools)`
3. Install the `redahelper` package by inputting this command into the console: `devtools::install_github("UBC-MDS/redahelper")`
| Function Name | Input | Output | Description |
|---------|------------|------|-----------|
| fast_outliers_id | 3 parameters: a dataframe, a list of columns to include in the analysis, and the method used to identify outliers ("Z-score algorithm" or "Interquartile Range") | Dataframe with the included columns and the outlier values identified, plus the % of counts considered outliers for each analyzed column | Given a dataframe, the listed columns are searched for outlier values. Returns a dataframe summarizing the outlier values found and the % of counts affected by them |
| fast_plot | 4 parameters: dataframe, name of the X column, name of the Y column, plot name | Plot object | Given a dataframe, the columns to be used as X and Y respectively, and the desired plot type, the function computes and returns the specified plot |
| fast_corr | 2 parameters: dataframe, list of columns to be analyzed | Correlation plot object | Calculates the correlation of all specified columns and generates a plot visualizing the correlation coefficients |
| fast_missing_impute | 3 parameters: dataframe, a string specifying the missing data treatment method, list of columns to be treated | New dataframe without missing values in the specified columns | Given a dataframe and a list of columns in that dataframe, missing values are identified and treated as specified by the missing data treatment method |
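
To make the two outlier methods named in the table concrete, here is a minimal base R sketch of Z-score and interquartile-range flagging. It is not the redahelper implementation: the helper names (`flag_outliers_zscore`, `flag_outliers_iqr`, `outlier_share`), the default threshold of 3, and the 1.5 multiplier are assumptions chosen for illustration.

```r
# Hypothetical helpers for illustration only; not the redahelper source code.

# Flag values whose absolute z-score exceeds `threshold` (3 is a common choice).
flag_outliers_zscore <- function(x, threshold = 3) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  !is.na(z) & abs(z) > threshold
}

# Flag values outside [Q1 - k * IQR, Q3 + k * IQR] (k = 1.5 is the usual rule).
flag_outliers_iqr <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  !is.na(x) & (x < q[1] - k * spread | x > q[2] + k * spread)
}

# Report the percentage of flagged values for each selected numeric column.
outlier_share <- function(df, cols, method = c("zscore", "iqr")) {
  method <- match.arg(method)
  flag <- if (method == "zscore") flag_outliers_zscore else flag_outliers_iqr
  data.frame(
    column = cols,
    pct_outliers = vapply(cols, function(col) 100 * mean(flag(df[[col]])), numeric(1)),
    row.names = NULL
  )
}

# Toy data: column `a` contains one extreme value, column `b` contains none.
toy <- data.frame(a = c(1, 2, 3, 4, 100), b = c(10, 11, 12, 13, 14))
outlier_share(toy, c("a", "b"), method = "iqr")
```

On very small samples the Z-score rule can miss an extreme value that the interquartile rule catches, because the extreme value inflates the standard deviation. That is one reason to expose both methods behind a single argument.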
At this time, multiple packages with similar functionality are used during EDA in both R and Python. Nevertheless, most of these existing packages require multiple steps or provide results that could be simplified.
In our redahelper package, our focus is to minimize the code a user needs to write to reach meaningful conclusions about outliers, missing data treatment, data visualization, and correlation computation and visualization.
In the following table we have summarized existing packages that are related to the procedures that are simplified in our redahelper package.
| Related EDA Procedure | Language | Existing Packages/Functions |
|---------|------------|---------------------------|
| Outlier identification | R | Test for Outliers |
| Outlier identification | R | Outlier Detection |
| Missing Value Treatment | R | Mice Package |
| Missing Value Treatment | R | Amelia Package |
| Data Visualization | R | ggplot2 |
| Correlation Visualization | R | corrplot |
How does our package compare to these existing packages/functions?
The redahelper package aims to provide a user-friendly experience by reducing the code needed to conduct an exploratory data analysis, specifically for identifying outliers, imputing missing data, and generating visualizations of relationships and correlations.
The fast_plot function leverages the ggplot2 package in R, and improves on it by letting the user switch plot type by changing a single argument and by including error handling to ensure appropriate column types for certain plots.

While the R packages "GGally", "ggplot2" and "corrplot" have similar functions for creating a correlation matrix, our correlation function provides a more user-friendly (less coding) experience and makes it easier to select the columns (features) for the analysis. It filters out the categorical columns and performs the analysis only on the numeric columns.

The R packages "MICE", "Amelia", and "Hmisc", on the other hand, have similar functions for imputing missing data. Our function is likely more convenient for the user because it involves less coding: the user simply selects the imputation method and the columns with missing data.

Finally, for outlier identification, our package offers users another option by combining existing methods into a single function. It automates the use of the Z-score and interquartile range methods to identify outliers.
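
The numeric-column filtering and the method-plus-columns style of imputation described above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's code: `numeric_correlation` and `impute_columns` are hypothetical names, and the "mean" and "median" options are assumed examples of imputation methods.

```r
# Hypothetical helpers for illustration only; not the redahelper source code.

# Keep only the numeric columns among `cols`, then compute their correlation matrix.
numeric_correlation <- function(df, cols = names(df)) {
  selected <- df[, cols, drop = FALSE]
  numeric_only <- selected[, vapply(selected, is.numeric, logical(1)), drop = FALSE]
  cor(numeric_only, use = "pairwise.complete.obs")
}

# Replace missing values in the chosen columns with the column mean or median.
impute_columns <- function(df, cols, method = c("mean", "median")) {
  method <- match.arg(method)
  fill <- if (method == "mean") mean else median
  for (col in cols) {
    missing <- is.na(df[[col]])
    df[[col]][missing] <- fill(df[[col]], na.rm = TRUE)
  }
  df
}

# Toy data: one numeric column with a gap, one complete numeric column, one categorical column.
toy <- data.frame(
  x = c(1, 2, NA, 4),
  y = c(2, 4, 6, 8),
  label = c("a", "b", "c", "d")
)

numeric_correlation(toy)                  # the categorical `label` column is dropped
impute_columns(toy, "x", method = "median")
```

Dropping non-numeric columns before calling `cor()` avoids the error that `cor()` raises on character or factor input, which is the convenience the paragraph above describes.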
This package was created with Cookiecutter and the UBC-MDS/cookiecutter-ubc-mds project template, modified from the pyOpenSci/cookiecutter-pyopensci project template and the audreyr/cookiecutter-pypackage.