In UBC-MDS/nurser: Make EDA process

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

`nurser`

An R package for streamlining the front end of the machine learning workflow.

Summary

Exploratory data analysis (EDA) and feature preprocessing are common to the front end of most machine learning pipelines. EDA facilitates a better understanding of the data being analyzed and allows for targeted, more robust model development, while feature imputation and preprocessing are required by many machine learning algorithms. nurser aims to streamline the front end of the machine learning pipeline by generating descriptive summary tables and figures, producing various feature imputation summaries, and automating preprocessing. Automated preprocessing detection has been implemented to minimize time and optimize the processing methods used. The functions in nurser were developed to provide useful and informative metrics that are applicable to a wide array of datasets.

A vignette for this package can be found here.

nurser was developed as part of DSCI 524 of the MDS program at UBC.

Installation:

You can install the released version of nurser from CRAN with:

install.packages("nurser")

The development version can be downloaded from GitHub with:

# install.packages("devtools")
devtools::install_github("UBC-MDS/nurser")

Features

The package includes the following three functions:

| Function | Input | Output | Description |
|----------|-------|--------|-------------|
| eda | a dataframe | a list that contains a histogram and summary statistics for each column | Functionality for easy exploratory data analysis |
| impute_summary | a dataframe | a list with summary statistics and outputs of different imputation methods | Functionality for consolidating several imputation methods |
| preproc | a tibble or dataframe | a tibble with preprocessed features | Functionality for automatic feature preprocessing detection and user-defined feature preprocessing |

R Ecosystem

nurser was developed to align with:

tidyverse

The impute_summary function leverages the imputation methods found in the following packages:

Hmisc
mi
mice
missForest

However, the functions herein streamline and automate the front-end machine learning pipeline for use with any machine learning package.

Dependencies

ggplot2=3.3.0
tibble=2.1.3
fastDummies=1.6.1
stats=3.6.2
Hmisc=4.3-1
mi=1.0
mice=3.8.0
missForest=1.4

Usage

library(nurser)
library(magrittr)

`eda`

The eda() function returns a list that contains a histogram and summary statistics for each column. Let's see it in action!

To view a histogram of a feature:

result <- eda(mtcars)

hist_mpg <- result$histograms[[1]]
hist_mpg

Now let's see the summary statistics of this feature:

stats_mpg <- result$stats$mpg
stats_mpg
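For intuition about the kind of output shown above, here is a minimal base R sketch that builds per-column summary statistics and histogram objects for mtcars. It is not nurser's implementation: `summarise_column` is a hypothetical helper, and the particular statistics chosen are an assumption for illustration.

```r
# Base R sketch of per-column summaries, similar in spirit to eda().
# `summarise_column` is a hypothetical helper, not a nurser function.
summarise_column <- function(x) {
  c(mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    min    = min(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE))
}

# One named numeric vector of statistics per column of mtcars
column_stats <- lapply(mtcars, summarise_column)
column_stats$mpg

# Histogram objects, computed without drawing them
column_hists <- lapply(mtcars, hist, plot = FALSE)

# Draw the histogram for a single feature on demand
plot(column_hists$mpg, main = "mpg", xlab = "mpg")
```

Keeping the statistics and the plots in separate named lists mirrors the `result$stats` and `result$histograms` access pattern used above.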

`impute_summary`

Let's import some continuous data to work with,

iris_data <- iris[1:4]

and add some missing values,

iris_missing <- 
  as.data.frame(lapply(iris_data, 
                       function(x) x[sample(c(TRUE, NA),
                                            size = length(x), 
                                            replace = TRUE,
                                            prob = c(0.75, 0.25))]))

Now, let's look at the data to confirm that the missing values were generated and to see where they are:

iris_missing %>% head(10)

Great, we have some missing values to impute. Let's call impute_summary to get some summary statistics and outputs from different methods.

iris_imputed <- impute_summary(iris_missing)

impute_summary() provides some useful summary statistics and several imputed data frames, all of which can be accessed with $ on the returned object. The imputed data frames include:

mean,
median,
max,
min,
random,
multiple imputation,
pmm, and
random forest

Let's first take a look at the summaries, which can be accessed using $nan_counts (NA counts for each feature) and $nan_rowindex (rows that contain NA values):

iris_imputed$nan_counts

iris_imputed$nan_rowindex %>% head(5)
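The same two summaries can be cross-checked in base R. The sketch below does not call nurser: the `demo` data frame, the seed, and the choice to blank out 20 values in one column are all assumptions made for illustration.

```r
# Base R cross-check for NA counts per feature and rows containing NAs.
# `demo` is a hypothetical data frame built only for this example.
set.seed(524)  # arbitrary seed so the example is reproducible
demo <- iris[1:4]
demo[sample(nrow(demo), 20), "Sepal.Length"] <- NA

# Number of missing values in each column
na_counts <- colSums(is.na(demo))
na_counts

# Indices of rows that contain at least one missing value
na_rows <- which(!complete.cases(demo))
head(na_rows, 5)
```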

Now, let's take a look at two of the imputed data frames, mean and multiple imputation:

iris_imputed$hmisc_mean %>% head(10)

iris_imputed$mi_multimp %>% head(10)
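To make the simplest methods in the list above concrete, here is a minimal base R sketch of mean and median imputation on a toy data frame. It illustrates the general technique only and is not how impute_summary computes its results; `impute_with` and `toy` are hypothetical names.

```r
# Hypothetical helper: replace the NAs in a numeric vector with a
# single value computed from the observed entries by `fun`.
impute_with <- function(x, fun) {
  x[is.na(x)] <- fun(x, na.rm = TRUE)
  x
}

# Toy data with one missing value per column
toy <- data.frame(a = c(1, NA, 3, 8), b = c(NA, 2, 4, 6))

# Column-wise mean and median imputation
toy_mean   <- as.data.frame(lapply(toy, impute_with, fun = mean))
toy_median <- as.data.frame(lapply(toy, impute_with, fun = median))

toy_mean
toy_median
```

Mean and median imputation differ whenever a column is skewed: in column `a` the large value 8 pulls the mean above the median, so the two methods fill the gap with different numbers.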

`preproc`

The preproc() function returns a tibble with preprocessed features. Simply call preproc on your data!

Let's first view our data before preprocessing:

head(iris)

and now after calling preproc:

results <- preproc(iris)
head(results)
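As a rough picture of what feature preprocessing can involve, the base R sketch below standardizes the numeric columns of iris and one-hot encodes its factor column. Whether preproc applies these exact transformations is an assumption; the sketch uses only base R and the stats package rather than nurser, and all variable names are hypothetical.

```r
# Base R sketch of two common preprocessing steps (not nurser's code).

# Detect which columns are numeric
numeric_cols <- vapply(iris, is.numeric, logical(1))

# Standardize numeric features to mean 0 and standard deviation 1
scaled <- as.data.frame(scale(iris[numeric_cols]))

# One-hot encode the categorical feature; "- 1" drops the intercept
# so that every level of Species gets its own indicator column
dummies <- as.data.frame(model.matrix(~ Species - 1, data = iris))

preprocessed <- cbind(scaled, dummies)
head(preprocessed)
```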

Documentation

UBC-MDS/nurser documentation built on April 3, 2020, 4:22 a.m.
