knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(redahelper)

Introduction to redahelper

When conducting an exploratory data analysis, users must:

The redahelper package aims to provide a more user-friendly experience through the following:

This document outlines redahelper's toolkit and provides examples of how to use its functions.

Data: airquality

To explore the various functions of the redahelper package, this vignette will use the airquality dataset, which is part of the base R datasets. It is documented in ?airquality.

summary(airquality)

dim(airquality)

head(airquality)
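Because a later section deals with missing data, it can help to see where the missing values sit before going further. The following is a minimal base-R sketch that does not use redahelper; it simply counts the NA entries in each column.

```r
# Count missing values per column of the built-in airquality dataset.
missing_per_column <- colSums(is.na(airquality))
missing_per_column

# Names of the columns that contain at least one missing value.
names(missing_per_column)[missing_per_column > 0]
```

In this dataset only Ozone and Solar.R contain missing values, which is why those two columns are used in the imputation examples.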

Identifying Outliers with fast_outlier_id()

fast_outlier_id() allows the user to identify outliers in the data. The arguments are the dataframe (or tibble), the columns the user wants information on, the method of identifying outliers (either z-score or interquartile), and the threshold for evaluating outliers in categorical columns. The output is a dataframe summarizing the outliers found, leaving the choice of how to proceed to the user.

For example, identifying outliers with the z-score method across all columns of the dataframe:

fast_outlier_id(data = airquality, cols = "ALL", method = "z-score", threshold_low_freq = 0.05)
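To make the two methods concrete, here is a minimal base-R sketch of a z-score rule and an interquartile rule applied to the Ozone column. This is not redahelper's implementation: the z-score cutoff of 3 and the 1.5 x IQR fence are conventional choices assumed here for illustration, and the package may use different values.

```r
x <- airquality$Ozone

# z-score rule: flag values far from the mean, measured in standard deviations.
# The cutoff of 3 is an assumed, conventional choice.
z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
z_outliers <- which(abs(z) > 3)

# Interquartile rule: flag values more than 1.5 * IQR beyond the quartiles.
# The multiplier of 1.5 is likewise an assumption.
quartiles <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
fence <- 1.5 * IQR(x, na.rm = TRUE)
iqr_outliers <- which(x < quartiles[1] - fence | x > quartiles[2] + fence)

# Show the flagged rows; which() drops missing values automatically.
airquality[union(z_outliers, iqr_outliers), c("Month", "Day", "Ozone")]
```

The two rules need not agree: the z-score depends on the mean and standard deviation, which outliers themselves pull around, while the interquartile rule is based on quartiles and is less sensitive to extreme values.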

Plotting variables with fast_plot()

fast_plot() creates an exploratory plot of two columns from the dataframe (or tibble) using ggplot2. The arguments are the dataframe (or tibble) of interest, the two columns of interest (for the x and y axes), and the type of plot to generate: scatter, line, or bar. The output is a ggplot2 plot of the selected columns in the selected plot type. The function includes error handling to ensure the user selects an appropriate plot; for example, it will not allow a bar chart when both x and y are non-numeric.

For example, plotting a scatter plot of Ozone and Temp:

fast_plot(df = airquality, x = "Ozone", y = "Temp", plot_type = "scatter")

As another example, a bar plot of Wind by Month:

fast_plot(df = airquality, x = "Month", y = "Wind", plot_type = "bar")
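The kind of input checking described for fast_plot() can be sketched in base R as follows. The helper name check_plot_inputs() and its error messages are hypothetical and are not part of redahelper; the sketch only illustrates the rule that a bar plot is rejected when both columns are non-numeric.

```r
# Hypothetical helper, not part of redahelper: validates plot inputs
# before any plotting code runs.
check_plot_inputs <- function(df, x, y, plot_type) {
  if (!is.data.frame(df)) {
    stop("df must be a data frame or tibble")
  }
  missing_cols <- setdiff(c(x, y), names(df))
  if (length(missing_cols) > 0) {
    stop("columns not found in df: ", paste(missing_cols, collapse = ", "))
  }
  if (!plot_type %in% c("scatter", "line", "bar")) {
    stop("plot_type must be one of 'scatter', 'line', or 'bar'")
  }
  # A bar plot needs at least one numeric column to supply bar heights.
  if (plot_type == "bar" && !is.numeric(df[[x]]) && !is.numeric(df[[y]])) {
    stop("a bar plot needs at least one numeric column")
  }
  invisible(TRUE)
}

# Passes silently: Month and Wind are both numeric.
check_plot_inputs(airquality, "Month", "Wind", "bar")
```

Failing early with a descriptive message is usually easier to act on than an error raised deep inside the plotting library.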

Exploring correlations with fast_corr()

fast_corr() allows the user to view the Pearson correlation coefficients between selected variables in the dataframe (or tibble). The arguments are the dataframe (or tibble) and the columns of interest. The output is a correlation matrix of the selected columns.

For example, assessing the correlations among Wind, Temp, and Month:

fast_corr(df = airquality, selected_columns = c("Wind", "Temp", "Month"))
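For comparison, the same Pearson coefficients can be computed with base R's cor(). This is only a sketch of the underlying calculation; how fast_corr() formats or displays its output is not shown here.

```r
# Pearson correlation matrix for three columns using base R.
# These columns contain no missing values, so the default handling is enough;
# for columns with NAs, cor() accepts use = "complete.obs".
corr_matrix <- cor(airquality[, c("Wind", "Temp", "Month")], method = "pearson")
round(corr_matrix, 2)
```

The matrix is symmetric with ones on the diagonal, so only the entries above or below the diagonal carry information.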

Identify and impute missing data with fast_missing_impute()

fast_missing_impute() allows the user to impute missing values in selected columns of the dataframe (or tibble) with a chosen method. The arguments are the dataframe (or tibble), the method of imputation (mean, median, or mode, or remove to drop all rows with missing data in the selected columns), and the columns of interest. The output is a dataframe with the imputed values.

For example, imputing the missing values in Ozone and Solar.R with the median of the respective columns:

imputed_median <- fast_missing_impute(df = airquality, method = "median", cols = c("Ozone", "Solar.R"))
head(imputed_median)

As another example, using the same columns but now removing the rows that have missing data in Ozone and Solar.R:

imputed_remove <- fast_missing_impute(df = airquality, method = "remove", cols = c("Ozone", "Solar.R"))
head(imputed_remove)
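The two strategies above can be sketched in base R as follows. This is an illustration rather than the package's implementation, and it assumes that "remove" drops a row when any of the selected columns is missing.

```r
cols <- c("Ozone", "Solar.R")

# Median imputation: replace each NA with the median of its own column.
median_filled <- airquality
for (col in cols) {
  col_median <- median(median_filled[[col]], na.rm = TRUE)
  median_filled[[col]][is.na(median_filled[[col]])] <- col_median
}

# Removal: keep only the rows with no NA in any of the selected columns.
rows_kept <- airquality[complete.cases(airquality[, cols]), ]

# Imputation keeps every row; removal shrinks the data.
nrow(median_filled)
nrow(rows_kept)
```

Imputation preserves the sample size but pulls the filled values toward the center of the distribution, while removal keeps only observed values at the cost of fewer rows.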

Comparisons

Compared to existing options, redahelper:



UBC-MDS/redahelper documentation built on April 2, 2020, 3:59 a.m.