analyzer
In analyzer: Data Analysis and Automated R Notebook Generation

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

analyzer

The goal of analyzer is to make data analysis easy and accessible to everyone by automatically performing the common tasks involved in data analysis. With one command this can also create an R notebook (.rmd) file and convert it into html or pdf file. The notebooks can also be made interactive.

Installation

You can install the released version of analyzer from CRAN with:

install.packages("analyzer")

And the development version from Github with:

# install.packages("devtools")
devtools::install_github("apxr/analyzer")

library(analyzer)
library(ggplot2)

Example - Notebook

This is a basic example which shows how to solve a common problem. By running the following command a notebook can be generated:

GenerateReport(dtpath = "mtcars.csv",
               catVars = c("cyl", "vs", "am", "gear"),
               yvar = "vs", model = "binClass",
               output_format = 'html_document',
               title = "Report", 
               output_dir = tempdir(),
               interactive.plots = FALSE)

In the above command we are asking to generate an HTML file (output_format = 'html_document') for the data mtcars.csv stored in the working directory (dtpath = "mtcars.csv"). We defined the variables ("cyl", "vs", "am", "gear") as categorical through the parameter 'catVars' and variable "vs" as the dependent variable.

For more details on other parameters, look into the help of GenerateReport function. The generated notebook will have all the relevant code snippets for this particular data along with the outputs. The notebook can be used as a reference and modified as required.

Example - Explaining variables

analyzer provides a method named explainer which can be used to explain any vector or data.frame in detail. This function prints the summary which also includes box plot (continuous vector) and frequency plot (categorical vector).

# For a continuous vector
explainer(mtcars$mpg, quant.seq = c(0, 0.25, 0.5, 1),
          include.numeric = c("trimmed.means", "skewness", "kurtosis"))

In above explainer we have asked to only give quantiles corresponding to 0 (minimum), 0.25, 0.50 (median) and 1 (maximum) probability. We also instructed the function to give extra statistics like trimmed mean, skewness and kurtosis.

# For a categorical vector
explainer(as.factor(mtcars$cyl))

The percentage mentioned after the bars (50%) shows the maximum percent for the bar plots. A bar of full length means 50%.

# For a complete data.frame
explainer(mtcars)

Example - Plots

analyzer can also be used to automatically generate plots for all the columns in the data. If the data has a dependent (response) variable pass it through the r 'yvar' argument and its type through r 'yclass' argument.

# Simple plot for one variable
p <- plottr(mtcars$mpg)
plot(p$x)

# default plots for all the variables in mtcars
p <- plottr(mtcars, yvar = "disp", yclass = "numeric")
plot(p$mpg)

The left plot is the univariate plot of 'mpg'. Right plot is bivariate plot with the dependent variable 'disp'.

External functions can be used to create custom plots. the function plottr can take 6 functions for plots of different types of variables. Look into the help of plottr for more details.

Below function updates the right side of plot which is the bivariate plot between the dependent and independent variable (both continuous).

# Define a function for plot for continuous independent and Continuous dependent variables
custom_plot_for_continuous_vars <- function(dat, xname, yname, ...) {

  xyplot <- ggplot(dat, aes_string(x=xname, y=yname)) +
    geom_point(alpha = 0.6, color = "#3c4fde") + geom_rug() +
    theme_minimal()
  xyplot <- gridExtra::arrangeGrob(xyplot,
                                   top = paste0("New plot of ",
                                                xname, " and ", yname)
  )

  return(xyplot)

}

The parameters dat, xname, yname and ... must be present. Additional arguments can be added.

p <- plottr(mtcars, yvar = "disp", yclass = "numeric", 
            FUN3 = custom_plot_for_continuous_vars)
plot(p$mpg)

FUN3 argument is used for generating plots for Continuous independent and Continuous dependent variables. This will update the right plot. Left (Univariate) plots can also be changed by using FUN1 (for Continous) and FUN2 (for Categorical) variables.

# Define a function for plot for continuous independent and Continuous dependent variables
custom_plot2 <- function(dat, xname, ...) {

  # histogram
  p1 <- ggplot(dat, aes_string(x=xname)) +
    geom_histogram(fill="#77bf85", color = "black") +
    theme_minimal()

  return(p1)

}

p <- plottr(mtcars, yvar = "disp", yclass = "numeric",
            FUN1 = custom_plot2, 
            FUN3 = custom_plot_for_continuous_vars)
plot(p$mpg)

Example - Correlation and Association

analyzer can also be used to get the measure of association between the variables for the complete dataset. For two continuous variables it can find the pearson, spearman and kendall correlation based on normality assumption. For two categorical variables it can be used to get the Chi Square p-value or Cramer's V value. Between one continuous and one categorical analyzer can use t-test, Mann-Whitney, Kruskal-Wallis and ANOVA test. The association test depends on multiple criteria including number of unique values in categorical feature, normality test and equal variance test. Equal variance test, in turn, is done by Bartlett's test or Fligner-Killeen test based on normality assumption. Normality test can also be performed using multiple tests (see the normality test section).

The most simple use is:

corr_all <- association(mtcars, categorical = c('cyl', 'vs', 'am', 'gear'), normality_test_method = 'ks')

In above example we have defined 'cyl', 'vs', 'am', 'gear' variables as categorical and set the Kolmogorov-Smirnov test as the normality test method. This returns a list of 6 data.frames. The function selects the method automatically based on different normality and equal variance tests. The selected methods can be seen by

corr_all$method_used

The methods for association can be changed by using the method1 (between 2 continuous variable) or method3 (between continuous and categorical variables). The default is 'auto' for automatic selection. Method can also be changed at the variables pair level by using methodMats argument in following way. Define a data.frame in following way and pass it through methodMats:

corr_all$method_used

This data.frame can take values like the one returned originally:

between continuous-continuous variables ("auto", "pearson", "kendall", "spearman")
between continuous-categorical variables ("auto", "t-test", "ANOVA", "Mann-Whitney", "Kruskal-Wallis")
between categorical-categorical variables can be anything

But its advisable to pass values like below because t-test and ANOVA are both for parametric test but depends on number of unique values in the categorical variable. If 'parametric' is used instead of 'ANOVA' or 't-test' then function identifies between 'ANOVA' or 't-test' automatically.

between continuous-continuous variables ("auto", "pearson", "kendall", "spearman")
between continuous-categorical variables ("auto", "parametric", "non-parametric")
between categorical-categorical variables can be anything

Example - Normality test

analyzer also gives the function to do normality test using three different methods:

Shapiro-Wilk test
Anderson-Darling test
Kolmogorov-Smirnov test

norm_test_fun(mtcars$mpg, method = "anderson")

Example - Analyzing merge

analyzer can be used to get the summary of data pre and post merge.

A <- data.frame(id = c("A", "A", "B", "D", "B"),
                valA = c(30, 83, 45, 2, 58))

B <- data.frame(id = c("A", "C", "A", "B", "C", "C"),
                valB = c(10, 20, 30, 40, 50, 60))


merged <- mergeAnalyzer(A, B, allow.cartesian = T)

Above print shows that left data (A) has sum of valA column = 218. After merge it became 329, which is 1.51 times (shown in column remainingWRTx). Similarly, right data (B) has sum of valB column = 210. After merge it became 160 which is 0.76 of original sum (shown in column remainingWRTy). First row shows the number of rows. in left (A), right (B) and merged data.