knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

While you will probably work with GGenemy's main feature, the shiny app launched by GGenemy(), it is also beneficial to have a sound understanding of how the package's most important functions work. All of these functions are called during your workflow inside the app.

Start off by loading and attaching GGenemy.

library(GGenemy)

The diamonds11 dataset

To demonstrate how the core functions of the package work, we are going to use the diamonds11 dataset, which is included in GGenemy. Load the dataset and make yourself familiar with it.

data("diamonds11")
head(diamonds11)

As you can see, the data is almost identical to diamonds from ggplot2, but includes an additional column called area. This variable is a factor that indicates where the diamonds were found. Each category represents a country, namely

Since the variable is not yet saved as a factor, we have to correct this before moving on.

diamonds11$area <- as.factor(diamonds11$area)

Getting an Overview with describe()

Now that we have finished the groundwork, we can take a closer look at the data structure and some unconditional summary statistics of diamonds11. This can easily be done with the describe() function. It takes a dataset as its first argument and a character vector with a choice of summary statistics to display for numerical variables as a second argument.

describe(dataset, num.desc = c("min", "quantile0.25", "median", "mean",
                               "quantile0.75", "max", "var", "sd", "valid.n"))

Using describe() on diamonds11 gives us a deeper understanding of the dataset's numerical variables and factors.

describe(diamonds11, num.desc = c("min", "mean", "median", "max", "valid.n"))

describe() calculates the choice of unconditional summary statistics for the numerics. For factors, it counts the occurences of each category, computes their share in the dataset and displays the mode for each factor.

Conditional Summary Statistics

So far, we have only considered unconditional summary statistics. But GGenemy's main functionality is to tackle the issue of conditionality.

Numerical table outputs with sum_stats()

The sum_stats() function enables the calculation of conditional means, variances, skewness and kurtosis for numerical variables given another numeric variable. The command is of the following form:

sum_stats(dataset, given_var, stats = c("mean", "var", "skewness",
  "kurtosis"), n_quantiles = 5)

Its first argument is a dataset, which can include numerics, factors and logical variables. However, factors and logicals will be ignored for the conditional calculation. As a second input, you need to specify the name of a numerical variable from the dataset as a string. This will be the variable the summary statistics are conditoned on. Thirdly, the stats argument controls which conditional statistics will be computed. You can specify your choice either with a character vector (see above), or with a numerical vector with a (sub)set of c(1, 2, 3, 4). Lastly, n_quantiles defines how many quantiles the given_var will be split into. You can choose any integer between 1 and 10.

Executing the command for diamonds11 results into the following output:

diamonds_stats <- sum_stats(diamonds11, "carat", stats = c("mean", "var"), n_quantiles = 5)

diamonds_stats

However, this output is not particularly nice to look at. You can fix this by using print_sum_stats(x, given_var), which takes an object created by sum_stats() as its first argument and the name of the given variable as the second.

print_sum_stats(diamonds_stats, "carat")

Graphic outputs with plot_sum_stats()

While it is great to be able to compute conditional summary statistics with sum_stats(), the amount of information can be difficult to take in. plot_sum_stats() provides a graphical alternative to the tables, which displays the values for each variable in a line plot. The function takes the same arguments as sum_stats().

plot_sum_stats(diamonds11, "carat", stats = c("mean", "var"), n_quantiles = 5)

Conditional Densities, Boxplots and Bar Plots

The last main function integrated in GGenemy is called plot_GGenemy(). It enables the user to investigate the conditional distributions of both factors and numerical variables. It's syntax is a little more extensive than that of the previous functions:

plot_GGenemy(dataset, given_var, var_to_plot = NULL, n_quantiles = 5,
  boxplot = FALSE, selfrange = NULL, remaining = TRUE)

While dataset, given_var and n_quantiles are used in the same way as earlier, there are some new arguments that can be specified:

Basic arguments of plot_GGenemy()

Here are some examples on how to work with plot_GGenemy():

  1. Numerical variable price is plotted with its conditional density.
plot_GGenemy(diamonds11, "carat", var_to_plot = "price")
  1. Numerical variable price is printed as a boxplot.
plot_GGenemy(diamonds11, "carat", var_to_plot = "price", boxplot = TRUE)
Working with selfrange and remaining

Here are some more examples for the usage of the advanced arguments:

  1. Creating your own ranges for the segmentation of the given variable
plot_GGenemy(diamonds11, "carat", var_to_plot = "price", selfrange = c(0, 1, 1, 3, 3, 5),
             remaining = FALSE)
  1. Collecting leftover data in a remaining quantile
plot_GGenemy(diamonds11, "carat", var_to_plot = "price", selfrange = c(0, 1, 1, 2, 2, 3),
             remaining = TRUE)


tajohu/GGenemy documentation built on Nov. 5, 2019, 9:44 a.m.