# Simple data exploration In psyntur: Helper Tools for Teaching Statistical Data Analysis

```knitr::opts_chunk\$set(
collapse = TRUE,
comment = "#>"
)
```

# Introduction

The `summarize` function in `dplyr`, especially when combined with `group_by` and `across`, provides powerful tools for exploring data using summary statistics. The `psyntur` package provides some wrappers to these tools to allow data exploration, albeit of a limited kind, to be done quickly and easily. We explore some of these functions in this vignette.

Load the `psyntur` functions and data sets with the usual `library` command.

```library(psyntur)
```

# Summary statistics with `describe`

We can use the `describe` function in `psyntur`. The first argument to `describe` should be the data frame. Subsequent arguments should be named arguments of summary statistics functions, like `mean`, `median`, etc., applied to any variables in the data frame. For example, using the `faithfulfaces` data frame, we can obtain the arithmetic mean and standard deviation of the `faithful` variable as follows.

```describe(data = faithfulfaces, avg = mean(faithful), stdev = sd(faithful))
```

We can apply the same or different functions to the same or different variables.

```describe(data = faithfulfaces,
avg_faith = mean(faithful),
avg_trust = mean(trustworthy),
sd_trust = sd(trustworthy))
```

We can obtain the summary statistics for the chosen variables for each group of a third variable using a `by` variable.

```describe(data = faithfulfaces, by = face_sex,
avg = mean(faithful), stdev = sd(faithful))
```

The `by` argument may be a vector of variables. In this case, the chosen variables are grouped by the combination of the `by` variables. For example, in the following we group the `time` variable in `vizverb` by both `task` and `response`.

```describe(vizverb, by = c(task, response),
avg = mean(time),
median = median(time),
iqr = IQR(time),
stdev = sd(time)
)
```

# Multiple summary functions to multiple variables

It would be tedious and repetitive to use `describe` as above if wanted to apply the same set of summary statistic functions to a set of variables. Instead, we can use `describe_across`. For example, to calculate the mean, median, standard deviation to two variables, `trustworthy` and `faithful`, in the `faithfulfaces` data set, we can do the following.

```describe_across(faithfulfaces,
variables = c(trustworthy, faithful),
functions = list(avg = mean, median = median, stdev = sd)
)
```

Note that the data frame that is returned is in a wide format. We can pivot this to a longer format by saying `pivot = TRUE`.

```describe_across(faithfulfaces,
variables = c(trustworthy, faithful),
functions = list(avg = mean, median = median, stdev = sd),
pivot = TRUE
)
```

We can use the `by` variable to calculate the summary statistics for each subgroup corresponding to each value of the `by` variable, as in the following example.

```describe_across(faithfulfaces,
variables = c(trustworthy, faithful),
functions = list(avg = mean, median = median, stdev = sd),
by = face_sex,
pivot = TRUE
)
```

As in the case of `describe`, the `by` argument can be a vector of variables.

# Dealing with missing values with `_xna`

When variable have `NA` values, most summary statistics function will, by default, return `NA`. To illustrate this, we can modify `faithfulfaces` to contain `NA`'s for the `faithful` variable.

```faithfulfaces_na <- faithfulfaces %>%
dplyr::mutate(faithful = ifelse(faithful > 6, NA, faithful))
```

Now, if we try one of the above `describe` or `describe_aross` functions with the `faithful` variable, we will obtain corresponding `NA` values.

```describe_across(faithfulfaces_na,
variables = c(trustworthy, faithful),
functions = list(avg = mean, median = median, stdev = sd),
by = face_sex,
pivot = TRUE
)
```

Of course, if we set `na.rm = TRUE` in any or all of the summary functions, we will remove the `NA` values before the statistics are calculated. This is relatively easy to do with `describe`, as in the following example.

```describe(data = faithfulfaces, by = face_sex,
avg = mean(faithful, na.rm = T), stdev = sd(faithful, na.rm = T))
```

However, for `describe` across, we pass in a list of functions, and so to set `na.rm = T`, we can to create `purrr` style anonymous functions calling the summary statistic function with `na.rm = T`, as in the following example.

```library(purrr)
describe_across(faithfulfaces_na,
variables = c(trustworthy, faithful),
functions = list(avg = ~mean(., na.rm = T),
median = ~median(., na.rm = T),
stdev = ~sd(., na.rm = T)),
by = face_sex,
pivot = TRUE
)
```

Anonymous function like this are not very transparent for those new to R, and the resulting function looks quite complex.

In order to avoid using code like `~mean(., na.rm = T)`, for a number of commonly used summary statistic functions (`sum`, `mean`, `median`, `var`, `sd`, `IQR`), we have made counterparts where `na.rm` is set to `TRUE` by default. These functions have the same name as the original with the suffix `_xna` (but `IQR` is `iqr_xna`, not `IQR_xna`). As such, we can do the following.

```describe_across(faithfulfaces_na,
variables = c(trustworthy, faithful),
functions = list(avg = mean_xna, median = median_xna, stdev = sd_xna),
by = face_sex,
pivot = TRUE
)
```

## Try the psyntur package in your browser

Any scripts or data that you put into this service are public.

psyntur documentation built on Sept. 15, 2021, 5:07 p.m.