Using Skimr

Introduction

skimr is designed to provide summary statistics about variables in data frames, tibbles, data tables and vectors. It is opinionated in its defaults, but easy to modify.

In base R, the most similar functions are summary() for vectors and data frames and fivenum() for numeric vectors:

summary(iris)
summary(iris$Sepal.Length)
fivenum(iris$Sepal.Length)
summary(iris$Species)

The skim() function

The core function of skimr is skim(), which is designed to work with (grouped) data frames and will try to coerce other objects to data frames if possible. Like summary(), skim()'s method for data frames presents results for every column; the statistics it provides depend on the class of the variable.

Skimming data frames

By design, the main focus of skimr is on data frames; it is intended to fit well within a data pipeline and relies extensively on tidyverse vocabulary, which focuses on data frames.

Results of skim() are printed horizontally, with one section per variable type and one row per variable.

library(skimr)
skim(iris)

The result is a single wide data frame that combines all of the statistics, with some additional attributes and two metadata columns: skim_type (the class of the variable) and skim_variable (the name of the original variable).

Unlike many other objects within R, these columns are intrinsic to the skim_df class. Dropping these variables will result in a coercion to a tibble. The is_skim_df() function is used to assert that an object is a skim_df.

skim(iris) %>% is_skim_df()
skim(iris) %>%
  dplyr::select(-skim_type, -skim_variable) %>% is_skim_df()
skim(iris) %>%
  dplyr::select(-n_missing) %>% is_skim_df()

To avoid type coercion, columns of summary statistics for different types are prefixed with the corresponding skim_type. This means that the columns of the skim_df are somewhat sparse, with quite a few missing values. The prefixing is needed because some statistics are represented differently for different types of variables. For example, the mean of a Date variable and the mean of a numeric variable are represented differently when printing, and a single vector cannot hold both. The exceptions are n_missing and complete_rate (the share of observations that are not missing), which are the same for all types of variables.

skim(iris) %>%
  tibble::as_tibble()
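As a rough check of what the two base statistics mean, the sketch below recomputes them by hand for the built-in airquality data, which contains missing values. The by_hand data frame is a name made up for this illustration, and the comparison assumes complete_rate is one minus the share of missing values.

```r
library(skimr)

# airquality ships with base R; Ozone and Solar.R contain NAs.
skimmed <- skim(airquality)

# Recompute the base statistics without skimr for comparison.
by_hand <- data.frame(
  skim_variable = names(airquality),
  n_missing = colSums(is.na(airquality)),
  complete_rate = 1 - colMeans(is.na(airquality))
)

by_hand
tibble::as_tibble(skimmed)[, c("skim_variable", "n_missing", "complete_rate")]
```

The two tables should agree column for column, since neither statistic depends on the type of the variable.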

This is in contrast to summary.data.frame(), which stores statistics in a table. The distinction is important, because the skim_df object is pipeable and easy to use for additional manipulation: for example, the user could select all of the variable means, or all summary statistics for a specific variable.

skim(iris) %>%
  dplyr::filter(skim_variable == "Petal.Length")

Most dplyr verbs should work as expected.

skim(iris) %>%
  dplyr::select(skim_type, skim_variable, n_missing)

The base skimmers n_missing and complete_rate are computed for all of the columns in the data. All other type-based skimmers are namespaced: you need the skim_type prefix to refer to the correct column.

skim(iris) %>%
  dplyr::select(skim_type, skim_variable, numeric.mean)
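If you are unsure which prefixed names exist, one option is to inspect the column names of the result directly. This is only a small sketch; numeric_cols is an illustrative variable name.

```r
library(skimr)

skimmed <- skim(iris)

# Every column produced by a numeric skimmer carries the "numeric." prefix.
numeric_cols <- grep("^numeric\\.", names(skimmed), value = TRUE)
numeric_cols

# The base skimmers have no prefix.
intersect(names(skimmed), c("n_missing", "complete_rate"))
```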

skim() also supports grouped data created by dplyr::group_by(). In this case, one additional column for each grouping variable is added to the skim_df object.

iris %>%
  dplyr::group_by(Species) %>%
  skim()
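To see the extra grouping column, it can help to look at the plain tibble behind the grouped result. The sketch below uses nested calls rather than a pipe so that it only needs skimr and dplyr to be installed; grouped is an illustrative name.

```r
library(skimr)

grouped <- skim(dplyr::group_by(iris, Species))

# The grouping variable appears as its own column next to the metadata columns.
names(grouped)[1:3]

# One row per combination of skimmed variable and group.
nrow(grouped)
```

With iris this should give one row for each of the four numeric variables within each of the three species.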

Individual columns from a data frame may be selected using tidyverse-style selectors.

skim(iris, Sepal.Length, Species)

Or with common select helpers.

skim(iris, starts_with("Sepal"))

If an individual column is of an unsupported class, it is treated as a character variable with a warning.

Skimming vectors

In skimr v2, skim() will attempt to coerce non-data frames (such as vectors and matrices) to data frames. In most cases, skimming a vector should be equivalent to skimming the result of wrapping it in as.data.frame().

For example, the lynx data set is class ts.

skim(lynx)

This is the same as coercing to a data frame first.

all.equal(skim(lynx), skim(as.data.frame(lynx)))

Skimming matrices

skimr does not support skimming matrices directly but coerces them to data frames. Columns in the matrix become variables. This behavior is similar to summary.matrix(). The three possible ways to handle matrices with skim() parallel the three variations of the mean function for matrices.

m <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), nrow = 4, ncol = 3)
m

Skimming the matrix produces similar results to colMeans().

colMeans(m)
skim(m) # Similar to summary.matrix and colMeans()

Skimming the transpose of the matrix will give row-wise results.

rowMeans(m)
skim(t(m))

Call c() on the matrix to flatten it and get results across all of its values.

skim(c(m))
mean(m)
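The parallel with colMeans() can be checked directly by pulling the numeric.mean column out of the skimmed result. This is a minimal sketch using the same matrix as above; it assumes the matrix holds doubles, so every column is skimmed as numeric.

```r
library(skimr)

m <- matrix(as.numeric(1:12), nrow = 4, ncol = 3)

skimmed <- skim(m)

# One skimmed row per matrix column, so the means line up with colMeans().
skimmed$numeric.mean
colMeans(m)
```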

Skimming without modification

skim_tee() produces the same printed version as skim() but returns the original, unmodified data frame. This allows for continued piping of the original data.

iris_setosa <- iris %>%
  skim_tee() %>%
  dplyr::filter(Species == "setosa")
head(iris_setosa)
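A quick way to convince yourself that the data passes through untouched is to compare the return value with the input. The name returned below is only illustrative.

```r
library(skimr)

# skim_tee() prints the summary as a side effect and hands back its input.
returned <- skim_tee(iris)

identical(returned, iris)
```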

Note that skim_tee() is customized differently than skim() itself. See below for more details.

Reshaping the results from skim()

As noted above, skim() returns a wide data frame. This is usually the most sensible format for the majority of operations when investigating data, but the package has some other functions to help with edge cases.

First, partition() returns a named list of the wide data frames for each data type. Unlike the original skim_df, each partitioned data frame only has columns corresponding to the skimming functions used for its data type. These data frames are, therefore, not skim_df objects.

iris %>%
  skim() %>%
  partition()
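Because the result is a named list, the individual tables can be pulled out by type name. The sketch below inspects the pieces for iris; parts is an illustrative name.

```r
library(skimr)

parts <- partition(skim(iris))

# One list element per skim_type found in the data.
names(parts)

# Within a partition the type prefix is gone, e.g. "mean" instead of "numeric.mean".
names(parts$numeric)
```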

Alternatively, yank() selects only the subtable for a specific type. Think of it like dplyr::select on column types in the original data. Again, unsuitable columns are dropped.

iris %>%
  skim() %>%
  yank("numeric")

to_long() returns a single long data frame with columns variable, type, statistic and formatted. This is similar but not identical to the skim_df object in skimr v1.

iris %>%
  skim() %>%
  to_long() %>% 
  head()

Since the skim_variable and skim_type columns are a core component of the skim_df class, it's possible to get unwanted side effects when using dplyr::select(). Instead, use focus() to select columns of the skimmed results and keep them as a skim_df; it always keeps the metadata columns.

iris %>%
  skim() %>%
  focus(n_missing, numeric.mean)
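To confirm that the class survives, you can check the focused result with is_skim_df() and look at its column names. This is a small sketch; focused is an illustrative name.

```r
library(skimr)

focused <- focus(skim(iris), n_missing, numeric.mean)

# The metadata columns are retained even though they were not requested.
names(focused)

# The result is still a skim_df.
is_skim_df(focused)
```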

Rendering the results of skim()

The skim_df object is a wide data frame. The display is created by default using print.skim_df(); users can specify additional options by explicitly calling print([skim_df object], ...).
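For instance, an explicit print() call can suppress the summary block at the top of the output. The sketch below assumes the include_summary argument of print.skim_df() as found in recent skimr 2.x releases; check ?print.skim_df for the options in your installed version.

```r
library(skimr)

skimmed <- skim(iris)

# Default printing: data summary first, then one table per variable type.
print(skimmed)

# Assumed option: drop the summary block and keep only the per-type tables.
print(skimmed, include_summary = FALSE)
```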

For documents rendered by knitr, the package provides a custom knit_print method. To use it, the final line of your code chunk should have a skim_df object.

skim(Orange)

The same type of rendering is available from reshaped skim_df objects, those generated by partition() and yank() in particular.

skim(Orange) %>%
  yank("numeric")

Customizing print options

Although it's not a common use case outside of writing vignettes about skimr, you can fall back to default printing methods by adding the chunk option render = knitr::normal_print.

You can also disable the skimr summary by setting the chunk option skimr_include_summary = FALSE.

You can change the number of digits shown in the columns of generated statistics by changing the skimr_digits chunk option.

Modifying skim()

skimr is opinionated in its choice of defaults, but users can easily add, replace, or remove the statistics for a class. For interactive use, you can create your own skimming function with the skim_with() factory. skimr also has an API for extensions in other packages. Working with that is covered later.

To add a statistic for a data type, create an sfl() (a skimr function list) for each class that you want to change:

my_skim <- skim_with(numeric = sfl(new_mad = mad))
my_skim(faithful)

As the previous example suggests, the default is to append new summary statistics to the preexisting set. This behavior isn't always desirable, especially when you want lots of changes. To stop appending, set append = FALSE.

my_skim <- skim_with(numeric = sfl(new_mad = mad), append = FALSE)
my_skim(faithful)
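The difference between the two modes is easiest to see in the column names of the results. This sketch builds both variants side by side; appended and replaced are illustrative names.

```r
library(skimr)

appended <- skim_with(numeric = sfl(new_mad = mad))
replaced <- skim_with(numeric = sfl(new_mad = mad), append = FALSE)

# With append = TRUE (the default) the new statistic joins the default ones.
names(appended(faithful))

# With append = FALSE only the base skimmers and the new statistic remain.
names(replaced(faithful))
```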

You can also use skim_with() to remove specific statistics by setting them to NULL. This is commonly used to disable the inline histograms and spark graphs.

no_hist <- skim_with(ts = sfl(line_graph = NULL))
no_hist(Nile)

The same pattern applies to changing skimmers for multiple classes simultaneously. If you want to partially-apply function arguments, use the Tidyverse lambda syntax.

my_skim <- skim_with(
  numeric = sfl(total = ~ sum(., na.rm = TRUE)),
  factor = sfl(missing = ~ sum(is.na(.))),
  append = FALSE
)

my_skim(iris)

To modify the "base" skimmers, refer to them in a similar manner. Because base skimmers are a small group and must return the same type for every data type in R, append doesn't apply here.

my_skim <- skim_with(base = sfl(length = length))
my_skim(faithful)

Extending skimr

Packages may wish to export their own skim() functions. Use skim_with() for this. In fact, this is how skimr generates its version of skim().

#' @export
my_package_skim <- skim_with()

Alternatively, defaults for other data types can be added to skimr with the get_skimmers generic. The method for your data type should return an sfl(). Unlike the sfl() used interactively, you also need to set the skim_type argument. It should match the method type in the function signature.

get_skimmers.my_type <- function(column) {
  sfl(
    skim_type = "my_type",
    total = sum
  )
}

my_data <- data.frame(
  my_type = structure(1:3, class = c("my_type", "integer"))
)
skim(my_data)

An extended example is available in the vignette Supporting additional objects.

Solutions to common rendering problems

The details of rendering are dependent on the operating system R is running on, the locale of the installation, and the fonts installed. Rendering may also differ based on whether it occurs in the console or when knitting to specific types of documents such as HTML and PDF.

The most commonly reported problems involve rendering the spark graphs (inline histogram and line chart) on Windows. One common fix is to switch your locale. The function fix_windows_histograms() does this for you.

To render the spark graphs (histograms and line charts) in HTML or PDF output, you may need to switch to a font that supports block characters or Braille, depending on which you need. Please review the separate vignette and associated template for details.

skimr documentation built on Dec. 28, 2022, 2:45 a.m.