skimr is designed to provide summary statistics about variables in data frames,
tibbles, data tables and vectors. It is opinionated in its defaults, but easy to modify.
In base R, the most similar functions are summary() for vectors and data
frames and fivenum() for numeric vectors:
summary(iris)
summary(iris$Sepal.Length)
fivenum(iris$Sepal.Length)
summary(iris$Species)
skim() function
The core function of skimr is skim(), which is designed to work with
(grouped) data frames, and will try to coerce other objects to data frames
if possible. Like summary(), skim()'s method for data frames presents
results for every column; the statistics it provides depend on the class of
the variable.
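To see which statistics apply to which class, the default skimmers can be listed by name. This is a minimal sketch that assumes skimr v2 is installed and that its exported helper get_default_skimmer_names() is available in your version.

```r
library(skimr)

# Named list: one character vector of statistic names per supported class.
defaults <- get_default_skimmer_names()

# Statistics computed for numeric columns versus factor columns.
defaults$numeric
defaults$factor
```

The names returned here are the unprefixed statistic names; in the skim_df they appear with the class prefix, for example numeric.mean.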
By design, the main focus of skimr is on data frames; it is intended to fit
well within a data pipeline and relies extensively on tidyverse vocabulary, which
focuses on data frames.
Results of skim() are printed horizontally, with one section per variable
type and one row per variable.
library(skimr)
skim(iris)
The result is a single wide data frame combining the results, with some additional attributes and two metadata columns:
skim_variable: name of the original variable
skim_type: class of the variable
Unlike many other objects within R, these columns are intrinsic to the
skim_df class. Dropping these variables will result in a coercion to a
tibble. The is_skim_df() function is used to assert that an object is
a skim_df.
skim(iris) %>% is_skim_df()
skim(iris) %>% dplyr::select(-skim_type, -skim_variable) %>% is_skim_df()
skim(iris) %>% dplyr::select(-n_missing) %>% is_skim_df()
In order to avoid type coercion, columns for summary statistics for different
types are prefixed with the corresponding skim_type. This means that the
columns of the skim_df are somewhat sparse, with quite a few missing
values. This is because for some statistics the representations for different
types of variables are different. For example, the mean of a Date variable and
of a numeric variable are represented differently when printing, but this
cannot be supported in a single vector. The exceptions are n_missing and
complete_rate (the proportion of non-missing observations), which are the
same for all types of variables.
skim(iris) %>% tibble::as_tibble()
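The two points above, sparse type-prefixed columns and shared base columns, can be checked directly. This is a small sketch assuming skimr v2; airquality is used only because it ships with R and contains missing values.

```r
library(skimr)

# Sparse columns: the factor column Species has no numeric.mean, so the cell is NA.
wide <- tibble::as_tibble(skim(iris))
is.na(wide$numeric.mean[wide$skim_variable == "Species"])

# Base columns: recompute the complete rate by hand as
# 1 - (number missing / number of observations).
sk <- skim(airquality)
by_hand <- 1 - sk$n_missing / nrow(airquality)
all.equal(sk$complete_rate, by_hand)
```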
This is in contrast to summary.data.frame(), which stores statistics in a
table. The distinction is important, because the skim_df object is pipeable
and easy to use for additional manipulation: for example, the user could select
all of the variable means, or all summary statistics for a specific variable.
skim(iris) %>% dplyr::filter(skim_variable == "Petal.Length")
Most dplyr verbs should work as expected.
skim(iris) %>% dplyr::select(skim_type, skim_variable, n_missing)
The base skimmers n_missing and complete_rate are computed for all of the
columns in the data. All other type-based skimmers have a namespace: you
need to use a skim_type prefix to refer to the correct column.
skim(iris) %>% dplyr::select(skim_type, skim_variable, numeric.mean)
skim() also supports grouped data created by dplyr::group_by().
In this case, one additional column for each grouping variable is added
to the skim_df object.
iris %>% dplyr::group_by(Species) %>% skim()
Individual columns from a data frame may be selected using tidyverse-style selectors.
skim(iris, Sepal.Length, Species)
Or with common select helpers.
skim(iris, starts_with("Sepal"))
If an individual column is of an unsupported class, it is treated as a character variable with a warning.
In skimr v2, skim() will attempt to coerce non-data frames (such as vectors
and matrices) to data frames. In most cases with vectors, the object being
evaluated should be equivalent to wrapping the object in as.data.frame().
For example, the lynx data set is class ts.
skim(lynx)
Which is the same as coercing to a data frame.
all.equal(skim(lynx), skim(as.data.frame(lynx)))
skimr does not support skimming matrices directly but coerces them to data
frames. Columns in the matrix become variables. This behavior is similar to
summary.matrix(). Three possible ways to handle matrices with skim()
parallel the three variations of the mean function for matrices.
m <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), nrow = 4, ncol = 3)
m
Skimming the matrix produces similar results to colMeans().
colMeans(m)
skim(m) # Similar to summary.matrix and colMeans()
Skimming the transpose of the matrix will give row-wise results.
rowMeans(m)
skim(t(m))
And call c() on the matrix to get results across all columns.
skim(c(m))
mean(m)
skim_tee() produces the same printed version as skim() but returns the
original, unmodified data frame. This allows for continued piping of the
original data.
iris_setosa <- iris %>%
  skim_tee() %>%
  dplyr::filter(Species == "setosa")
head(iris_setosa)
Note that skim_tee() is customized differently than skim() itself. See below
for more details.
skim()
As noted above, skim() returns a wide data frame. This is usually the most
sensible format for the majority of operations when investigating data, but
the package has some other functions to help with edge cases.
First, partition() returns a named list of the wide data frames for each data
type. Unlike the original data, the partitioned data only has columns
corresponding to the skimming functions used for this data type. These data
frames are, therefore, not skim_df objects.
iris %>% skim() %>% partition()
Alternatively, yank() selects only the subtable for a specific type. Think of
it like dplyr::select() on column types in the original data. Again, unsuitable
columns are dropped.
iris %>% skim() %>% yank("numeric")
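The relationship between partition() and yank() can be sketched as follows, assuming skimr v2. The variable names (skimmed, parts, by_yank) are illustrative only.

```r
library(skimr)

skimmed <- skim(iris)

# partition() returns one data frame per skim type, named by that type.
parts <- partition(skimmed)
names(parts)

# yank() pulls out a single one of those subtables.
by_yank <- yank(skimmed, "numeric")

# Both routes give the same set of columns for the numeric subtable.
identical(names(parts$numeric), names(by_yank))
```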
to_long() returns a single long data frame with columns variable, type,
statistic and formatted. This is similar but not identical to the skim_df
object in skimr v1.
iris %>% skim() %>% to_long() %>% head()
Since the skim_variable and skim_type columns are a core component of the
skim_df class, it's possible to get unwanted side effects when using
dplyr::select(). Instead, use focus() to select columns of the skimmed
results and keep them as a skim_df; it always keeps the metadata columns.
iris %>% skim() %>% focus(n_missing, numeric.mean)
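The difference between the two approaches shows up when checking the class afterwards. This sketch assumes skimr v2 with dplyr installed; the object names are illustrative.

```r
library(skimr)

skimmed <- skim(iris)

# focus() retains skim_variable and skim_type, so the result is still a skim_df.
focused <- focus(skimmed, n_missing, numeric.mean)
is_skim_df(focused)

# dplyr::select() without the metadata columns drops them, and the class with them.
selected <- dplyr::select(skimmed, n_missing, numeric.mean)
is_skim_df(selected)
```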
skim()
The skim_df object is a wide data frame. The display is
created by default using print.skim_df(); users can specify additional
options by explicitly calling print([skim_df object], ...).
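As an illustration of an explicit print() call, the sketch below suppresses the summary block. It assumes that print.skim_df() in your installed skimr version accepts an include_summary argument; check ?print.skim_df before relying on it.

```r
library(skimr)

skimmed <- skim(Orange)

# Default display: data summary followed by one section per variable type.
print(skimmed)

# Assumed option: drop the data summary and show only the per-type sections.
print(skimmed, include_summary = FALSE)
```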
For documents rendered by knitr, the package provides a custom knit_print
method. To use it, the final line of your code chunk should have a skim_df
object.
skim(Orange)
The same type of rendering is available from reshaped skim_df objects, those
generated by partition() and yank() in particular.
skim(Orange) %>% yank("numeric")
Although it's not a common use case outside of writing vignettes about skimr,
you can fall back to default printing methods by adding the chunk option
render = knitr::normal_print.
You can also disable the skimr summary by setting the chunk option
skimr_include_summary = FALSE.
You can change the number of digits shown in the columns of generated statistics
by changing the skimr_digits chunk option.
skim()
skimr is opinionated in its choice of defaults, but users can easily add,
replace, or remove the statistics for a class. For interactive use, you can
create your own skimming function with the skim_with() factory. skimr also
has an API for extensions in other packages. Working with that is covered later.
To add a statistic for a data type, create an sfl() (a skimr function list)
for each class that you want to change:
my_skim <- skim_with(numeric = sfl(new_mad = mad))
my_skim(faithful)
As the previous example suggests, the default is to append new summary
statistics to the preexisting set. This behavior isn't always desirable,
especially when you want lots of changes. To stop appending, set
append = FALSE.
my_skim <- skim_with(numeric = sfl(new_mad = mad), append = FALSE)
my_skim(faithful)
You can also use skim_with() to remove specific statistics by setting them to
NULL. This is commonly used to disable the inline histograms and spark graphs.
no_hist <- skim_with(ts = sfl(line_graph = NULL))
no_hist(Nile)
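The same approach applies to the inline histogram for numeric columns. This sketch assumes the default numeric skimmer is named hist, as in skimr v2; skim_no_hist is an illustrative name.

```r
library(skimr)

# Build a skimming function that keeps every numeric default except the histogram.
skim_no_hist <- skim_with(numeric = sfl(hist = NULL))

result <- skim_no_hist(iris)

# The histogram column is gone; the other numeric statistics remain.
"numeric.hist" %in% names(result)
"numeric.mean" %in% names(result)
```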
The same pattern applies to changing skimmers for multiple classes simultaneously. If you want to partially-apply function arguments, use the Tidyverse lambda syntax.
my_skim <- skim_with(
  numeric = sfl(total = ~ sum(., na.rm = TRUE)),
  factor = sfl(missing = ~ sum(is.na(.))),
  append = FALSE
)
my_skim(iris)
To modify the "base" skimmers, refer to them in a similar manner. Because base
skimmers are usually a small group and must return the same type for all
data types in R, append doesn't apply here.
my_skim <- skim_with(base = sfl(length = length))
my_skim(faithful)
skimr
Packages may wish to export their own skim() functions. Use skim_with() for
this. In fact, this is how skimr generates its version of skim().
#' @export
my_package_skim <- skim_with()
Alternatively, defaults for other data types can be added to skimr with the
get_skimmers generic. The method for your data type should return an sfl().
Unlike the sfl() used interactively, you also need to set the skim_type
argument. It should match the method type in the function signature.
get_skimmers.my_type <- function(column) {
  sfl(
    skim_type = "my_type",
    total = sum
  )
}

my_data <- data.frame(
  my_type = structure(1:3, class = c("my_type", "integer"))
)
skim(my_data)
An extended example is available in the vignette Supporting additional objects.
The details of rendering are dependent on the operating system R is running on, the locale of the installation, and the fonts installed. Rendering may also differ based on whether it occurs in the console or when knitting to specific types of documents such as HTML and PDF.
The most commonly reported problems involve rendering the spark graphs (inline
histogram and line chart) on Windows. One common fix is to switch your locale. The
function fix_windows_histograms() does this for you.
To render the spark graphs (histograms and line graphs) in HTML or PDF, you may need to change to a font that supports block or Braille characters (depending on which you need). Please review the separate vignette and associated template for details.