spread_draws: Extract draws of variables in a Bayesian model fit into a...
In tidybayes: Tidy Data and 'Geoms' for Bayesian Models

gather_draws

R Documentation

Extract draws of variables in a Bayesian model fit into a tidy data format

Description

Extract draws from a Bayesian model for one or more variables (possibly with named dimensions) into one of two types of long-format data frames.

Usage

gather_draws(
  model,
  ...,
  regex = FALSE,
  sep = "[, ]",
  ndraws = NULL,
  seed = NULL,
  draw_indices = c(".chain", ".iteration", ".draw"),
  n
)

spread_draws(
  model,
  ...,
  regex = FALSE,
  sep = "[, ]",
  ndraws = NULL,
  seed = NULL,
  draw_indices = c(".chain", ".iteration", ".draw"),
  n
)

Arguments

`model`	A supported Bayesian model fit. Tidybayes supports a variety of model objects; for a full list of supported models, see tidybayes-models.
`...`	Expressions in the form of `variable_name[dimension_1, dimension_2, ...] \| wide_dimension`. See Details.
`regex`	If `TRUE`, variable names are treated as regular expressions and all columns matching the regular expression and number of dimensions are included in the output. Default `FALSE`.
`sep`	Separator used to separate dimensions in variable names, as a regular expression.
`ndraws`	The number of draws to return, or `NULL` to return all draws.
`seed`	A seed to use when subsampling draws (i.e. when `ndraws` is not `NULL`).
`draw_indices`	Character vector of column names that should be treated as indices of draws. Operations are done within combinations of these values. The default is `c(".chain", ".iteration", ".draw")`, which are the same names used for chain, iteration, and draw indices returned by `tidy_draws()`. Names in `draw_indices` that are not found in the data are ignored.
`n`	(Deprecated). Use `ndraws`.
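
As an illustration of `ndraws` and `seed`, the sketch below subsamples the `RankCorr` example fit from ggdist (the same fit used in the Examples section). The object names `sub1` and `sub2` and the seed value are arbitrary choices made for this sketch.

```r
library(dplyr)
library(tidybayes)

data(RankCorr, package = "ggdist")

# Keep 10 randomly chosen draws; fixing the seed makes the subsample reproducible.
sub1 <- RankCorr %>% spread_draws(tau[i], ndraws = 10, seed = 1234)
sub2 <- RankCorr %>% spread_draws(tau[i], ndraws = 10, seed = 1234)

# Number of distinct draws retained
n_distinct(sub1$.draw)

# Same seed, same subsample
identical(sub1$.draw, sub2$.draw)
```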

Details

Imagine a JAGS or Stan fit named model. The model may contain a variable named b[i,v] (in the JAGS or Stan language) with dimension i in 1:100 and dimension v in 1:3. However, the default format for draws returned from JAGS or Stan in R will not reflect this indexing structure; instead, the draws will have multiple columns with names like "b[1,1]", "b[2,1]", etc.

spread_draws and gather_draws provide a straightforward syntax to translate these columns back into properly-indexed variables in two different tidy data frame formats, optionally recovering dimension types (e.g. factor levels) as they do so.

spread_draws and gather_draws return data frames already grouped by all dimensions used on the variables you specify.

The difference between the two is that spread_draws spreads the names of variables in the model across the data frame as column names, whereas gather_draws gathers variables into a single column named ".variable" and places their values into a column named ".value". To use naming schemes from other packages (such as broom), consider passing results through functions like to_broom_names() or to_ggmcmc_names().

For example, spread_draws(model, a[i], b[i,v]) might return a grouped data frame (grouped by i and v), with:

column ".chain": the chain number. NA if not applicable to the model type; this is typically only applicable to MCMC algorithms.
column ".iteration": the iteration number. Guaranteed to be unique within-chain only. NA if not applicable to the model type; this is typically only applicable to MCMC algorithms.
column ".draw": a unique number for each draw from the posterior. Order is not guaranteed to be meaningful.
column "i": value in 1:100
column "v": value in 1:3
column "a": value of "a[i]" for draw ".draw"
column "b": value of "b[i,v]" for draw ".draw"

gather_draws(model, a[i], b[i,v]) on the same model would return a grouped data frame (grouped by i and v), with:

column ".chain": the chain number
column ".iteration": the iteration number
column ".draw": the draw number
column "i": value in 1:100
column "v": value in 1:3, or NA if ".variable" is "a".
column ".variable": value in c("a", "b").
column ".value": value of "a[i]" (when ".variable" is "a") or "b[i,v]" (when ".variable" is "b") for draw ".draw"
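
To make the two layouts concrete, the following sketch applies both functions to the `RankCorr` example fit from ggdist, using the `tau[i]` and `u_tau[i]` variables that also appear in the Examples section. The names `wide` and `long` are introduced here only for illustration.

```r
library(dplyr)
library(tidybayes)

data(RankCorr, package = "ggdist")

# One column per model variable
wide <- RankCorr %>% spread_draws(tau[i], u_tau[i])

# Variable names stacked in ".variable", values in ".value"
long <- RankCorr %>% gather_draws(tau[i], u_tau[i])

names(wide)
names(long)

# Both results come back grouped; inspect the grouping columns
group_vars(wide)
group_vars(long)
```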

spread_draws and gather_draws can use type information applied to the model object by recover_types() to convert columns back into their original types. This is particularly helpful if some of the dimensions in your model were originally factors. For example, if the v dimension in the original data frame data was a factor with levels c("a","b","c"), then we could use recover_types before spread_draws:

model %>%
 recover_types(data) %>%
 spread_draws(b[i,v])

Which would return the same data frame as above, except the "v" column would be a value in c("a","b","c") instead of 1:3.
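
A runnable sketch of the same idea on the `RankCorr` example fit follows. The data frame `original_data` and its `level_*` labels are hypothetical stand-ins for the data a model would have been fit to; they are constructed here only so that the `i` index has a factor prototype to recover.

```r
library(dplyr)
library(tidybayes)

data(RankCorr, package = "ggdist")

# Number of values taken by the i index, read off the fit itself
n_i <- RankCorr %>% spread_draws(tau[i]) %>% pull(i) %>% max()

# Hypothetical original data: a factor column named after the index
labels <- paste0("level_", seq_len(n_i))
original_data <- data.frame(i = factor(labels, levels = labels))

typed <- RankCorr %>%
  recover_types(original_data) %>%
  spread_draws(tau[i])

# i is now a factor with the labels from original_data
levels(typed$i)
```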

For variables that do not share the same subscripts (or share some but not all subscripts), we can supply their specifications separately. For example, if we have a variable d[i] with the same i subscript as b[i,v], and a variable x with no subscripts, we could do this:

spread_draws(model, x, d[i], b[i,v])

Which is roughly equivalent to this:

spread_draws(model, x) %>%
 inner_join(spread_draws(model, d[i])) %>%
 inner_join(spread_draws(model, b[i,v])) %>%
 group_by(i,v)

Similarly, this:

gather_draws(model, x, d[i], b[i,v])

Is roughly equivalent to this:

bind_rows(
 gather_draws(model, x),
 gather_draws(model, d[i]),
 gather_draws(model, b[i,v])
)
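
The rough equivalence can be checked on the `RankCorr` example fit, which has a scalar variable `typical_r` and an indexed variable `tau[i]`. In this sketch the pieces are ungrouped before stacking so that the comparison concerns rows only; `combined` and `stacked` are names chosen for illustration.

```r
library(dplyr)
library(tidybayes)

data(RankCorr, package = "ggdist")

# One call with both specifications
combined <- RankCorr %>% gather_draws(typical_r, tau[i])

# Separate calls, stacked by hand
stacked <- bind_rows(
  RankCorr %>% gather_draws(typical_r) %>% ungroup(),
  RankCorr %>% gather_draws(tau[i]) %>% ungroup()
)

nrow(combined) == nrow(stacked)
sort(unique(combined$.variable))
```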

The c and cbind functions can be used to combine multiple variable names that have the same dimensions. For example, if we have several variables with the same subscripts i and v, we could do either of these:

spread_draws(model, c(w, x, y, z)[i,v])

spread_draws(model, cbind(w, x, y, z)[i,v])  # equivalent

Each of which is roughly equivalent to this:

spread_draws(model, w[i,v], x[i,v], y[i,v], z[i,v])

Besides being more compact, the c()-style syntax is currently also faster (though that may change).
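
As a small check of the c()-style syntax, this sketch compares it with separate specifications on the `RankCorr` example fit, assuming (as in the Examples section) that `tau` and `u_tau` are both indexed by `i`. The names `compact` and `verbose` exist only in this sketch.

```r
library(dplyr)
library(tidybayes)

data(RankCorr, package = "ggdist")

# Compact form: variables sharing the same subscripts
compact <- RankCorr %>% spread_draws(c(tau, u_tau)[i])

# Separate specifications for the same variables
verbose <- RankCorr %>% spread_draws(tau[i], u_tau[i])

# Both results carry the same set of columns
setequal(names(compact), names(verbose))
```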

Dimensions can be omitted from the resulting data frame by leaving their names blank; e.g. spread_draws(model, b[,v]) will omit the first dimension of b from the output. This is useful if a dimension is known to contain all the same value in a given model.

The shorthand .. can be used to specify one column that should be put into a wide format and whose names will be the base variable name, plus a dot ("."), plus the value of the dimension at the position of "..". For example:

spread_draws(model, b[i,..]) would return a grouped data frame (grouped by i), with:

column ".chain": the chain number
column ".iteration": the iteration number
column ".draw": the draw number
column "i": value in 1:100
column "b.1": value of "b[i,1]" for draw ".draw"
column "b.2": value of "b[i,2]" for draw ".draw"
column "b.3": value of "b[i,3]" for draw ".draw"

An optional clause in the form | wide_dimension can also be used to put the data frame into a wide format based on wide_dimension. For example, this:

spread_draws(model, b[i,v] | v)

is roughly equivalent to this:

spread_draws(model, b[i,v]) %>% spread(v,b)

The main difference between the | syntax and the .. syntax is that the | syntax respects prototypes applied to dimensions with recover_types(), and thus can be used to get columns with nicer names. For example:

model %>% recover_types(data) %>% spread_draws(b[i,v] | v)

would return a grouped data frame (grouped by i), with:

column ".chain": the chain number
column ".iteration": the iteration number
column ".draw": the draw number
column "i": value in 1:100
column "a": value of "b[i,1]" for draw ".draw"
column "b": value of "b[i,2]" for draw ".draw"
column "c": value of "b[i,3]" for draw ".draw"
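
The two wide formats can be compared side by side on the `RankCorr` example fit, whose variable `b` has two dimensions (called `i` and `j` in the Examples section). No prototypes are applied in this sketch, so the columns produced by the | syntax are named after the raw index values; `dots` and `bar` are illustrative names.

```r
library(dplyr)
library(tidybayes)

data(RankCorr, package = "ggdist")

# ".." shorthand: wide columns named b.<index value>
dots <- RankCorr %>% spread_draws(b[i, ..])

# "|" syntax: wide columns named after the values of j
bar <- RankCorr %>% spread_draws(b[i, j] | j)

names(dots)
names(bar)
```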

The shorthand . can be used to specify columns that should be nested into vectors, matrices, or n-dimensional arrays (depending on how many dimensions are specified with .).

For example, spread_draws(model, a[.], b[.,.]) might return a data frame, with:

column ".chain": the chain number.
column ".iteration": the iteration number.
column ".draw": a unique number for each draw from the posterior.
column "a": a list column of vectors.
column "b": a list column of matrices.

Ragged arrays are turned into non-ragged arrays with missing entries given the value NA.
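
A brief sketch of the nested format on the `RankCorr` example fit follows, using its one-dimensional variable `tau` and two-dimensional variable `b`. The name `nested` is introduced only for this sketch.

```r
library(dplyr)
library(tidybayes)

data(RankCorr, package = "ggdist")

# One row per draw; tau and b become list columns
nested <- RankCorr %>% spread_draws(tau[.], b[., .])

# A numeric vector for the first draw of tau
nested$tau[[1]]

# A matrix for the first draw of b
dim(nested$b[[1]])
```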

Finally, variable names can be regular expressions by setting regex = TRUE; e.g.:

spread_draws(model, `b_.*`[i], regex = TRUE)

Would return a tidy data frame with variables starting with b_ and having one dimension.
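
For a runnable variant, the sketch below uses the `RankCorr` example fit, where the pattern `.*tau` is intended to pick up the one-dimensional variables `tau` and `u_tau`. The pattern and the name `matched` are choices made for this sketch rather than part of the package documentation.

```r
library(dplyr)
library(tidybayes)

data(RankCorr, package = "ggdist")

# Match every one-dimensional variable whose name ends in "tau"
matched <- RankCorr %>% spread_draws(`.*tau`[i], regex = TRUE)

names(matched)
```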

Value

A data frame.

Author(s)

Matthew Kay

Examples


library(dplyr)
library(ggplot2)
library(tidybayes)

data(RankCorr, package = "ggdist")

RankCorr %>%
  spread_draws(b[i, j])

RankCorr %>%
  spread_draws(b[i, j], tau[i], u_tau[i])


RankCorr %>%
  gather_draws(b[i, j], tau[i], u_tau[i])

RankCorr %>%
  gather_draws(tau[i], typical_r) %>%
  median_qi()

tidybayes documentation built on Sept. 15, 2024, 9:08 a.m.