ulam: Build RStan models from formulas

View source: R/ulam-function.R

ulamR Documentation

Build RStan models from formulas

Description

Compiles lists of formulas into Stan model code. Allows for arbitary fixed effect and mixed effect regressions. Allows for explicit typing of variables within the formula list. Much more flexible than map2stan.

Usage

ulam( flist , data , pars , pars_omit , start , chains=1 , cores=1 , iter=1000 , 
  control=list(adapt_delta=0.95) , distribution_library=ulam_dists , 
  macro_library=ulam_macros , custom , declare_all_data=TRUE , log_lik=FALSE , 
  sample=TRUE , messages=TRUE , pre_scan_data=TRUE , coerce_int=TRUE , 
  sample_prior=FALSE , file=NULL , cmdstan=FALSE , threads=1 , 
  stanc_options=list("O1") , ... )

Arguments

flist

A formula or list of formulas that define the likelihood and priors. Can also pass in a quap or a previous ulam model fit. See details.

data

A data frame or list containing the data

pars

Optional: character vector of parameters to return samples for

pars_omit

Optional: character vector of parameters to exclude from samples

start

Optional. Either: (1) a named list specifying parameters and their initial values or (2) a function to return such a named list.

chains

Number of independent chains to sample from

cores

Number of processor cores to distribute chains over.

iter

Number of iterations of sampling. By default, half of these iterations are warmup.

control

Optional list of control parameters for stan. Default increases target acceptance rate (adapt_delta) to 0.95.

distribution_library

List of distribution templates.

macro_library

List of function and distribution macros.

custom

Optional list of custom resources. See details and examples.

declare_all_data

When TRUE, all variables in the data list are declared in the Stan model code. When FALSE, only used variables are declared.

log_lik

Return log likelihood of each observation in samples. Used for calculating WAIC and LOO.

sample

If FALSE, builds Stan code without sampling

messages

Show various warnings and informational messages

pre_scan_data

Scan data at start to (1) check for variables that are integer but not type integer and (2) strip any scale() attributes

coerce_int

If pre_scan_data, forces integer variables to be type integer

sample_prior

If TRUE, removes data probabilities from model to sample only prior distribution of parameters. Used by corresponding extract.prior method.

messages

If TRUE, prints various internal steps to help with debugging

file

If character string X, loads X.rds instead of fitting when exists; otherwise saves result to X.rds

cmdstan

When TRUE, uses cmdstanr instead of rstan to run chains. To make cmdstan the default engine, use set_ulam_cmdstan(TRUE).

threads

When threads > 1, attempts to multithread individual chains using Stan's reduce_sum function. Requires cmdstan=TRUE.

...

Additional arguments to pass to stan

Details

ulam provides a slim version of Stan syntax that omits blocks but still allows for explicit variable types and dimensions. The basic model formula is the same as map2stan, but the syntax does not assume a GLMM structure. This allows it to be more flexible, but it also means more mistakes are possible. With great power comes great responsibility.

The function of a model formula is to related the variables to one another. There are three types of variables: (1) data, (2) parameters, and (3) local variables. Model formulas are composed of multiple lines. Each line defines a variable using either a distributional assumption like:

y ~ normal( mu , sigma )

or rather a determinsitic assignment like:

mu <- a + b*x

The basic structure of such a definition is:

type[dimension]: name[dimension] ~ distribution( arguments )

The type delcaration is optional. So the most basic defintion can be just y ~ bernoulli(theta), but a full declariation can be more detailed, when necessary. The examples below show how matrix variables can be defined in this syntax.

For determinstic assignments like:

mu <- a + b*x

It is also possible to use a control word to specify how the values are returned. Using save> returns the values in the posterior samples. For example:

save> mu <- a + b*x

will return mu for each case and posterior sample. This works by duplicating the code in both the model block, where it is used to compute the log-probability, and in generated quantities.

It is also possible to use gq to evaluate the assignment only after sampling, in Stan's generated quantities block. This is useful for derived values that are not needed in computing the posterior but may be useful afterwards. For example, constrasts could be calculated this way. In the examples, the line:

gq> bp_diff <- bp[1] - bp[2]

is used to calculate the posterior distribution of the difference between the two parameters. The code is added to Stan's generated quantities, so that it doesn't slow down the model block.

The control tag transpars can be used to place an assignment in Stan's transformed parameters block. Keep in mind that any other intermediate calculations must also be placed in the same block. Finally, transdata places the assignment in the transformed data block. This means it will only execute once, before sampling begins. But it also means that the values will not be available post-sampling for helper functions like link. As such, it will usually be better to transform data before passing it into ulam.

When cmdstan=TRUE, the cmdstanr package will be used instead rstan to compile and sample from models. This is generally superior, as more recent versions of Stan can be used this way. But some features are not yet implemented, such as passing custom inits to the chains. To make cmdstan the default engine, use set_ulam_cmdstan(TRUE). You can then ignore the cmdstan argument when calling ulam.

The use of cmdstan=TRUE is also the only way to currently use multi-threading of individual chains, using the threads argument. When cmdstan=TRUE and threads is set greater than 1, ulam will try to recode the model so that each chain is spread over multiple cores. This can easily halve sampling time. At the moment, this only works for models with a single outcome variable.

Methods are defined for extract.samples, extract.prior, link, sim, compare, coef, summary, logLik, lppd, vcov, nobs, deviance, WAIC, PSIS, plot, traceplot, trankplot, pairs, and show.

Value

Returns an object of class ulam with the following slots.

call

The function call

model

Stan model code

stanfit

stanfit object returned by stan

coef

The posterior means

vcov

k-by-1 matrix containing the variance of each of k variables in posterior

data

The data

start

List of starting values that were used in sampling

pars

Parameter names monitored in samples

formula

Formula list from call

formula_parsed

List of parsed formula information. Useful mainly for debugging. Needed by helper functions.

Author(s)

Richard McElreath

See Also

quap, map2stan, stan

Examples

## Not run: 
library(rethinking)
data(chimpanzees)

# don't want any variables with NAs
# also recode condition to an index {1,0} -> {1,2}
d <- list( 
    pulled_left = chimpanzees$pulled_left ,
    prosoc_left = chimpanzees$prosoc_left ,
    condition = as.integer( 2 - chimpanzees$condition ) ,
    actor = as.integer( chimpanzees$actor ) ,
    blockid = as.integer( chimpanzees$block )
)

# simple logistic regression
m1 <- ulam(
    alist(
        pulled_left ~ bernoulli(theta),
        logit(theta) <- a + bp[condition]*prosoc_left  ,
        a ~ normal(0,4),
        bp[condition] ~ normal(0,1)
    ) ,
    data=d, chains=2, cores=1 , sample=TRUE )

precis(m1,depth=2)
plot(m1,depth=2)
pairs(m1)

# same model, but save theta so it is return in samples
# note 'save>' in second line of formula
m1b <- ulam(
    alist(
        pulled_left ~ bernoulli(theta),
        save> logit(theta) <- a + bp[condition]*prosoc_left  ,
        a ~ normal(0,4),
        bp[condition] ~ normal(0,1)
    ) ,
    data=d, chains=2, cores=1 , sample=TRUE )

# same model, but use gq to compute contrast between conditions
# note that order does matter. bp_diff should come before bp[] is defined
m1c <- ulam(
    alist(
        pulled_left ~ bernoulli(theta),
        logit(theta) <- a + bp[condition]*prosoc_left  ,
        gq> bp_diff <- bp[1] - bp[2],
        a ~ normal(0,4),
        bp[condition] ~ normal(0,1)
    ) ,
    data=d, chains=2, cores=1 , sample=TRUE )

# can also transform data inside model, using transdata> tag.
# this is more efficient, because it only evaluates once, not during sampling.
# for example, this constructs prosoc_right variable:
m1d <- ulam(
    alist(
        pulled_left ~ bernoulli(theta),
        logit(theta) <- a + bp[condition]*prosoc_right  ,
        transdata> prosoc_right <- 1 - prosoc_left,
        a ~ normal(0,4),
        bp[condition] ~ normal(0,1)
    ) ,
    data=d, chains=2, cores=1 , sample=TRUE )

# now model with varying intercepts on actor
m2 <- ulam(
    alist(
        pulled_left ~ bernoulli(theta),
        logit(theta) <- a + aj[actor] + bp[condition]*prosoc_left,
        aj[actor] ~ normal( 0 , sigma_actor ),
        a ~ normal(0,4),
        bp[condition] ~ normal(0,1),
        sigma_actor ~ exponential(1)
    ) ,
    data=d, chains=2 , cores=1 , sample=TRUE )

precis(m2)
plot(m2)

# varying intercepts on actor and experimental block
m3 <- ulam(
    alist(
        pulled_left ~ bernoulli(theta),
        logit(theta) <- a + aj[actor] + ak[blockid] + bp[condition]*prosoc_left,
        aj[actor] ~ normal( 0 , sigma_actor ),
        ak[blockid] ~ normal( 0 , sigma_block ),
        a ~ dnorm(0,4),
        bp[condition] ~ dnorm(0,1),
        sigma_actor ~ exponential(1),
        sigma_block ~ exponential(1)
    ) ,
    data=d, chains=2 , cores=1 , sample=TRUE )

precis(m3)
summary(m3)
plot(m3)

###########
# varying slopes models

# varying slopes on actor
# also demonstrates use of multiple linear models
# see Chapter 13 for discussion
m3 <- ulam(
    alist(
        # likeliood
        pulled_left ~ bernoulli(theta),

        # linear models
        logit(theta) <- A + BP*prosoc_left,
        A <- a + v[actor,1],
        BP <- bp + v[actor,condition+1],

        # adaptive prior
        vector[3]: v[actor] ~ multi_normal( 0 , Rho_actor , sigma_actor ),

        # fixed priors
        c(a,bp) ~ normal(0,1),
        sigma_actor ~ exponential(1),
        Rho_actor ~ lkjcorr(4)
    ) , data=d , chains=3 , cores=1 , sample=TRUE )

# same model but with non-centered parameterization
# see Chapter 13 for explanation and more elaborate example
m4 <- ulam(
    alist(
        # likeliood
        pulled_left ~ bernoulli(theta),

        # linear models
        logit(theta) <- A + BP*prosoc_left,
        A <- a + v[actor,1],
        BP <- bp + v[actor,condition+1],

        # adaptive prior
        matrix[actor,3]: v <- compose_noncentered( sigma_actor , L_Rho_actor , z ),
        matrix[3,actor]: z ~ normal( 0 , 1 ),

        # fixed priors
        c(a,bp) ~ normal(0,1),
        vector[3]: sigma_actor ~ exponential(1),
        cholesky_factor_corr[3]: L_Rho_actor ~ lkj_corr_cholesky( 4 )
    ) , data=d , chains=3 , cores=1 , sample=TRUE )

# same as m5, but without hiding the construction of v
m5 <- ulam(
    alist(
        # likeliood
        pulled_left ~ bernoulli(theta),

        # linear models
        logit(theta) <- A + BP*prosoc_left,
        A <- a + v[actor,1],
        BP <- bp + v[actor,condition+1],

        # adaptive prior
        matrix[actor,3]: v <- t(diag_pre_multiply( sigma_actor , L_Rho_actor ) * z),
        matrix[3,actor]: z ~ normal( 0 , 1 ),

        # fixed priors
        c(a,bp,bpc) ~ normal(0,1),
        vector[3]: sigma_actor ~ exponential(1),
        cholesky_factor_corr[3]: L_Rho_actor ~ lkj_corr_cholesky( 4 )
    ) , data=d , chains=3 , cores=1 , sample=TRUE )


## End(Not run)

rmcelreath/rethinking documentation built on Sept. 18, 2023, 2:01 p.m.