stan: Fit a model using Stan

Description Usage Arguments Details Value Note References See Also Examples

View source: R/rstan.R

Description

Fit a model defined in the Stan modeling language and return the fitted result as an instance of stanfit.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
stan(file, model_name = "anon_model", model_code = "", 
  fit = NA, data = list(), pars = NA, chains = 4,
  iter = 2000, warmup = floor(iter/2), thin = 1, 
  init = "random", seed = sample.int(.Machine$integer.max, 1), 
  algorithm = c("NUTS", "HMC", "Fixed_param"), 
  control = NULL,
  sample_file = NULL, diagnostic_file = NULL, 
  save_dso = TRUE, 
  verbose = FALSE, include = TRUE,
  cores = getOption("mc.cores", 1L),
  open_progress = interactive() && !isatty(stdout()) &&
                  !identical(Sys.getenv("RSTUDIO"), "1"),
  ...,
  boost_lib = NULL,
  eigen_lib = NULL)

Arguments

file

A character string file name or a connection that R supports containing the text of a model specification in the Stan modeling language; a model may also be specified directly as a character string using parameter model_code or through a previous fit using parameter fit. When fit is specified, parameter file is ignored.

model_name

A character string naming the model; defaults to "anon_model". However, the model name would be derived from file or model_code (if model_code is the name of a character string object) if model_name is not specified.

model_code

A character string either containing the model definition or the name of a character string object in the workspace. This parameter is used only if parameter file is not specified. When fit is specified, the model compiled previously is used so specifying model_code is ignored.

fit

An instance of S4 class stanfit derived from a previous fit; defaults to NA. If fit is not NA, the compiled model associated with the fitted result is re-used; thus the time that would otherwise be spent recompiling the C++ code for the model can be saved.

data

A named list or environment providing the data for the model, or a character vector for all the names of objects used as data. See the Note section below.

pars

A vector of character strings specifying parameters of interest. The default is NA indicating all parameters in the model. If include = TRUE, only samples for parameters named in pars are stored in the fitted results. Conversely, if include = FALSE, samples for all parameters except those named in pars are stored in the fitted results.

chains

A positive integer specifying the number of Markov chains. The default is 4.

iter

A positive integer specifying the number of iterations for each chain (including warmup). The default is 2000.

warmup

A positive integer specifying the number of warmup (aka burnin) iterations per chain. If step-size adaptation is on (which it is by default), this also controls the number of iterations for which adaptation is run (and hence these warmup samples should not be used for inference). The number of warmup iterations should not be larger than iter and the default is iter/2.

thin

A positive integer specifying the period for saving samples. The default is 1, which is usually the recommended value.

init

Can be the digit 0, the strings "0" or "random", a function that returns a named list, or a list of named lists.

init="random" (default):

Let Stan generate random initial values for all parameters. The seed of the random number generator used by Stan can be specified via the seed argument. If the seed for Stan is fixed, the same initial values are used. The default is to randomly generate initial values between -2 and 2 on the unconstrained support. The optional additional parameter init_r can be set to some value other than 2 to change the range of the randomly generated inits.

init="0", init=0:

Initialize all parameters to zero on the unconstrained support.

inits via list:

Set inital values by providing a list equal in length to the number of chains. The elements of this list should themselves be named lists, where each of these named lists has the name of a parameter and is used to specify the initial values for that parameter for the corresponding chain.

inits via function:

Set initial values by providing a function that returns a list for specifying the initial values of parameters for a chain. The function can take an optional parameter chain_id through which the chain_id (if specified) or the integers from 1 to chains will be supplied to the function for generating initial values. See the Examples section below for examples of defining such functions and using a list of lists for specifying initial values.

When specifying initial values via a list or function, any parameters for which values are not specified will receive initial values generated as described in the init="random" description above.

seed

The seed for random number generation. The default is generated from 1 to the maximum integer supported by R on the machine. Even if multiple chains are used, only one seed is needed, with other chains having seeds derived from that of the first chain to avoid dependent samples. When a seed is specified by a number, as.integer will be applied to it. If as.integer produces NA, the seed is generated randomly. The seed can also be specified as a character string of digits, such as "12345", which is converted to integer.

algorithm

One of sampling algorithms that are implemented in Stan. Current options are "NUTS" (No-U-Turn sampler, Hoffman and Gelman 2011, Betancourt 2017), "HMC" (static HMC), or "Fixed_param". The default and preferred algorithm is "NUTS".

sample_file

An optional character string providing the name of a file. If specified the draws for all parameters and other saved quantities will be written to the file. If not provided, files are not created. When the folder specified is not writable, tempdir() is used. When there are multiple chains, an underscore and chain number are appended to the file name.

diagnostic_file

An optional character string providing the name of a file. If specified the diagnostics data for all parameters will be written to the file. If not provided, files are not created. When the folder specified is not writable, tempdir() is used. When there are multiple chains, an underscore and chain number are appended to the file name.

save_dso

Logical, with default TRUE, indicating whether the dynamic shared object (DSO) compiled from the C++ code for the model will be saved or not. If TRUE, we can draw samples from the same model in another R session using the saved DSO (i.e., without compiling the C++ code again). This parameter only takes effect if fit is not used; with fit defined, the DSO from the previous run is used. When save_dso=TRUE, the fitted object can be loaded from what is saved previously and used for sampling, if the compiling is done on the same platform, that is, same operating system and same architecture (32bits or 64bits).

verbose

TRUE or FALSE: flag indicating whether to print intermediate output from Stan on the console, which might be helpful for model debugging.

control

A named list of parameters to control the sampler's behavior. It defaults to NULL so all the default values are used. First, the following are adaptation parameters for sampling algorithms. These are parameters used in Stan with similar names here.

  • adapt_engaged (logical)

  • adapt_gamma (double, positive, defaults to 0.05)

  • adapt_delta (double, between 0 and 1, defaults to 0.8)

  • adapt_kappa (double, positive, defaults to 0.75)

  • adapt_t0 (double, positive, defaults to 10)

  • adapt_init_buffer (integer, positive, defaults to 75)

  • adapt_term_buffer (integer, positive, defaults to 50)

  • adapt_window (integer, positive, defaults to 25)

In addition, algorithm HMC (called 'static HMC' in Stan) and NUTS share the following parameters:

  • stepsize (double, positive)

  • stepsize_jitter (double, [0,1])

  • metric (string, one of "unit_e", "diag_e", "dense_e")

For algorithm HMC, we can also set

  • int_time (double, positive)

For algorithm NUTS, we can set

  • max_treedepth (integer, positive)

For test_grad mode, the following parameters can be set

  • epsilon (double, defaults to 1e-6)

  • error (double, defaults to 1e-6)

include

Logical scalar defaulting to TRUE indicating whether to include or exclude the parameters given by the pars argument. If FALSE, only entire multidimensional parameters can be excluded, rather than particular elements of them.

cores

Number of cores to use when executing the chains in parallel, which defaults to 1 but we recommend setting the mc.cores option to be as many processors as the hardware and RAM allow (up to the number of chains).

open_progress

Logical scalar that only takes effect if cores > 1 but is recommended to be TRUE in interactive use so that the progress of the chains will be redirected to a file that is automatically opened for inspection. For very short runs, the user might prefer FALSE.

...

Other optional parameters:

  • chain_id (integer)

  • init_r (double, positive)

  • test_grad (logical)

  • append_samples (logical)

  • refresh(integer)

  • save_warmup(logical)

  • deprecated: enable_random_init(logical)

chain_id can be a vector to specify the chain_id for all chains or an integer. For the former case, they should be unique. For the latter, the sequence of integers starting from the given chain_id are used for all chains.

init_r is used only for generating random initial values, specifically when init="random" or not all parameters are initialized in the user-supplied list or function. If specified, the initial values are simulated uniformly from interval [-init_r, init_r] rather than using the default interval (see the manual of (cmd)Stan).

test_grad (logical). If test_grad=TRUE, Stan will not do any sampling. Instead, the gradient calculation is tested and printed out and the fitted stanfit object is in test gradient mode. By default, it is FALSE.

append_samples (logical). Only relevant if sample_file is specified and is an existing file. In that case, setting append_samples=TRUE will append the samples to the existing file rather than overwriting the contents of the file.

refresh (integer) can be used to control how to indicate the progress during sampling (i.e. show the progress every refresh iterations). By default, refresh = max(iter/10, 1). The progress indicator is turned off if refresh <= 0.

Deprecated: enable_random_init (logical) being TRUE enables specifying initial values randomly when the initial values are not fully specified from the user.

save_warmup (logical) indicates whether to save draws during the warmup phase and defaults to TRUE. Some memory related problems can be avoided by setting it to FALSE, but some diagnostics are more limited if the warmup draws are not stored.

boost_lib

The path for an alternative version of the Boost C++ to use instead of the one in the BH package.

eigen_lib

The path for an alternative version of the Eigen C++ library to the one in RcppEigen.

Details

stan does all of the work of fitting a Stan model and returning the results as an instance of stanfit. First, it translates the Stan model to C++ code. Second, the C++ code is compiled into a binary shared object, which is loaded into the current R session (an object of S4 class stanmodel is created). Finally, samples are drawn and wrapped in an object of S4 class stanfit, which has methods such as print, summary, and plot to inspect and retrieve the results of the fitted model.

stan can also be used to sample again from a fitted model under different settings (e.g., different iter, data, etc.) by using the fit argument to specify an existing stanfit object. In this case, the compiled C++ code for the model is reused.

Value

An object of S4 class stanfit. However, if cores > 1 and there is an error for any of the chains, then the error(s) are printed. If all chains have errors and an error occurs before or during sampling, the returned object does not contain samples. But the compiled binary object for the model is still included, so we can reuse the returned object for another sampling.

Note

The data passed to stan are preprocessed before being passed to Stan. If data is not a character vector, the data block of the Stan program is parsed and R objects of the same name are searched starting from the calling environment. Then, if data is list-like but not a data.frame the elements of data take precedence. This behavior is similar to how a formula is evaluated by the lm function when data is supplied. In general, each R object being passed to Stan should be either a numeric vector (including the special case of a 'scalar') or a numeric array (matrix). The first exception is that an element can be a logical vector: TRUE's are converted to 1 and FALSE's to 0. An element can also be a data frame or a specially structured list (see details below), both of which will be converted into arrays in the preprocessing. Using a specially structured list is not encouraged though it might be convenient sometimes; and when in doubt, just use arrays.

This preprocessing for each element mainly includes the following:

  1. Change the data of type from double to integer if no accuracy is lost. The main reason is that by default, R uses double as data type such as in a <- 3. But Stan will not read data of type int from real and it reads data from int if the data type is declared as real.

  2. Check if there is NA in the data. Unlike BUGS, Stan does not allow missing data. Any NA values in supplied data will cause the function to stop and report an error.

  3. Check data types. Stan allows only numeric data, that is, doubles, integers, and arrays of these. Data of other types (for example, characters and factors) are not passed to Stan.

  4. Check whether there are objects in the data list with duplicated names. Duplicated names, if found, will cause the function to stop and report an error.

  5. Check whether the names of objects in the data list are legal Stan names. If illegal names are found, it will stop and report an error. See (Cmd)Stan's manual for the rules of variable names.

  6. When an element is of type data.frame, it will be converted to matrix by function data.matrix.

  7. When an element is of type list, it is supposed to make it easier to pass data for those declared in Stan code such as "vector[J] y1[I]" and "matrix[J,K] y2[I]". Using the latter as an example, we can use a list for y2 if the list has "I" elements, each of which is an array (matrix) of dimension "J*K". However, it is not possible to pass a list for data declared such as "vector[K] y3[I,J]"; the only way for it is to use an array with dimension "I*J*K". In addition, technically a data.frame in R is also a list, but it should not be used for the purpose here since a data.frame will be converted to a matrix as described above.

Stan treats a vector of length 1 in R as a scalar. So technically if, for example, "real y[1];" is defined in the data block, an array such as "y = array(1.0, dim = 1)" in R should be used. This is also the case for specifying initial values since the same underlying approach for reading data from R in Stan is used, in which vector of length 1 is treated as a scalar.

In general, the higher the optimization level is set, the faster the generated binary code for the model runs, which can be set in a Makevars file. However, the binary code generated for the model runs fast by using a higher optimization level at the cost of longer times to compile the C++ code.

References

The Stan Development Team Stan Modeling Language User's Guide and Reference Manual. http://mc-stan.org.

The Stan Development Team CmdStan Interface User's Guide. http://mc-stan.org.

See Also

stanc for translating model code in Stan modeling language to C++, sampling for sampling, and stanfit for the fitted results.

see extract and as.array.stanfit for extracting samples from stanfit objects.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
## Not run: 
#### example 1 
library(rstan)
scode <- "
parameters {
  real y[2]; 
} 
model {
  y[1] ~ normal(0, 1);
  y[2] ~ double_exponential(0, 2);
} 
"
fit1 <- stan(model_code = scode, iter = 10, verbose = FALSE) 
print(fit1)
fit2 <- stan(fit = fit1, iter = 10000, verbose = FALSE) 

## extract samples as a list of arrays
e2 <- extract(fit2, permuted = TRUE)

## using as.array on the stanfit object to get samples 
a2 <- as.array(fit2)

#### example 2
#### the result of this package is included in the package 

excode <- '
  transformed data {
    real y[20];
    y[1] = 0.5796;  y[2] = 0.2276;   y[3]  = -0.2959; 
    y[4] = -0.3742; y[5] = 0.3885;   y[6]  = -2.1585;
    y[7] = 0.7111;  y[8] = 1.4424;   y[9]  = 2.5430; 
    y[10] = 0.3746; y[11] = 0.4773;  y[12] = 0.1803; 
    y[13] = 0.5215; y[14] = -1.6044; y[15] = -0.6703; 
    y[16] = 0.9459; y[17] = -0.382;  y[18] = 0.7619;
    y[19] = 0.1006; y[20] = -1.7461;
  }
  parameters {
    real mu;
    real<lower=0, upper=10> sigma;
    vector[2] z[3];
    real<lower=0> alpha;
  } 
  model {
    y ~ normal(mu, sigma);
    for (i in 1:3) 
      z[i] ~ normal(0, 1);
    alpha ~ exponential(2);
  } 
'

exfit <- stan(model_code = excode, save_dso = FALSE, iter = 500)
print(exfit)
plot(exfit)

## End(Not run)
## Not run: 
## examples of specify argument `init` for function stan

## define a function to generate initial values that can
## be fed to function stan's argument `init`
# function form 1 without arguments 
initf1 <- function() {
  list(mu = 1, sigma = 4, z = array(rnorm(6), dim = c(3,2)), alpha = 1)
} 
# function form 2 with an argument named `chain_id`
initf2 <- function(chain_id = 1) {
  # cat("chain_id =", chain_id, "\n")
  list(mu = 1, sigma = 4, z = array(rnorm(6), dim = c(3,2)), alpha = chain_id)
} 

# generate a list of lists to specify initial values
n_chains <- 4
init_ll <- lapply(1:n_chains, function(id) initf2(chain_id = id))
 
exfit0 <- stan(model_code = excode, init = initf1) 
stan(fit = exfit0, init = initf2) 
stan(fit = exfit0, init = init_ll, chains = n_chains)

## End(Not run)

rstan documentation built on Jan. 20, 2018, 7:10 p.m.