# sgmcmc: Getting Started In sgmcmc: Stochastic Gradient Markov Chain Monte Carlo

The goal of `sgmcmc` is to make it as easy as possible for users to run stochastic gradient MCMC (SGMCMC) algorithms. SGMCMC are algorithms which enable MCMC to scale more easily to large datasets, as traditional MCMC can run very slowly as dataset sizes grow.

`sgmcmc` implements a lot of the popular stochastic gradient MCMC methods including SGLD, SGHMC and SGNHT. The package uses automatic differentiation, so all the differentiation needed for the methods is calculated automatically. Control variate methods can be used in order to improve the efficiency of the methods as proposed in the recent publication. This package is designed to be user friendly. In order to execute these algorithms, users only need to specify the data; the log-likelihood function and log-prior density; the parameter starting values; and a few tuning parameters.

To enable as much flexibility as possible, the data and parameter starting points fed to the functions in `sgmcmc` are specified as lists. This allows users to specify multiple parameters and datasets. It also allows the user to easily reference these quantities in the log-likelihood function and log-prior density, and to set different stepsizes for different paramters (essential when parameters are on different scales).

### Specifying the Data

As we mentioned earlier, the datasets you wish to use are specified as a list. Suppose we have datasets we have already obtained or created `X` and `Y`, we would specify the whole dataset for our session as

```dataset = list("X" = X, "Y" = Y)
```

You can specify as many datasets as you like, the most important thing is that your naming is consistent, so you use the same names in your log-likelihood and log-prior functions, and in your list of stepsizes.

The functions assume that each observation is located on the first axis of the object. Suppose `Y` is a 2d matrix, then the observation \$Y_i\$ should be located at `dataset\$Y[i,]`. Similarly if `X` is a 3d array, observation \$X_i\$ should be located at `dataset\$X[i,,]`.

### Specifying the Parameters

Again parameters are specified as a list, the names are what we will refer to them as in the `logLik` and `logPrior` functions introduced below, the values are the desired starting points. Suppose my model depends on two parameter vectors `theta1` and `theta2`, and we want to start both from 0. If we assume both are length 3, this could be specified like this

```params = list("theta1" = rep(0, 3), "theta2" = rep(0, 3))
```

### Specifying the Log Likelihood and Log Prior

The log-likelihood function is specified as a function of the `dataset` and `params`, which have the same names as the lists you just specified. The only difference is that the objects inside the lists will have automatically been converted to TensorFlow objects for you. The `dataset` list will contain TensorFlow placeholders. The `params` list will contain TensorFlow variables. The `logLik` function should be a function that takes these lists as input and returns what the log-likelihood value should be given the current parameter and data values. It should do this using TensorFlow operations. More details about how TensorFlow works in `R` can be found here.

Specifying the `logLik` and `logPrior` functions regularly requires specifying specific distributions. `TensorFlow` already has a number of distributions implemented in the `TensorFlow Probability` package. All of the distributions implemented in TensorFlow Probability are located in `tf\$distributions`, a list is given on the TensorFlow Probability website. More complex distributions can be specified by coding up the `logLik` and `logPrior` functions by hand, examples of this, as well as using various distribution functions, are given in the other tutorials.

The `logPrior` function is specified in exactly the same way except that the function should only take `params` as input, as a prior should be independent of the dataset. You do not have to specify the prior at all, and this leads to the algorithm using a uniform, uninformative prior.

Suppose we want to simulate from the mean of a multivariate Normal density with each component of the mean having a Student-T prior, we would specify this as follows

```library(MASS)
# Simulate and declare dataset
dataset = list("X" = mvrnorm(10^4, c(0, 0), diag(2)))
# Simulate random starting point
params = list("theta" = rnorm(2))

# Declare log likelihood
logLik = function(params, dataset) {
# Declare distribution, assuming Sigma known and constant
SigmaDiag = c( 1, 1 )
# Return sum of log pdf
return(tf\$reduce_sum(distn\$log_prob(dataset\$X)))
}

# Declare log prior
logPrior = function(params) {
# Declare prior distribution
distn = tf\$distributions\$StudentT(3, 0, 1)
# Apply log prior componentwise and return sum
return(tf\$reduce_sum(distn\$log_prob(params\$theta)))
}
```

Most of the time just specifying the constants in these functions, such as `SigmaDiag` as `R` objects will be fine. But occassionally there are issues when these constants get automatically converted to `tf\$float64` objects by `TensorFlow` rather than `tf\$float32`. If you run into errors involving `tf\$float64` then force the constants to be input as `tf\$float32` by using `SigmaDiag = tf\$constant( c( 1, 1 ), dtype = tf\$float32 )`.

### Specifying the Tuning Parameters

The only other input that needs to be specified to set any of the standard stochastic gradient MCMC methods running is the `stepsize`. This is normally defined as a list with an entry for each parameter name as follows

```stepsize = list( "theta = 1e-5" )
```

A short hand is just to set `stepsize = 1e-5` which just sets the stepsize of each parameter to be `1e-5`. This shorthand can be used for any of the tuning constants.

So to get `sgld` working for the multivariate Normal log-likelihood function we specified, and an uninformative uniform prior, we can simply run

```sgld( logLik, dataset, params, stepsize )
```

similarly for `sghmc` and `sgnht`.

To use the Student-t prior we specified, and set a minibatch size of `500`, we'd use

```sgld( logLik, dataset, params, stepsize, logPrior = logPrior, minibatchSize = 500 )
```

For more details of the other optional arguments see the main documentation in the reference.

All of the control variate methods `sgldcv`, `sghmccv` and `sgnhtcv` require an extra input `optStepsize`, which is the stepsize tuning constant for the initial optimization to find the MAP estimates. This argument is simply a small numeric value, such as `0.1`, and may require some tuning, which we talk about below. To run `sgldcv` on the multivariate Normal model, with specified prior we'd use

```optStepsize = 0.1
sgldcv( logLik, dataset, params, stepsize, optStepsize, logPrior = logPrior, minibatchSize = 500 )
```

similar for `sghmccv` and `sgnhtcv`.

Most of the time, parameters need tuning, we suggest doing this using cross validation. You can roughly check the algorithm is converging by inspection by checking that the `Log-posterior estimate` output by the algorithm settles down eventually (it should decrease at first unless the chain converges very quickly).

### Next Steps

We suggest for more details you read the worked examples in the Articles section, these cover a variety of models (which will be expanded as the package matures): - Multivariate Gaussian - Gaussian Mixture - Logistic Regression

The SGMCMC algorithms can also be run step by step, which allows custom storage of parameters using test functions, or sequential estimates. Useful if your chain is too large to fit into memory! This requires a better knowledge of TensorFlow. An example of this is given in the neural network vignette.

Full details of the API can be found here.

## Try the sgmcmc package in your browser

Any scripts or data that you put into this service are public.

sgmcmc documentation built on Oct. 30, 2019, 11:39 a.m.