# blrm: Bayesian Binary and Ordinal Logistic Regression In rmsb: Bayesian Regression Modeling Strategies

## Description

Uses `rstan` with pre-compiled Stan code to get posterior draws of parameters from a binary logistic or proportional odds semiparametric ordinal logistic model. The Stan code internally using the qr decompositon on the design matrix so that highly collinear columns of the matrix do not hinder the posterior sampling. The parameters are transformed back to the original scale before returning results to R. Design matrix columns are centered before running Stan, so Stan diagnostic output will have the intercept terms shifted but the results of `blrm()` for intercepts are for the original uncentered data. The only prior distributions for regression betas are normal with mean zero, and the vector of prior standard deviations is given in `priorsd`. These priors are for the qr-projected design matrix elements, except that the very last element is not changed. So if one has a single non-interactive linear or binary variable for which a skeptical prior is designed, put that variable last in the model.

## Usage

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33``` ```blrm( formula, ppo = NULL, cppo = NULL, keepsep = NULL, data = environment(formula), subset, na.action = na.delete, priorsd = rep(100, p), priorsdppo = rep(100, pppo), iprior = 0, conc = 1/(0.8 + 0.35 * max(k, 3)), ascale = 1, psigma = 1, rsdmean = if (psigma == 1) 0 else 1, rsdsd = 1, normcppo = TRUE, iter = 2000, chains = 4, refresh = 0, progress = if (refresh > 0) "stan-progress.txt" else "", x = TRUE, y = TRUE, loo = n <= 1000, ppairs = NULL, method = c("both", "sampling", "optimizing"), inito = if (length(ppo)) 0 else "random", inits = inito, standata = FALSE, file = NULL, debug = FALSE, ... ) ```

## Arguments

 `formula` a R formula object that can use `rms` package enhancements such as the restricted interaction operator `ppo` formula specifying the model predictors for which proportional odds is not assumed `cppo` a function that if present causes a constrained partial PO model to be fit. The function specifies the values in the Gamma vector in Peterson and Harrell (1990) equation (6). To make posterior sampling better behaved, the function should be scaled and centered. This is done by wrapping `cppo` in a function that scales the `cppo` result before return the vector value. See the `normcco` argument for how to prevent this. The default normalization is based on the mean and standard deviation of the function values over the distribution of observed Y. For getting predicted values and estimates post-`blrm()`, `cppo` must not reference any functions that are not available at such later times. `keepsep` a single character string containing a regular expression applied to design matrix column names, specifying which columns are not to be QR-orthonormalized, so that priors for those columns apply to the original parameters. This is useful for treatment and treatment interaction terms. For example `keepsep='treat'` will keep separate all design matrix columns containing `'treat'` in their names. Some characters such as the caret used in polynomial regression terms will need to be escaped by a double backslash. `data` a data frame; defaults to using objects from the calling environment `subset` a logical vector or integer subscript vector specifying which subset of data whould be used `na.action` default is `na.delete` to remove missings and report on them `priorsd` vector of prior standard deviations. If the vector is shorter than the number of model parameters, it will be repeated until the length equals the number of parametertimes. `priorsdppo` vector of prior standard deviations for non-proportional odds parameters. As with `priorsd` the last element is the only one for which the SD corresponds to the original data scale. `iprior` specifies whether to use a Dirichlet distribution for the cell probabilities, which induce a more complex prior distribution for the intercepts (`iprior=0`, the default), non-informative priors (`iprior=1`) directly on the intercept parameters, or to directly use a t-distribution with 3 d.f. and scale parameter `ascale` (`iprior=2`). `conc` the Dirichlet distribution concentration parameter for the prior distribution of cell probabilities at covariate means. The default is the reciprocal of 0.8 + 0.35 max(k, 3) where k is the number of Y categories. The default is chosen to make the posterior mean of the intercepts more closely match the MLE. For optimizing, the concentration parameter is always 1.0 to obtain results very close to the MLE for providing the posterior mode. `ascale` scale parameter for the t-distribution for priors for the intercepts if `iprior=2`, defaulting to 1.0 `psigma` defaults to 1 for a half-t distribution with 4 d.f., location parameter `rsdmean` and scale parameter `rsdsd`. Set `psigma=2` to use the exponential distribution. `rsdmean` the assumed mean of the prior distribution of the standard deviation of random effects. When `psigma=2` this is the mean of an exponential distribution and defaults to 1. When `psigma=1` this is the mean of the half-t distribution and defaults to zero. `rsdsd` applies only to `psigma=1` and is the scale parameter for the half t distribution for the SD of random effects, defaulting to 1. `normcppo` set to `FALSE` to leave the `cppo` function as-is without automatically centering and scaling the result `iter` number of posterior samples per chain for `rstan::sampling()` to run `chains` number of separate chains to run `refresh` see `rstan::sampling()`. The default is 0, indicating that no progress notes are output. If `refresh > 0` and `progress` is not `''`, progress output will be appended to file `progress`. The default file name is `'stan-progress.txt'`. `progress` see `refresh`. Defaults to `''` if `refresh = 0`. Note: If running interactively but not under RStudio, `rstan` will open a browser window for monitoring progress. `x` set to `FALSE` to not store the design matrix in the fit. `x=TRUE` is needed if running `blrmStats` for example. `y` set to `FALSE` to not store the response variable in the fit `loo` set to `FALSE` to not run `loo` and store its result as object `loo` in the returned object. `loo` defaults to `FALSE` if the sample size is greater than 1000, as `loo` requires the per-observation likelihood components, which creates a matrix N times the number of posterior draws. `ppairs` set to a file name to run `rstan` `pairs` and store the resulting png plot there. Set to `TRUE` instead to directly plot these diagnostics. The default is not to run `pairs`. `method` set to `'optimizing'` to run the Stan optimizer and not do posterior sampling, `'both'` (the default) to run both the optimizer and posterior sampling, or `'sampling'` to run only the posterior sampling and not compute posterior modes. Running `optimizing` is a way to obtain maximum likelihood estimates and allows one to quickly study the effect of changing the prior distributions. When `method='optimizing'` is used the result returned is not a standard `blrm()` object but is instead the parameter estimates, -2 log likelihood, and optionally the Hession matrix (if you specify `hessian=TRUE` in ...). When `method='both'` is used, `rstan::sampling()` and `rstan::optimizing()` are both run, and parameter estimates (posterior modes) from `optimizing` are stored in a matrix `param` in the fit object, which also contains the posterior means and medians, and other results from `optimizing` are stored in object `opt` in the `blrm()` fit object. When random effects are present, `method` is automatically set to `'sampling'` as maximum likelihood estimates without marginalizing over the random effects do not make sense. `inito` intial value for optimization. The default is the `rstan` default `'random'`. Frequently specifying `init=0` will benefit when the number of distinct Y categories grows or when using `ppo` hence 0 is the default for that. `inits` initial value for sampling, defaults to `inito` `standata` set to `TRUE` to return the Stan data list and not run the model `file` a file name for a `saveRDS`-created file containing or to contain the saved fit object. If `file` is specified and the file does not exist, it will be created right before the fit object is returned, less the large `rstan` object. If the file already exists, its stored `md5` hash string `datahash` fit object component is retrieved and compared to that of the current `rstan` inputs. If the data to be sent to `rstan`, the priors, and all sampling and optimization options and stan code are identical, the previously stored fit object is immediately returned and no new calculatons are done. `debug` set to `TRUE` to output timing and progress information to /tmp/debug.txt `...` passed to `rstan::optimizing()`. The `seed` parameter is a popular example.

## Details

The partial proportional odds model of Peterson and Harrell (1990) is implemented, and is invoked when the user specifies a second model formula as the `ppo` argument. This formula has no left-hand-side variable, and has right-side variables that are a subset of those in `formula` specifying for which predictors the proportional odds assumption is relaxed.

The Peterson and Harrell (1990) constrained partial proportional odds is also implemented, and is usually preferred to the above unconstrained PPO model as it adds a vector of coefficients instead of a matrix of coefficients. In the constrained PPO model the user provides a function `cppo` that computes a score for all observed values of the dependent variable. For example with a discrete ordinal outcome `cppo` may return a value of 1.0 for a specific value of Y and zero otherwise. That will result in a departure from the proportional odds assumption for just that one level of Y. The value returned by `cppo` at the lowest Y value is never used in any case.

`blrm()` also handles single-level hierarchical random effects models for the case when there are repeated measurements per subject which are reflected as random intercepts, and a different experimental model that allows for AR(1) serial correlation within subject. For both setups, a `cluster` term in the model signals the existence of subject-specific random effects.

See https://hbiostat.org/R/rms/blrm.html for multiple examples with results.

## Value

an `rms` fit object of class `blrm`, `rmsb`, `rms` that also contains `rstan` results under the name `rstan`. In the `rstan` results, which are also used to produce diagnostics, the intercepts are shifted because of the centering of columns of the design matrix done by `blrm()`. With `method='optimizing'` a class-less list is return with these elements: `coefficients` (MLEs), `beta` (non-intercept parameters on the QR decomposition scale), `deviance` (-2 log likelihood), `return_code` (see `rstan::optimizing()`), and, if you specified `hessian=TRUE` to `blrm()`, the Hessian matrix. To learn about the scaling of orthogonalized QR design matrix columns, look at the `xqrsd` object in the returned object. This is the vector of SDs for all the columns of the transformed matrix. Those kept out by the `keepsep` argument will have their original SDs.

## Author(s)

Frank Harrell and Ben Goodrich

`print.blrm()`, `blrmStats()`, `stanDx()`, `stanGet()`, `coef.rmsb()`, `vcov.rmsb()`, `print.rmsb()`, `coef.rmsb()`
 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22``` ```## Not run: getHdata(Titanic3) dd <- datadist(titanic3); options(datadist='dd') f <- blrm(survived ~ (rcs(age, 5) + sex + pclass)^2, data=titanic3) f # model summary using print.blrm coef(f) # compute posterior mean parameter values coef(f, 'median') # compute posterior median values stanDx(f) # print basic Stan diagnostics s <- stanGet(f) # extract rstan object from fit plot(s, pars=f\$betas) # Stan posteriors for beta parameters stanDxplot(s) # Stan diagnostic plots by chain blrmStats(f) # more details about predictive accuracy measures ggplot(Predict(...)) # standard rms output summary(f, ...) # invokes summary.rms contrast(f, ...) # contrast.rms computes HPD intervals plot(nomogram(f, ...)) # plot nomogram using posterior mean parameters # Fit a random effects model to handle multiple observations per # subject ID f <- blrm(outcome ~ rcs(age, 5) + sex + cluster(id), data=mydata) ## End(Not run) ```