R2BGLiMS: Call BGLiMS from R

Description Usage Arguments Value Author(s) See Also Examples

View source: R/R2BGLiMS.R

Description

Calls BGLiMS - a Java package for fitting GLMs under Bayesian model selection. NOTE: The predictors to explore with model selection are specified via the model.space.priors argument - see examples. By Default a common, and unknown, prior SD is used for all predictors under model selection, which in turn is assigned an Inverse-Gamma hyper-prior. Fixed normal priors may be user specified for all predictor (and confounder) coefficients via the beta.priors argument.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
R2BGLiMS(
  likelihood = NULL,
  data = NULL,
  outcome.var = NULL,
  times.var = NULL,
  confounders = NULL,
  model.selection = TRUE,
  model.space.priors = NULL,
  beta.priors = NULL,
  beta.prior.partitions = NULL,
  standardise.covariates = TRUE,
  empirical.intercept.prior.mean.and.initial.value = NULL,
  g.prior = TRUE,
  tau = NULL,
  xtx.ridge.term = 0,
  enumerate.up.to.dim = 0,
  X.ref = NULL,
  cor.ref = NULL,
  mafs.ref = NULL,
  ns.each.ethnicity = NULL,
  marginal.betas = NULL,
  n = NULL,
  n.iter = 1e+06,
  n.mil.iter = NULL,
  thinning.interval = NULL,
  seed = NULL,
  extra.arguments = NULL,
  initial.model = NULL,
  max.model.dim = -1,
  save.path = NULL,
  results.label = NULL,
  burnin.fraction = 0.5,
  trait.variance = NULL,
  logistic.likelihood.weights = NULL,
  mrloss.w = 0,
  mrloss.function = "variance",
  mrloss.marginal.by = NULL,
  mrloss.marginal.sy = NULL,
  mafs.if.independent = NULL,
  extra.java.arguments = NULL,
  debug = FALSE
)

Arguments

likelihood

Type of model to fit. Current options are "Logistic" (for binary data), "CLogLog" complementary log-log link for binary data, "Weibull" (for survival data), "Linear" (for linear regression), "LinearConj" (linear regression exploiting conjugate results), "JAM" (for conjugate linear regression using summary statistics, integrating out parameters) and "JAM_MCMC" (for linear regression using summary statistics, with full MCMC of all parameters).

data

Matrix or dataframe containing the data to analyse. Rows are indiviuals, and columns contain the variables and outcome. If modelling summary statistics specify X.ref, marginal.betas, and n instead (see below).

outcome.var

Name of outcome variable in data. For survival data see times.var below. If modelling summary statistics with JAM this can be left null but you must specify X.ref, marginal.beats and n instead (see below).

times.var

SURVIVAL DATA ONLY Name of column in data which contains the event times.

confounders

Optional vector of confounders to fix in the model at all times, i.e. exclude from model selection.

model.selection

Whether to use model selection (default is TRUE). NB: Even if set to FALSE, please provide a dummy model.space.priors argument (see below). This will not be used mathematically, but the package requires taking the list of variable names from the "Variables" element therein.

model.space.priors

Must be specified if model.selection is set to TRUE. Two options are available. 1) A fixed prior is placed on the proportion of causal covariates, and all models with the same number of covariates is equally likely. This is effectively a Poisson prior over the different possible model sizes. A list must be supplied for 'model.space.priors' with an element "Rate", specifying the prior proportion of causal covariates, and an element "Variables" containing the list of covariates included in the model search. 2) The prior proportion of causal covariates is treated as unknown and given a Beta(a, b) hyper-prior, in which case elements "a" and "b" must be included in the 'model.space.priors' list rather than "Rate". Higher values of "b" relative to "a" will encourage sparsity. NOTE: It is easy to specify different model space priors for different collections of covariates by providing a list of lists, each element of which is itself a model.space.prior list asm described above for a particular subset of the covariates.

beta.priors

This allows specifying fixed (potentially informative) priors for the covariate effect priors. A matrix must be passed, with named rows corresponding to parameters, and columns corresponding to the prior mean and variance in that order. When using this option priors must be specified for either just the confounders, which are otherwise given fixed N(0,1e6) priors, or for all covariates.

beta.prior.partitions

Covariate effects under variable selection are ascribed, by default, a common Normal prior, the standard deviation of which is treated as unknown, with a Unif(0.05,2) hyper-prior. This option can be used to partition the covariate effects into different prior groups, each with a seperate hierarchical normal prior. beta.prior.partitions must be a list with as many elements as desired covariate groups. The element for a particular group must in turn be a list containing the following named elements: "Variables" - a list of covariates in the prior group, and "UniformA" and "UniformB" the Uniform hyper parameters for the standard deviation of the normal prior across their effects.

standardise.covariates

Standardise covariates prior to RJMCMC such that they have a common (unit) standard deviation and mean zero. This is particularly recommended when covariates have substantially different variances, to improve the likelihood of exchangeable effect estimates, an asssumption that is required under the (default) common effect prior with unknown variance used in the RJMCMC. The standardisation is done invisibly to the user; parameter estimates are re-scaled back to unit increases on the original scale before producing posterior summaries. This is currently available for Logistic, Linear and Weibull regression. Default is TRUE. Note that when this option is used, covariate effects are re-scaled in the summary results table contained in the R2BGLiMS results object, such that they are still interpretable as effects corresponding to unit changes on the original covariate scale.

empirical.intercept.prior.mean.and.initial.value

Empirically set the intercept inital value and prior mean according to the mean outcome value, i.e. the expected intercept value when covariates are mean-centred. In some settings this can markedly improve mixing but EXPERIMENTATION is encouraged since in other settings it can do more harm than good. For Weibull regression this option leads to both the intercept AND the scale parameter having their prior means and initial values set according to a simple NULL Weibull model fit using the survreg function. In our experience this always improved mixing under Weibull regression, so is enabled by default, but can be over-ridden by setting this option to FALSE. For logistic regression, setting this option to TRUE leads to the intercept prior mean and initial value being set to the logit(case fraction), and for linear regression to the mean outcome. For linear and logistic regression the usefulness of this option is less clear, so it is off by default.

g.prior

Whether to use a g-prior for the beta's, i.e. a multivariate normal with correlation structure proportional to sigma^2*X'X^-1, which is thought to aid variable selection in the presence of strong correlation. By default this is enabled.

tau

Value to use for sparsity parameter tau (under the tau*sigma^2 parameterisation). When using the g-prior, a recommended default is max(n, P^2) where n is the number of individuals, and P is the number of predictors.

xtx.ridge.term

Value to add to the constant of the diagonal of X'X before JAM takes the Cholesky decomposition.

enumerate.up.to.dim

Whether to make posterior inference by exhaustively calculating the posterior support for every possible model up to this dimension. Leaving at 0 to disable and use RJMCMC instead. The current maximum allowed value is 5.

X.ref

Reference genotype matrix used by JAM to impute the SNP-SNP correlations. If multiple regions are to be analysed this should be a list containing reference genotype matrices for each region. Individual's genotype must be coded as a numeric risk allele count 0/1/2. Non-integer values reflecting imputation may be given. NB: The risk allele coding MUST correspond to that used in marginal.betas. These matrices must each be positive definite and the column names must correspond to the names of the marginal.betas vector.

cor.ref

Alternatively to a reference genotype matrix, a reference correlation matrix AND mafs may be supplied to JAM. NB: The risk allele coding MUST correspond to that used in marginal.betas. These matrices must each be positive definite and the column and row names must correspond to the names of the marginal.betas vector.

mafs.ref

Alternatively to a reference genotype matrix, a reference correlation matrix AND mafs may be supplied to JAM. NB: The risk allele coding MUST correspond to that used in marginal.betas. This must be a named vector with names correspond to the names of the marginal.betas vector.

ns.each.ethnicity

For mJAM: A vector of the sizes of each ethnicity dataset in which the summary statistics were calculated.

marginal.betas

Vector of (named) marginal effect estimates to re-analyse with JAM under multivariate models. For multi-ethnic "mJAM" please provide a list of vectors, each element of which is a vector of marginal effects for a specific ethnicity over the same variants.

n

The size of the dataset in which the summary statistics (marginal.betas) were calculated

n.iter

Number of iterations to run (default is 1e6)

n.mil.iter

Number of million iterations to run. Can optionally be used instead of n.iter for convenience, which it will overide if specified.

thinning.interval

Every nth iteration to store (i.e. for the Java algorithm to write to a file and then read into R). By default this is the number of iterations divided by 1e4 (so for 1 million iterations every 100th is stored.) Higher values (so smaller posterior sample) can lead to faster runtimes for large numbers of covariates.

seed

Which random number seed to use in the RJMCMC sampler. If none is provided, a random seed is picked between 1 and 2^16.

extra.arguments

A named list of additional arguments for which there are not currently dedicated options for. This can be used to modify various "under the hood" settings, including all prior hyper-parameters, MCMC mixing parameters such as the probabilities of add/delete/swap moves as well as the adaption settings. A list of all settings available for modification can be seen by typing "data(DefaultArguments)" and then "default.arguments", which lists their names and default values.

initial.model

An initial model for the covariates under selection can be specified as a vector of 0s and 1s. If left un-specified the null (empty) model is used.

max.model.dim

The maximum model dimension can be specified, therefore truncating the model space. We do not recommend using this option but it might sometimes be useful for robust-ness checks. When left un-specified there is no restriction on the model size.

save.path

By default R2BGLiMS writes the posterior samples to a temporary file that is deleted after they have been read in to R. By specifying a file path this can be kept, along with the temporary files used for the data and arguments, which can be useful sometimes for de-bugging.

results.label

When using the save.path option, this allows the user to specify a handle with which to name the files.

burnin.fraction

Initial fraction of the iterations to throw away, e.g. setting to 0 would mean no burn-in. The default of 0.5 corresponds to the first half of iterations being discarded.

logistic.likelihood.weights

An optional vector of likelihood weights for logistic regression. These weights multiply the log-likeihood contribution of each individual. The order should match the order of rows in the data matrix.

mrloss.w

The relative weight of the MR log loss function for pleiotropy vs the log likelihood. Default 0.

mrloss.function

Choice of pleiotropic loss function from "steve", "variance" (default variance)

mrloss.marginal.by

Marginal associations between SNPs and outcome for the MR loss function model.

mrloss.marginal.sy

Standard errors of marginal associations between SNPs and outcome for the MR loss function model (not required for mrloss.function "variance")

mafs.if.independent

If the SNPs are independent then a reference genotype matrix is not required. However, it is still necessary to provide SNP MAFs here as a named vector. Doing so will lead to X.ref being ignored and the SNPs to be modelled as if they are independent. Note that this option does not work with enumeration.

extra.java.arguments

A character string to be passed through to the java command line. E.g. to specify a different temporary directory by passing "-Djava.io.tmpdir=/Temp".

debug

Whether to output extra information (such as final adaption proposal SDs) which might help with debugging (default is FALSE).

Value

An R2BGLiMS_Results class object is returned. See the slot 'posterior.summary.table' for a posterior summary of all parameters. See slot 'mcmc.output' for a matrix containing the raw MCMC output from the saved posterior samples (0 indicates a covariate is excluded from the model in a particular sample. Some functions for summarising results are listed under "see also".

Author(s)

Paul Newcombe

See Also

Summary results are stored in the slot posterior.summary.table. See ManhattanPlot for a visual summary of covariate selection probabilities. For posterior model space summaries see TopModels. For convergence checks see TracePlots and AutocorrelationPlot.

Examples

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
library(R2BGLiMS)

# NB: See ?JAM for JAM examples.
#
# Examples below are:
# 1) Logistic regression
# 2) Weibull regression survival analysis
# 3) Case-cohort logistic regression (with hazard ratio estimation using Prenctice weighted Cox regression)
# 4) Linear regression
# 5) Linear regression using a conjugate posterior (with coefficients analytically intergreated out)
# 6) Fixed priors on effects, rather than the default common prior with unknown variance
# 7) Using two model space partitions

##########################################
# --- Example 1) Logistic regression --- #
##########################################

utils::data(biopsy, package = "MASS") # Example logistic dataset
covariate.names <- paste0("V",c(1:9))
# Recommend standardising predictors to justify default common hierarchical prior on effects
for (v in covariate.names) {biopsy[,v] <- scale(biopsy[,v])} 
biopsy$class <- as.integer(biopsy$class) - 1
biopsyResults <- R2BGLiMS( # Takes a few minutes to run
  likelihood="Logistic",
  data=biopsy,
  outcome.var="class",
  model.space.priors=list("a"=1, "b"=length(covariate.names), "Variables"=covariate.names), # Beta-binomial(1,P) model space prior
  extra.arguments=list("AlphaPriorMu"=log(mean(biopsy$class)/(1-mean(biopsy$class)))) # Recommend centering intercept prior on logit(event rate)
)
# Simple convergence diagnostic
plot(biopsyResults@mcmc.output[,"LogLikelihood"], type="l") # Looks ok
# Summary plot of selection probabilities
ManhattanPlot(biopsyResults) # Evidence of several independent effects
# Summary table
biopsyResults@posterior.summary.table
TopModels(biopsyResults)

###########################################################
# --- Example 2) Weibull regression survival analysis --- #
###########################################################

utils::data(VA, package = "MASS")
predictors <- c("treat","age","Karn","diag.time","prior")
for (v in predictors) {VA[,v] <- scale(as.numeric(VA[,v]))} # Normalise predictors
VA$stime <- VA$stime/max(VA$stime)# Recommend scaling survival times to between 0 and 1
va.results.weibull <- R2BGLiMS(
  likelihood="Weibull",
  data=VA,
  outcome.var="status",
  times.var="stime",
  model.space.priors=list(list("a"=1,"b"=length(predictors),"Variables"=predictors)) # Beta-binomial(1,P) model space prior
)
plot(va.results.weibull@mcmc.output[,"LogLikelihood"], type="l") # Looks ok
ManhattanPlot(va.results.weibull) # Clear signal at Kern
va.results.weibull@posterior.summary.table
TopModels(va.results.weibull)

######################################################
# --- Example 3) Case-cohort logistic regression --- #
######################################################

# --- Step 1: Logistic RJMCMC
data("CaseCohortExample") # Only V1,V2,V3,V4,V5 have true effects.
for (v in covariate.names) { data.cc[,v] <- data.cc[,v] - mean(data.cc[,v]) }
logistic.cc.res <- R2BGLiMS(
  likelihood="Logistic",
  data=data.cc,
  outcome.var="event",
  model.space.priors=list(list("a"=1,"b"=length(covariate.names),"Variables"=covariate.names)),
  n.mil=0.2,
  extra.arguments=list("AlphaPriorMu"=log(mean(data.cc$event)/(1-mean(data.cc$event))))
)
plot(logistic.cc.res@mcmc.output[,"LogLikelihood"], type="l") # Looks ok
ManhattanPlot(logistic.cc.res) # Identifies 4 out of 5 true effects
logistic.cc.res@posterior.summary.table

# --- Step 2: Effect estimation using the Prentice model
library(survival)
top.vars <- names(which(logistic.cc.res@posterior.summary.table[covariate.names,"BF"]>5))
prentice.model.formula <- as.formula(paste("Surv(times, event) ~ ",paste(top.vars, collapse="+")))
prentice.res <- cch(
  prentice.model.formula,
  data = data.cc,
  subcoh = ~subcohort,
  id=~ID,
  cohort.size=n.complete.cohort, 
  method="Prentice")
summary(prentice.res)

#######################################################
# --- Example 4) Linear regression with full MCMC --- #
#######################################################

data("LinearModelExample") # True effect at V1 only
lm.results <- R2BGLiMS(
  likelihood="Linear", 
  data=data.cts.outcome, 
  outcome.var="y", 
  confounders="confounder", # Example of a variable to include always
  model.space.priors = list("a"=1, "b"=length(covariate.names), "Variables"=covariate.names) # Beta-binomial(1,P) prior on model space
)
plot(lm.results@mcmc.output[,"LogLikelihood"], type="l") # Looks ok
ManhattanPlot(lm.results)
lm.results@posterior.summary.table

#############################################################################
# --- Example 5) Linear regression using a conjugate marginal posterior --- #
#############################################################################

# Coefficients are analytically integrated out
# Only models are sampled so mixing should be better
data("LinearModelExample") # True effect at V1 only
lm.conjugate.results <- R2BGLiMS(
  likelihood="LinearConj", 
  data=data.cts.outcome, 
  outcome.var="y",
  confounders="confounder", # Example of a variable to include always
  tau=max(nrow(data.cts.outcome),ncol(data.cts.outcome)^2), # Tau recommended to be maximum of P^2 and N
  model.space.priors = list("a"=1, "b"=length(covariate.names), "Variables"=covariate.names), # Beta-binomial(1,P)prior on model space
  g.prior=T)
# Summary plot of selection probabilities
ManhattanPlot(lm.conjugate.results) # More decisive than above
TopModels(lm.conjugate.results)
# Posterior sample of parameter in top model
posterior.sample <- NormInvGamPosteriorSample(
  data=data.cts.outcome, outcome.var ="y", confounders="confounder",
  model=c("V1"),tau=100)
# Median and credible interval
round(quantile(posterior.sample[,"V1"],c(0.5, 0.025, 0.975)),2) # Clearly excludes 0

##############################################
# --- Example 6) Fixed priors on effects --- #
##############################################

# By default, a common prior with unkown variance is assumed for all effects
# Might want to explicitly specify these priors instead the priors (e.g. if there is prior information)
data("LinearModelExample") # True effect at V1 only
lm.results.fixed.priors <- R2BGLiMS(
  likelihood="Linear", 
  data=data.cts.outcome, 
  outcome.var="y", 
  confounders="confounder", # Example of a variable to include always
  beta.priors = data.frame(
    cbind(
      rep(0,11), # Normal pior means
      rep(1,11)), # Normal prior SDs
    row.names=c("confounder",covariate.names)),
  model.space.priors = list("a"=1, "b"=length(covariate.names), "Variables"=covariate.names) # Beta-binomial(1,P) prior on model space
)
plot(lm.results.fixed.priors@mcmc.output[,"LogLikelihood"], type="l") # Looks ok
ManhattanPlot(lm.results.fixed.priors)
lm.results.fixed.priors@posterior.summary.table

#################################################
# --- Example 7) Two model space partitions --- #
#################################################

# Can provide different levels of prior sparsity to different groups of covariates
# Demonstrated below with two groups
data("LinearModelExample") # True effect at V1 only
lm.results.two.model.space.partitions <- R2BGLiMS(
  likelihood="Linear", 
  data=data.cts.outcome, 
  outcome.var="y", 
  confounders="confounder", # Example of a variable to include always
  model.space.priors = list(
    list("a"=1, "b"=1, "Variables"=covariate.names[1:5]), # Generous 50/50 prior proportion
    list("a"=1, "b"=1000, "Variables"=covariate.names[6:10]) # Very sparse!
    )
)
plot(lm.results.two.model.space.partitions@mcmc.output[,"LogLikelihood"], type="l") # Looks ok
ManhattanPlot(lm.results.two.model.space.partitions)
lm.results.two.model.space.partitions@posterior.summary.table

pjnewcombe/R2BGLiMS documentation built on Feb. 10, 2020, 8:52 p.m.