mrbayes: Bayesian linear regression using Merge and Reduce

Description Usage Arguments Value Details References Examples

Description

mrbayes is used to conduct Bayesian linear regression on very large data sets using Merge and Reduce as described in Geppert et al. (2020). Package rstan needs to be installed. When calling the function this is checked using requireNamespace as suggested by Hadley Wickham in "R packages" (section Dependencies, http://r-pkgs.had.co.nz/description.html, accessed 2020-07-31).

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
mrbayes(
  y,
  intercept = TRUE,
  fileMr = NULL,
  dataMr = NULL,
  obsPerBlock,
  dataStan = NULL,
  sep = "auto",
  dec = ".",
  header = TRUE,
  naStrings = "NA",
  colNames = NULL,
  naAction = na.fail,
  ...
)

Arguments

y

(character)
Column name of the dependent variable.

intercept

(logical)
Argument specifying whether the model should have an intercept term or not. Defaults to TRUE.

fileMr

(character)
The name of a file, including the filepath, to be read in blockwise. Either fileMr or dataMr needs to be specified. When using this argument, the arguments sep, dec, header, naStrings, colNames (as in fread) are of relevance. Further options from fread are currently not supported. Also note that defaults might differ. In case the data to be read in has row names, note that these will be read in as regular column. This may need special treatment.

dataMr

(data.frame)
The data to be used for the regression analysis. Either fileMr or dataMr needs to be specified. Note that the arguments sep, dec, header, naStrings, and colNames are ignored when dataMr is specified.

obsPerBlock

(numeric)
Value specifying the number of observations in each block. This number has to be larger than the number of regression coefficients. Moreover, the recommended ratio of observations per regression coefficient is larger than 25 (Geppert et al., 2020). Note that the last block may contain less observations than specified depending on the sample size. If the number of observations in this last block is too small it is not included in the model and a warning is issued.

dataStan

(list)
Optional argument. This argument is equivalent to the argument data in stan. If not specified the default dataStan, which makes use of all predictors is used. See section Details for the default dataStan and further notes on the syntax to be used when specifiying this argument.

sep

See documentation of fread. Default is "auto". Ignored when dataMr is specified.

dec

See documentation of fread. Default is ".". Ignored when dataMr is specified.

header

(logical)
See documentation of fread. Defaults to TRUE. Ignored when dataMr is specified. If header is set to FALSE and no colNames are given, then column names default to "V" followed by the column number.

naStrings

(character)
Optional argument. See argument na.strings of fread. Default is "NA". Ignored when dataMr is specified and optional when fileMr is used.

colNames

(character vector)
Same as argument col.names of fread. Ignored when dataMr is specified and optional when fileMr is used.

naAction

(function)
Action to be taken when missing values are present in the data. Currently only na.fail is supported.

...

Further optional arguments to be passed on to stan, especially pars and arguments that control the behaviour of the sampling in rstan such as chains, iter, warmup, and thin. Please refer to rstan.

Value

Returns an object of class "mrbayes" which is a list containing the following components:

level

Number of level of the final model in Merge and Reduce. This is equal to log2(ceiling(numberObs/obsPerBlock))+1 and corresponds to the number of buckets in Figure 1 of Geppert et al. (2020).

numberObs

The total number of observations.

summaryStats

Summary statistics including the mean, median, quartiles, 2.5% and 97.5% quantiles of the posterior distributions for each regression coefficient and the error term's standard deviation sigma.

diagnostics

Effective sample size (n_eff) and potential scale reduction factor on split chains (Rhat) calculated from the output of summary,stanfit-method. Note that, using Merge and Reduce, for each regression coefficient only one value is reported: For n_eff the minimum observed value on level 1 is reported and for Rhat the maximum observed value on level 1 is reported.

modelCode

The model. Syntax as in argument model_code of stan.

dataHead

First six rows of the data in the first block. This serves as a sanity check, especially when using the argument fileMr.

Details

Code of default dataStan makes use of all predictors:
dataStan = list(n = nrow(currentBlock),
d = (ncol(currentBlock) - 1),
X = currentBlock[, -colNumY],
y = currentBlock[, colNumY])
where currentBlock is the current block of data to be evaluated, n the number of observations, d the number of variables (without intercept), X contains the predictors, and y the dependent variable. colNumY is the column number of the dependent variable that the function finds internally.

When specifying the argument dataStan, note two things:
1. Please use the syntax of the default dataStan, i.e. the object containing the data of the block to be evaluated is called currentBlock, the number of observations must be set to n = nrow(currentBlock), d needs to be set to the number of variables without intercept, the dependent variable must be named y, and the independent variables must be named X.
2. The expressions within the list must be unevaluated: Therefore, use the function quote.

References

Geppert, L.N., Ickstadt, K., Munteanu, A., & Sohler, C. (2020). Streaming statistical models via Merge & Reduce. International Journal of Data Science and Analytics, 1-17,
doi: https://doi.org/10.1007/s41060-020-00226-0

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
# Package rstan needs to be installed for running this example.
 
if (requireNamespace("rstan", quietly = TRUE)) {
  n = 2000
  p = 4
  set.seed(34)
  x1 = rnorm(n, 10, 2)
  x2 = rnorm(n, 5, 3)
  x3 = rnorm(n, -2, 1)
  x4 = rnorm(n, 0, 5)
  y = 2.4 - 0.6 * x1 + 5.5 * x2 - 7.2 * x3 + 5.7 * x4 + rnorm(n)
  data = data.frame(x1, x2, x3, x4, y)

  normalmodell = '
  data {
  int<lower=0> n;
  int<lower=0> d;
  matrix[n,d] X; // predictor matrix
  vector[n] y; // outcome vector
  }
  parameters {
  real alpha; // intercept
  vector[d] beta; // coefficients for predictors
  real<lower=0> sigma; // error scale
  }
  model {
  y ~ normal(alpha + X * beta, sigma); // likelihood
  }
  ' 

  datas = list(n = nrow(data), d = ncol(data)-1,
               y = data[, dim(data)[2]], X = data[, 1:(dim(data)[2]-1)])
  fit0 = rstan::stan(model_code = normalmodell, data = datas, chains = 4, iter = 1000)
  fit1 = mrbayes(dataMr = data, obsPerBlock = 500, y = 'y')
}

mrregression documentation built on Jan. 13, 2021, 6:06 a.m.