
This function reads tabulated data sets and computes the summary statistics needed for Bayesian linear regression models fitted with the function `bayes.regress()` (in this package). Data sets that are too large to fit into R memory are handled using functions from the package `ff`. The function takes as input data files containing the predictor variables *X* and the response values *Y*, and returns the summary statistics *X'X*, *X'Y* and *Y'Y* that `bayes.regress()` takes as input. It also supports reading data sets that are split across multiple files.

```
read.regress.data.ff(filename = NULL, predictor.cols = NA, response.col = NA,
    update.summaries = NULL, fileEncoding = "", nrows = -1,
    first.rows = 1e5, next.rows = 1e5, levels = NULL, appendLevels = TRUE,
    FUN = "read.table", transFUN = NULL, asffdf_args = list(),
    BATCHBYTES = getOption("ffbatchbytes"), VERBOSE = FALSE, header = FALSE,
    sep = ",", quote = "\"'", dec = ".",
    numerals = c("allow.loss", "warn.loss", "no.loss"),
    na.strings = "NA", colClasses = "numeric", skip = 0,
    check.names = TRUE, fill = TRUE, strip.white = FALSE,
    blank.lines.skip = TRUE, comment.char = "#", allowEscapes = FALSE,
    flush = FALSE, skipNul = FALSE)
```

| Argument | Description |
| --- | --- |
| `filename` | the name of a file or a list of file names, from which the data will be read. By default it is assumed that the file(s) contain data in comma separated format; this can be changed using the |
| `predictor.cols` | a vector of integers that specifies the columns to treat as predictor variables, to create the design matrix |
| `response.col` | an integer that specifies the column that contains the response variable values, to create the response vector |
| `update.summaries` | The name of the R object containing previously-calculated summary statistics (if applicable), to be updated with new data. This must be a list similar in structure to the returned value, containing entries |
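The update performed through `update.summaries` amounts to adding the statistics of a new batch to the stored ones. The following is a minimal base-R sketch of that idea, not the package's implementation: the helpers `summarize_batch()` and `combine_summaries()` are hypothetical, the entry names `xtx`, `xty`, `yty` and `numsamp.data` are taken from the description of the returned value, and `numsamp.data` is assumed here to count rows.

```r
# Hypothetical helper (not part of the package): summary statistics for one
# in-memory batch, with a leading column of 1's added to X for the intercept.
summarize_batch <- function(X, y) {
  X1 <- cbind(1, as.matrix(X))
  list(xtx = crossprod(X1),        # X'X
       xty = crossprod(X1, y),     # X'Y
       yty = sum(y^2),             # Y'Y
       numsamp.data = nrow(X1))    # assumption: number of rows read
}

# Hypothetical helper: add a new batch's statistics to the earlier ones.
combine_summaries <- function(old, new) {
  list(xtx = old$xtx + new$xtx,
       xty = old$xty + new$xty,
       yty = old$yty + new$yty,
       numsamp.data = old$numsamp.data + new$numsamp.data)
}

set.seed(42)
X <- matrix(rnorm(40), ncol = 2)   # 20 rows, 2 simulated predictors
y <- rnorm(20)

part1 <- summarize_batch(X[1:12, ], y[1:12])
part2 <- summarize_batch(X[13:20, ], y[13:20])
updated <- combine_summaries(part1, part2)

# Updating batch by batch agrees with processing all 20 rows at once.
all.equal(updated, summarize_batch(X, y))
```

Because only sums are stored, the earlier data files do not need to be re-read when new data arrive.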

The remaining arguments are passed directly to the function `read.table.ffdf()`. Below is a short description of the arguments that we recommend setting manually, in accordance with memory limitations and data structure:

| Argument | Description |
| --- | --- |
| `first.rows` | the number of rows to read in the first chunk of data. Default = 100,000. |
| `next.rows` | the number of rows to read in the remaining chunks of data. Default = 100,000. |
| `sep` | the character that separates the columns of data. Default = ",". |
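To illustrate what the chunk-size arguments control, here is a base-R sketch that reads a small comma-separated file in one first chunk and several subsequent chunks. It is only an analogy under stated assumptions: the helper `chunk_sizes_read()` is hypothetical, it uses `readLines()` and `read.table()` rather than `ff`, and it does not reproduce how `read.table.ffdf()` works internally.

```r
# Hypothetical helper: read a delimited file chunk by chunk and report
# how many rows each chunk contained.
chunk_sizes_read <- function(path, first.rows, next.rows, sep = ",") {
  con <- file(path, open = "r")
  on.exit(close(con))
  sizes <- integer(0)
  n <- first.rows                      # size of the first chunk
  repeat {
    lines <- readLines(con, n = n)     # continues from the current position
    if (length(lines) == 0) break      # end of file reached
    chunk <- read.table(text = lines, sep = sep, colClasses = "numeric")
    sizes <- c(sizes, nrow(chunk))
    n <- next.rows                     # size of all later chunks
  }
  sizes
}

# Write 10 rows of toy data (response in column 1, predictor in column 2).
tmp <- tempfile(fileext = ".csv")
write.table(data.frame(y = 1:10, x = 11:20), tmp, sep = ",",
            row.names = FALSE, col.names = FALSE)

sizes <- chunk_sizes_read(tmp, first.rows = 4, next.rows = 3)
sizes
unlink(tmp)
```

Smaller chunk sizes lower peak memory use at the cost of more read operations.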

For the arguments below the default settings should perform well. However, in some situations adjusting these arguments may improve memory use and running time.

| Argument | Description |
| --- | --- |
| `fileEncoding` | a string that describes the file's character encoding |
| `nrows` | an integer specifying how many rows should be read from the file |
| `levels` | an optional list of items with |
| `appendLevels` | a logical vector of permission to expand |
| `FUN` | specifies which standard R function is used to read the data. |
| `transFUN` | an optional filtering function to be applied to each chunk of data. See |
| `asffdf_args` | an optional list of parameters to be passed to |
| `BATCHBYTES` | an integer limiting the size of the |
| `VERBOSE` | See |
| `header` | a logical value indicating whether the first row is the header row |
| `quote` | a character string specifying which characters are treated as quoting characters |
| `dec` | the character used as the decimal point |
| `numerals` | see |
| `na.strings` | strings treated as |
| `colClasses` | a vector that describes the data types in each column. Numeric by default. |
| `skip` | the number of lines at the start of the file to skip |
| `check.names` | see |
| `fill` | a logical value that turns on automatic padding of rows that have different lengths |
| `strip.white` | affects the processing of the columns with declared |
| `blank.lines.skip` | a logical value indicating whether empty lines should be ignored |
| `comment.char` | a character specifying the comment marker |
| `allowEscapes` | logical. If |
| `flush` | see |
| `skipNul` | logical: should nuls be skipped? |

The function reads in data and computes summary statistics to be used in Bayesian linear regression by the function `bayes.regress()` (in this package). The function assumes that the linear regression model will have a non-zero y-intercept; this option can be changed in the `bayes.regress()` function (see the `bayes.regress()` help for details).

The value returned by `read.regress.data.ff()` is a list containing the summary statistics named `xtx` (for *X'X*), `xty` (for *X'Y*) and `yty` (for *Y'Y*), and the total number of data values `numsamp.data`. `xtx` contains the square matrix obtained by multiplying the transposed predictor data *X* by *X* itself, and `xty` contains the vector obtained by multiplying the transposed predictor data *X* by the response data *Y*; in both cases a leading column of 1's is added to *X* for the y-intercept term. `yty` contains the dot product of the response data *Y* with itself. `numsamp.data` is the number of data values read from the data file(s); this number may be smaller than the number of rows in the data file(s), since rows with missing data may be skipped, depending on the function arguments. The summary statistics *X'X*, *X'Y* and *Y'Y* are accumulated over the data chunks *m = 1,...,M* as follows:

$$X'X = \sum_{m=1}^{M} X_m' X_m$$

$$X'Y = \sum_{m=1}^{M} X_m' Y_m$$

$$Y'Y = \sum_{m=1}^{M} Y_m' Y_m$$
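The chunk-wise sums can be checked numerically in base R. The sketch below uses simulated in-memory data (an assumption made for illustration; the package itself reads chunks from files) split into *M = 3* chunks of unequal size, and compares the accumulated statistics with those computed from the full data.

```r
set.seed(1)
n <- 10
X <- cbind(1, matrix(rnorm(n * 2), ncol = 2))  # leading column of 1's
Y <- rnorm(n)

# Row indices of M = 3 chunks with 4, 3 and 3 rows.
chunks <- split(seq_len(n), rep(1:3, times = c(4, 3, 3)))

# Start from zero and add each chunk's contribution.
xtx <- matrix(0, ncol(X), ncol(X))
xty <- matrix(0, ncol(X), 1)
yty <- 0
for (rows in chunks) {
  Xm <- X[rows, , drop = FALSE]
  Ym <- Y[rows]
  xtx <- xtx + crossprod(Xm)       # X_m' X_m
  xty <- xty + crossprod(Xm, Ym)   # X_m' Y_m
  yty <- yty + sum(Ym^2)           # Y_m' Y_m
}

# The accumulated sums equal the statistics of the full data set.
all.equal(xtx, crossprod(X))
all.equal(xty, crossprod(X, Y))
all.equal(yty, sum(Y^2))
```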

The returned values are used as input to the function `bayes.regress()` (in this package). Note that the matrix *X* is given a leading column of 1's by default, for the y-intercept term of the Bayesian linear regression model. This can be removed by specifying a model with zero intercept in the function `bayes.regress()` (see the `bayes.regress()` help for details).
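The effect of the leading column of 1's on the summary statistics can be seen in a small base-R sketch with simulated data. It shows only the linear algebra: the first row and column of `xtx` and the first entry of `xty` belong to the intercept, and dropping them leaves the statistics of the design matrix without an intercept. How `bayes.regress()` handles a zero-intercept model internally is not shown here, and no claim is made that it works this way.

```r
set.seed(7)
X <- matrix(rnorm(30), ncol = 3)   # 10 rows, 3 predictors, no intercept column
Y <- rnorm(10)

X1  <- cbind(1, X)                 # design matrix with leading column of 1's
xtx <- crossprod(X1)               # 4 x 4
xty <- crossprod(X1, Y)            # 4 x 1

# The intercept column contributes the row count and the column sums.
xtx[1, 1] == nrow(X)
all.equal(xtx[1, -1], colSums(X))
all.equal(xty[1, 1], sum(Y))

# Dropping the intercept row/column gives the no-intercept statistics.
xtx0 <- xtx[-1, -1]
xty0 <- xty[-1, , drop = FALSE]
all.equal(xtx0, crossprod(X))
all.equal(xty0, crossprod(X, Y))
```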

Carlin, B.P. and Louis, T.A. (2009) *Bayesian Methods for Data Analysis, 3rd ed.*, Boca Raton, FL: Chapman and Hall/CRC Press.

Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A. and Rubin, D.B. (2013) *Bayesian Data Analysis, 3rd ed.*, Boca Raton, FL: Chapman and Hall/CRC Press.

Adler, D., Glaser, C., Nenadic, O., Oehlschlagel, J. and Zucchini, W. (2013) ff: memory-efficient storage of large data on disk and fast access functions. R package: http://CRAN.R-project.org/package=ff.

```
# The package includes several example data files, illustrated here.

###########
# Example 1
###########

# The following command finds the location of the data file
# that includes 4 predictor variables and 20,000 simulated data values.
filename <- system.file('data/regressiondata.nz.all.csv.gz', package='BayesSummaryStatLM')

# The file is formatted so that the simulated response variable is in the
# first column, and columns 2 to 5 contain simulated predictor variables.
# The simulated coefficients are: beta <- c(0.76, -0.92, 0.64, 0.57, -1.65),
# where the first value is the y-intercept term in the Bayesian linear
# regression model. The sigma-squared term, i.e. the variance of the normally
# distributed error terms, is simulated as: sigmasq <- 0.25

# Next, read the data and compute the summary statistics using the
# "read.regress.data.ff()" function. By default, the first column is assumed
# to be the response variable, and the remaining columns are assumed to contain
# predictor variable values. The function will check if the file exists and
# can be read.
data.values <- read.regress.data.ff(filename)
data.values

###########
# Example 2
###########

# Several files can be given in a list to be read sequentially, as follows.
filenames <- list(
  system.file('data/regressiondata.nz.pt1.csv.gz', package='BayesSummaryStatLM'),
  system.file('data/regressiondata.nz.pt2.csv.gz', package='BayesSummaryStatLM')
)
data.values <- read.regress.data.ff(filenames)
data.values

# The above results can be compared to the "data.values" obtained previously. They
# are the same, since the current files are just copies of the same data split
# between two files.

###########
# Example 3
###########

# The two files can be read progressively through time, and the summary statistics
# are then updated with data in each file, as follows.
filenames <- list(
  system.file('data/regressiondata.nz.pt1.csv.gz', package='BayesSummaryStatLM'),
  system.file('data/regressiondata.nz.pt2.csv.gz', package='BayesSummaryStatLM')
)
data.values <- read.regress.data.ff(filenames[[1]])
data.values
data.values2 <- read.regress.data.ff(filenames[[2]], update.summaries = data.values)
data.values2

###########
# Example 4
###########

# If not all columns are to be used in regression analysis, one can specify
# which columns to use in the "predictor.cols" and "response.col" options;
# the order of "predictor.cols" can also be changed. The following command
# reads in predictors from a subset of 3 columns, and changes their order.
filename <- system.file('data/regressiondata.nz.all.csv.gz', package='BayesSummaryStatLM')
data.values <- read.regress.data.ff(filename, predictor.cols=c(4,2,3), response.col=5)
data.values

###########
# Example 5
###########

# If the R session must be terminated, the summary statistics can be saved and then
# loaded using standard methods in R, as follows:
filenames <- list(
  system.file('data/regressiondata.nz.pt1.csv.gz', package='BayesSummaryStatLM'),
  system.file('data/regressiondata.nz.pt2.csv.gz', package='BayesSummaryStatLM')
)
data.values <- read.regress.data.ff(filenames[[1]])
tmpfname <- tempfile()
save(data.values, file = tmpfname)
rm(data.values)

# Now the R session can be terminated. Note that the filename "tmpfname"
# must be recorded so that it can be used for updating in a later R session.
# Upon starting a new R session, the state of the previously-calculated
# summary statistics in the file named "tmpfname" can be restored and
# then updated, as follows:
load(tmpfname)
unlink(tmpfname)

# If a new portion of a data set arrives, the summary statistics are updated
# as follows:
data.values2 <- read.regress.data.ff(filenames[[2]], update.summaries = data.values)
data.values2
```
