Given matrices ymat, xmat, and zmat, pulverize evaluates every linear regression
y = b0 + b1 x + b2 z + b3 xz
where y, x, and z are columns of ymat, xmat, and zmat, respectively, and returns the p-value for the null hypothesis b3 = 0. For example, if ymat, xmat, and zmat have 200, 300, and 400 columns, then pulverize evaluates 24 million models (200 * 300 * 400) and returns the p-value for the interaction term for each model.
ymat, xmat, zmat | Matrices of type "double". All matrices must have the same number of rows, which is the number of observations.
output_file | Output file if
colnames | Column names for result table (default: "y", "x", and "z")
pvalue_threshold | Report only p-values below threshold (default:
cores | Number of cores to use for parallelization (default: 1)
overwrite | If
suppress_return | If
.fp | File pointer, only for internal use. Necessary for the function
For reasons of computational efficiency, pulverize returns only p-values, and only for the interaction term xz. Fast run time is achieved by using the correlation coefficient between the outcome variable y and the interaction term xz to test the null hypothesis, which avoids costly matrix inversions. Further time savings come from rearranging the order in which the matrices are iterated over and from implementing the core algorithm in C++.
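The correlation-based test can be illustrated in plain R. The sketch below is not the package's C++ code; it assumes that the correlation is taken after the intercept, x, and z have been removed from both y and xz (a partial correlation), and all variable names are made up for the illustration. Under that assumption the resulting p-value matches the one reported by lm for the interaction term.

```r
set.seed(1)
n <- 100
x <- rnorm(n)
z <- rnorm(n)
y <- 0.5 * x - 0.3 * z + 0.2 * x * z + rnorm(n)
xz <- x * z

# Remove the intercept, x, and z from both y and xz (residualize).
covariates <- cbind(1, x, z)
res_y  <- lm.fit(covariates, y)$residuals
res_xz <- lm.fit(covariates, xz)$residuals

# Correlation between the residuals, converted to a t statistic.
r  <- cor(res_y, res_xz)
df <- n - 4                          # n minus the four coefficients b0..b3
t_stat <- r * sqrt(df / (1 - r^2))
p_cor  <- 2 * pt(-abs(t_stat), df)

# Reference p-value from the full regression.
p_lm <- summary(lm(y ~ x * z))$coefficients["x:z", "Pr(>|t|)"]
all.equal(p_cor, p_lm)
```

Only correlations and a t distribution lookup are needed per model, which is why no per-model matrix inversion is required.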
Once interesting models are identified based on the resulting p-values, the number of models will be greatly reduced. At this point additional model characteristics, e.g. effect estimates and standard errors, can be obtained via traditional methods such as R's lm function.
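As a sketch of this follow-up step, the code below refits a single model with lm and extracts the estimate, standard error, and p-value of the interaction term. The matrices and the selected column names ("column1", "column2", "column3") are invented for the illustration; in practice they would come from the rows of the pulverize result.

```r
set.seed(42)
nobs <- 100
make_mat <- function(ncols) {
  matrix(rnorm(nobs * ncols), ncol = ncols,
         dimnames = list(NULL, paste0("column", seq_len(ncols))))
}
ymat <- make_mat(2)
xmat <- make_mat(3)
zmat <- make_mat(4)

# One (y, x, z) combination picked for closer inspection (hypothetical choice).
d <- data.frame(y = ymat[, "column1"],
                x = xmat[, "column2"],
                z = zmat[, "column3"])

# y ~ x * z expands to y ~ x + z + x:z, i.e. b0 + b1 x + b2 z + b3 xz.
fit <- lm(y ~ x * z, data = d)
coef_table <- summary(fit)$coefficients
coef_table["x:z", c("Estimate", "Std. Error", "Pr(>|t|)")]
```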
Matrices ymat, xmat, and zmat must have column names. Missing values are imputed using their column means. If pvalue_threshold is supplied, only p-values strictly below the threshold (p < pvalue_threshold) are included in the returned data frame and saved in output_file.
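Column-mean imputation can be sketched in a few lines of base R. The helper impute_column_means below is hypothetical and only illustrates the idea; it is not a function exported by the package.

```r
# Replace each missing value with the mean of the observed values
# in the same column (hypothetical helper, for illustration only).
impute_column_means <- function(m) {
  for (j in seq_len(ncol(m))) {
    missing <- is.na(m[, j])
    m[missing, j] <- mean(m[, j], na.rm = TRUE)
  }
  m
}

m <- matrix(c(1, NA, 3,
              4, 5, NA),
            ncol = 2, dimnames = list(NULL, c("a", "b")))
imputed <- impute_column_means(m)
imputed
```

In this toy matrix the missing entry in column "a" becomes 2 (the mean of 1 and 3) and the missing entry in column "b" becomes 4.5 (the mean of 4 and 5).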
In cases where the resulting data frame would be too large to fit in memory, it is possible to write the results to output_file without returning a data frame by setting suppress_return to TRUE.
The column names of the results table are by default set to "y", "x", and "z", but can be changed using the colnames argument.
An error will be signaled if output_file already exists. Setting overwrite to TRUE will silently overwrite the file.
The computation can be parallelized by setting cores to a value greater than 1. By default only a single core is used. Note that parallelization is only supported in environments with C/C++ compilers that support OpenMP.
If suppress_return is FALSE, a data frame with columns "y", "x", "z", and "pvalue" containing the p-values for the interaction term xz of the above linear model is returned. Otherwise NULL is returned.
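To make the shape of the result concrete, the following slow reference sketch builds a comparable table with plain lm calls, one per column combination. It is not how pulverize computes its results, and the row order of the actual output is not assumed here; the helper make_mat and the 0.05 threshold are chosen only for the illustration.

```r
set.seed(1)
nobs <- 100
make_mat <- function(ncols) {
  matrix(rnorm(nobs * ncols), ncol = ncols,
         dimnames = list(NULL, paste0("column", seq_len(ncols))))
}
ymat <- make_mat(2)
xmat <- make_mat(3)
zmat <- make_mat(4)

# One row per (y, x, z) column combination: 2 * 3 * 4 = 24 models.
grid <- expand.grid(y = colnames(ymat), x = colnames(xmat),
                    z = colnames(zmat), stringsAsFactors = FALSE)

# Fit each model and keep the p-value of the interaction term,
# which is the fourth coefficient (after intercept, x, and z).
grid$pvalue <- unname(mapply(function(yn, xn, zn) {
  fit <- lm(ymat[, yn] ~ xmat[, xn] * zmat[, zn])
  summary(fit)$coefficients[4, "Pr(>|t|)"]
}, grid$y, grid$x, grid$z))

# Keep only p-values strictly below a threshold.
hits <- subset(grid, pvalue < 0.05)
nrow(grid)
```

A loop like this is practical only for a handful of columns; with hundreds of columns per matrix the number of lm calls grows into the millions.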
Shabalin, Andrey A. (2012). Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics, Oxford Univ Press, *28*, 1353-1358.
nobs <- 100
y <- matrix(rnorm(nobs * 2), ncol = 2, dimnames = list(paste0("row", 1:nobs),
                                                       paste0("column", 1:2)))
x <- matrix(rnorm(nobs * 3), ncol = 3, dimnames = list(paste0("row", 1:nobs),
                                                       paste0("column", 1:3)))
z <- matrix(rnorm(nobs * 4), ncol = 4, dimnames = list(paste0("row", 1:nobs),
                                                       paste0("column", 1:4)))
pulverize(y, x, z)