knitr::opts_chunk$set(echo = TRUE,comment = "#",collapse = TRUE, fig.align = "center",tidy = FALSE)
This is a brief introduction to the flashr
package. The examples
here will suffice for many users. For more advanced usage, see
the other vignettes.
The flashr
package fits the model $$Y = LF' + E$$ where $Y$ is an $n
\times p$ matrix, $L$ is an $n \times k$ matrix, and $F$ is a $p
\times k$ matrix, and $E$ is an $n \times p$ matrix of residuals which
(by default) are assumed to have column-specific variances:
$$E_{ij} \sim N(0,\sigma_{j}^2).$$
The columns of $L$ are called the "loadings" and columns of $F$ are called the "factors". Unlike some other methods (e.g. singular value decomoposition) these columns are not required to be orthogonal.
The key idea behind flashr
is to use Empirical Bayes methods to
estimate a "prior" distribution for each factor and each loading. This
can induce sparsity on either/or factors and loadings if the data
support this.
The scaling of the factors and loadings in the above model is arbitrary. Often it is convenient to scale the loadings and columns to have unit norm. In this case we can write
$$Y = LDF' + E,$$
where $D$ is a diagonal matrix of "weights" that capture the strength
of each factor/loading. The flash function flash_get_ldf
returns the
values of $L, D$ and $F$ in this model (with $D$ returned as a vector
of the diagonal values).
First we will simulate some data for illustration (in this case the loadings and factors are not at all sparse, so the benefits of flash over other methods might be limited.)
library(flashr) set.seed(1) n = 100 p = 500 k = 7 LL = matrix(rnorm(n*k), nrow=n) FF = matrix(rnorm(p*k), nrow=p) Y = LL %*% t(FF) + rnorm(n*p)
When running flash
, usually you will run one command to fit the
model, and a second to extract results you want from the fit. The
flash
function is the one most users will want to start with to fit
the model. The values of $L,D$ and $F$ are then stored in the ldf
field of the flash object.
By default flash
runs the greedy algorithm from Wang and Stephens.
This is usually the quickest approach. It simply adds factors one at a
time, until adding a new factor doesn't improve the fit:
f = flash(Y, verbose=FALSE) ldf = f$ldf ldf$d
Note that flash correctly stopped at 7 factors for these data. (Although it tries to fit an 8th factor, from the results we can see that this does not appear: this is because it turned out to be 0.)
Sometimes the backfitting algorithm can improve the fit (but it takes
longer). To do backfitting on a previous flash fit you can provide
that fit as the f_init
parameter, and set backfit=TRUE
. Here we
use the fit obtained from the greedy function as our initialization.
f.b = flash(Y, f_init = f, backfit=TRUE, greedy=FALSE, verbose=FALSE) f.b$ldf$d
If you know you are going to want to do backfitting anyway, you can do greedy+backfitting in one command without saving the intermediate result:
f.gb = flash(Y, backfit=TRUE, greedy=TRUE, verbose=FALSE) f.gb$ldf$d
Suppose your matrix has some missing entries and you want to fill them
in. You can do this by first running flash on your matrix and then
using flash_fill
. (This function fills in the missing entries using
the relevant elements of the estimated LDF' matrix).
To illustrate, we add a little missing data to $Y$ here before running.
Ymiss=Y Ymiss[1:20,1] = NA f = flash(Ymiss, verbose=FALSE) Ynew = flash_fill(Ymiss, f) plot(Y[1:20,1],Ynew[1:20,1])
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.