knitr::opts_chunk$set(echo = TRUE,comment = "#",collapse = TRUE, fig.align = "center",tidy = FALSE)
This vignette is designed to describe some more advanced use of flash. Most users should start with the flash_intro.html vignette.
The interface described may change as we develop flashr
.
We start by simulating some data to provide an example:
library(flashr) set.seed(1) n = 100 p = 500 k = 7 LL = matrix(rnorm(n*k),nrow=n) FF = matrix(rnorm(p*k),nrow=p) Y = LL %*% t(FF) + rnorm(n*p)
For anything beyond the simplest usage, you should start your analysis
by setting up the data in what we will call a "flash data object".
This is essentially simply an n by p matrix, but with meta-data to
allow flash
functions to deal properly with missing values
etc. (This framework also makes it easier for future extensions to
pass more information in as data; e.g. maybe variances also associated
with the matrix).
data = flash_set_data(Y)
Second, we need to initialize a factor model - essentially a set of
factors and loadings. We call this a "flash fit object", although at
this point it is not actually fit to the data (this is the next step!)
There are several ways to do this, using functions of the form
flash_add_xxx
. They are called this because they can add factors to
an existing flash fit object---by providing a value for the argument
f_init
---so you can build up a flash fit object bit-by-bit. Here
we have no existing flash fit object, so we do not specify f_init
,
and in this case these functions create a new fit object.
For illustration we show flash_add_factors_from_data
. This
essentially runs softImpute
(soft-thresholded singular value
decomposition) on the data to obtain an initial set of factors and
loadings. We have to choose a number of factors to initialize
with. (Factors can be dropped out during the fit so ideally this
should be a larger number than we think we actually need; but at the
same time more factors mean more computation, so you might start with
a moderate number, and then try increasing later if the data seem to
warrant it.)
f = flash_add_factors_from_data(data, K=10, backfit=FALSE)
Once you have initialized a flash fit object, you can improve the fit
using flash_backfit
. (This can also be done in one step by setting backfit=TRUE
above. If you want to see how the fit is progressing,
set verbose=TRUE
here.)
f = flash_backfit(data, f, verbose=FALSE)
The flash
functions return objects that store useful summary information. The ldf
field stores standardized loadings, factors and weights; fitted_values
stores the product (LDF'); and objective
stores the
measure of goodness-of-fit achieved (the variational lower bound,
F). This can be helpful for comparing multiple fits to the same data.
plot(f$fitted_values, LL %*% t(FF), main="compare fitted values with true LF'") f$ldf$d f$objective
The way we did it above was to add all the factors at once and then optimize.
An alternative is to add one at a time, and optimize each in turn. This is accomplished
using flash_add_greedy
. This will keep adding factors until they no longer improve
the fit, so you need to specify a maximum number to consider.
f_greedy = flash_add_greedy(data, Kmax=10, verbose=FALSE) f_greedy$ldf$d f_greedy$objective
After adding factors in this way you can use backfitting to try to
further improve the fit. In our experience, for different datasets,
sometimes this provides a better fit than initializing all at once (as
in f
above), sometimes worse.
f_greedy_bf = flash_backfit(data, f_greedy, verbose=FALSE) f_greedy_bf$ldf$d f_greedy_bf$objective f$objective
In simulation experiments it might be useful to initialize the loadings or factors to the "truth" to see how it affects the convergence.
f_true = flash_add_factors_from_data(LL %*% t(FF), K=ncol(LL), backfit=FALSE) f_cheat = flash_backfit(data, f_init=f_true, verbose=FALSE) f_cheat$ldf$d f_cheat$objective
Results are pretty similar both in terms of objective achieved and in terms of mean squared error (MSE). Notice that backfitting improves MSE vs greedy.
mean((f_cheat$fitted_values - LL %*% t(FF))^2) mean((f_greedy_bf$fitted_values - LL %*% t(FF))^2) mean((f$fitted_values - LL %*% t(FF))^2) mean((f_greedy$fitted_values - LL %*% t(FF))^2)
The flash algorithm involves iteratively applying a function that
solves the "Empirical Bayes normal means" (EBNM) problem. See Wang and
Stephens (2017) for more details on this. The flashr
package
provides two functions to do this: ebnm_pn
, which uses a
point-normal prior from the ebnm
package, and ebnm_ash
which
provides an interface to functions in the ashr
package.
In principle the ebnm_ash
function fits a more flexible model, so
might be preferred. However, in the paper we generally found the two
methods give similar average performance, and ebnm_pn
is (currently)
faster, and so this is the default.
f_greedy_ash = flash_add_greedy(data, Kmax=10, ebnm_fn="ebnm_ash", verbose=FALSE) f_greedy_ash$objective
Sometimes you might want to include a "fixed" factor or loading in the
analysis. For example, maybe you have a covariate you are interested
in the effect of. Or you want to include a mean term in the rows or
columns. You can add such "fixed" factors using flash_add_fixed_loadings
or
flash_add_fixed_factors
. Fixed loadings will not be updated during a
subsequent fit, but their corresponding factors will be
updated. Similarly, fixed factors will not be updated, but their
loadings will be. (Missing values, NA, in a fixed loading or factor
are allowed and will be updated during a fit.)
For example, the following creates a flash fit object with a fixed intercept loading. (So when fit it will estimate a corresponding factor, which should be interpreted as a column-specific mean.) Then it adds 10 data-based factors and loadings, and then it fits the model. Note how the first loading (the mean) does not change during the fit. (Notice also that the corresponding factor is 0---correct since we did not add a non-zero mean to the simulation.)
f = flash_add_fixed_loadings(data, LL = cbind(rep(1,n)), backfit=TRUE, verbose=FALSE) f = flash_add_factors_from_data(data, K=10, f_init=f, backfit=TRUE, verbose=FALSE) f$nfactors # tells you how many nonzero factors f has head(f$fit$EL[, 1]) # Note that factor is 0 head(f$fit$EF[, 1])
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.