rrmc — R Documentation
Description

rrmc() implements the non-convex PCP algorithm "Rank-based robust matrix
completion" as described in Cherapanamjeri et al. (2017) (see Algorithm 3),
outfitted with environmental health (EH)-specific extensions as described in
Gibson et al. (2022). Given an observed data matrix D, a maximum rank to
search up to r, and a regularization parameter eta, rrmc() seeks the best
low-rank and sparse estimates L and S using an incremental rank-based
strategy. The L matrix encodes latent patterns that govern the observed data.
The S matrix captures any extreme events in the data unexplained by the
underlying patterns in L.
rrmc()'s incremental rank-based strategy first estimates a rank-1 model
(L^(1), S^(1)), then uses that rank-1 model as the initialization point from
which to construct a rank-2 model (L^(2), S^(2)), and so on, until the
desired rank-r model (L^(r), S^(r)) is recovered. In this way, all models
from ranks 1 through r are returned by rrmc().
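The per-rank updates are internal to rrmc(), but the flavor of the
incremental alternation can be illustrated with base R. The sketch below is a
simplification under stated assumptions (a truncated SVD as the rank-k
projection and a fixed threshold in place of rrmc()'s adaptive one), not the
package's actual implementation:

# NOT pcpr's implementation -- a conceptual, simplified rank-k step:
rank_k_step <- function(D, S_prev, k, thresh) {
  s <- svd(D - S_prev)
  # Confine L to rank k via a truncated SVD (a rank-k projection):
  L_k <- s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], k, k) %*%
    t(s$v[, 1:k, drop = FALSE])
  resid <- D - L_k
  # Keep only the large residuals as sparse events (hard-thresholding):
  S_k <- ifelse(abs(resid) > thresh, resid, 0)
  list(L = L_k, S = S_k)
}

data <- sim_data()
# Rank-1 model from a zero initialization, then warm-start the rank-2 model:
step1 <- rank_k_step(data$D, S_prev = matrix(0, nrow(data$D), ncol(data$D)),
                     k = 1, thresh = 0.35)
step2 <- rank_k_step(data$D, S_prev = step1$S, k = 2, thresh = 0.35)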
Experimentally, the rrmc() approach to PCP has proven best able to handle
datasets governed by complex underlying patterns characterized by slowly
decaying singular values, such as EH data. For observed data with a
well-defined low-rank structure (rapidly decaying singular values),
root_pcp() may offer a better model estimate.
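As a practical heuristic (our suggestion, not formal pcpr guidance), the
singular value decay of D can be inspected before choosing between the two
algorithms:

# Inspect singular value decay to gauge whether D has well-defined low rank:
data <- sim_data()
sing_vals <- svd(data$D)$d
plot(sing_vals, type = "b", xlab = "Index", ylab = "Singular value")
# A sharp elbow suggests well-defined low-rank structure (try root_pcp());
# slow decay suggests complex underlying patterns (try rrmc()).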
Two EH-specific extensions are currently supported by rrmc():

1. The model can handle missing values in the input data matrix D; and
2. The model can handle measurements that fall below the limit of detection
(LOD), if the user provides LOD information.

Support for a non-negativity constraint on rrmc()'s output will be added in a
future release of pcpr.
Usage

rrmc(D, r, eta = NULL, LOD = -Inf)
Arguments

D: The input data matrix (can contain NA values).

r: An integer specifying the maximum rank to search up to. Models of every
rank from 1 through r are estimated and returned.

eta: (Optional) A positive double scaling the sparse penalty in rrmc()'s
objective function (see "The eta parameter" below). By default, eta = NULL,
in which case a default value is chosen internally.

LOD: (Optional) The limit of detection (LOD) data. Entries in D that fall
below the LOD are handled with the LOD extension described below. By default,
LOD = -Inf, i.e. no LOD information is used.
Value

A list containing:

L: The rank-r low-rank matrix encoding the r-many latent patterns governing
the observed input data matrix D. dim(L) will be the same as dim(D). To
explicitly obtain the underlying patterns, L can be used as the input to any
matrix factorization technique of choice, e.g. PCA, factor analysis, or
non-negative matrix factorization (see the sketch after this list).

S: The sparse matrix containing the rare outlying or extreme observations in
D that are not explained by the underlying patterns in the corresponding L
matrix. dim(S) will be the same as dim(D). Most entries in S are 0, while
non-zero entries identify the extreme outlying observations in D.

L_list: A list of the r-many L matrices recovered over the course of
rrmc()'s iterative optimization procedure. The first element in L_list
corresponds to the rank-1 L matrix, the second to the rank-2 L matrix, and
so on.

S_list: A list of the r-many corresponding S matrices recovered over the
course of rrmc()'s iterative optimization procedure. The first element in
S_list corresponds to the rank-1 solution's S matrix, the second to the
rank-2 solution's S matrix, and so on.

objective: A vector containing the values of rrmc()'s objective function
over the course of optimization.
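For instance, here is a minimal sketch of extracting patterns from one of the
recovered L matrices with PCA (prcomp() is just one valid factorization
choice, and eta = 0.35 is an arbitrary illustrative value):

data <- sim_data()
fit <- rrmc(data$D, r = 3, eta = 0.35)
L2 <- fit$L_list[[2]]     # the rank-2 L matrix from the incremental search
patterns <- prcomp(L2)    # PCA as the matrix factorization of choice
patterns$rotation[, 1:2]  # loadings describing the two recovered patterns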
Details

rrmc() implicitly optimizes the following objective function:

\min_{L, S} I_{\mathrm{rank}(L) \leq r} + \eta \|S\|_0 + \|L + S - D\|_F^2

The first term is the indicator function checking that the L matrix is
strictly rank r or less, implemented using a rank-r projection operator
proj_rank_r(). The second term is the \ell_0 norm applied to the S matrix to
encourage sparsity, implemented with the help of an adaptive
hard-thresholding operator hard_threshold(). The third term is the squared
Frobenius norm applied to the model's noise.
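To make the objective concrete, here is a minimal sketch (not part of pcpr's
API) that evaluates the two penalized terms for given estimates, assuming L
already satisfies the rank constraint (so the indicator term is zero) and a
fully observed D:

rrmc_objective <- function(D, L, S, eta) {
  sparse_term <- eta * sum(S != 0)              # eta * ||S||_0
  noise_term <- norm(L + S - D, type = "F")^2   # ||L + S - D||_F^2
  sparse_term + noise_term
}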
The eta parameter

The eta parameter scales the sparse penalty applied to rrmc()'s output
sparse S matrix. Larger values of eta penalize non-zero entries in S more
stringently, driving the recovery of sparser S matrices. Because there are
no other parameters scaling the other terms in rrmc()'s objective function,
eta can intuitively be thought of as the dial that balances the model's
sensitivity to extreme events (placed in S) and its sensitivity to noise Z
(captured by the last term in the objective, which measures the discrepancy
between the predicted model and the observed data). Larger values of eta
place a greater emphasis on penalizing the non-zero entries in S over
penalizing the errors between the predicted and observed data Z = L + S - D.
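This effect is easy to see empirically; the sketch below (with arbitrary
illustrative eta values) compares the sparsity of S under a small and a
large eta:

data <- sim_data()
fit_small <- rrmc(data$D, r = 3, eta = 0.05)
fit_large <- rrmc(data$D, r = 3, eta = 0.75)
mean(fit_small$S != 0)  # fraction of non-zero entries: typically larger
mean(fit_large$S != 0)  # fraction of non-zero entries: typically smaller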
The EH-specific extensions

We refer interested readers to Gibson et al. (2022) for the complete details
regarding the EH-specific extensions.

Missing value functionality: PCP assumes that the same data generating
mechanisms govern both the missing and the observed entries in D. Because
PCP primarily seeks accurate estimation of patterns rather than individual
observations, this assumption is reasonable, but in some edge cases it may
not be justified. Missing values in D are therefore reconstructed in the
recovered low-rank L matrix according to the underlying patterns in L. There
are three corollaries to keep in mind regarding the quality of recovered
missing observations:

1. Recovery of missing entries in D relies on accurate estimation of L;
2. The fewer observations there are in D, the harder it is to accurately
reconstruct L (therefore estimation of both unobserved and observed
measurements in L degrades); and
3. Greater proportions of missingness in D artificially drive up the
sparsity of the estimated S matrix. This is because it is not possible to
recover a sparse event in S when the corresponding entry in D is unobserved.
By definition, sparse events in S cannot be explained by the consistent
patterns in L. Practically, if 20% of the entries in D are missing, then at
least 20% of the entries in S will be 0 (see the sketch after this list).
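A minimal sketch illustrating the missing value functionality and the third
corollary (the 20% missingness and eta = 0.35 are arbitrary illustrative
choices):

data <- sim_data()
D_miss <- data$D
# Corrupt 20% of the entries with missing values:
miss_idx <- sample(length(D_miss), size = round(0.2 * length(D_miss)))
D_miss[miss_idx] <- NA
fit <- rrmc(D_miss, r = 3, eta = 0.35)
# Missing entries are reconstructed in L according to the recovered patterns;
# relative recovery error on the originally missing entries:
sqrt(sum((data$L[miss_idx] - fit$L[miss_idx])^2)) /
  sqrt(sum(data$L[miss_idx]^2))
# At least 20% of the entries in S must be 0 (corollary 3):
mean(fit$S == 0) >= 0.2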
Handling measurements below the limit of detection: When equipped with LOD
information, PCP treats any estimations of values known to be below the LOD
as equally valid if their approximations fall between 0 and the LOD. Over
the course of optimization, observations below the LOD are pushed into this
known range [0, LOD] using penalties from above and below: should a < LOD
estimate be < 0, it is stringently penalized, since measured observations
cannot be negative. On the other hand, if a < LOD estimate is > the LOD, it
is also heavily penalized: less so than when < 0, but more so than
observations known to be above the LOD, because we have prior information
that these observations must be below the LOD. Observations known to be
above the LOD are penalized as usual, using the Frobenius norm in the above
objective function.
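A minimal sketch of supplying LOD information (the scalar below is a
hypothetical limit shared across all columns; we do not assume anything here
about the other LOD formats the package may accept):

data <- sim_data()
lod <- 0.5  # hypothetical limit of detection shared across all columns
# Estimates of below-LOD entries are pushed into [0, lod] during optimization:
fit_lod <- rrmc(data$D, r = 3, eta = 0.35, LOD = lod)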
Gibson et al. (2022) demonstrates that in experimental settings with up to
50% of the data corrupted below the LOD, PCP with the LOD extension boasts
superior accuracy of recovered L models compared to PCA coupled with
LOD/\sqrt{2} imputation. PCP even outperforms PCA in low-noise scenarios
with as much as 75% of the data corrupted below the LOD. The few situations
in which PCA bettered PCP were those pathological cases in which D was
characterized by extreme noise and huge proportions (i.e., 75%) of
observations falling below the LOD.
References

Cherapanamjeri, Yeshwanth, Kartik Gupta, and Prateek Jain. "Nearly optimal
robust matrix completion." International Conference on Machine Learning.
PMLR, 2017.

Gibson, Elizabeth A., Junhui Zhang, Jingkai Yan, Lawrence Chillrud, Jaime
Benavides, Yanelli Nunez, Julie B. Herbstman, Jeff Goldsmith, John Wright,
and Marianthi-Anna Kioumourtzoglou. "Principal component pursuit for pattern
identification in environmental mixtures." Environmental Health Perspectives
130, no. 11 (2022): 117008.
See Also

root_pcp()
Examples

#### -------Simple simulated PCP problem-------####
# First we will simulate a simple dataset with the sim_data() function.
# The dataset will be a 100x10 matrix comprised of:
# 1. A rank-3 component as the ground truth L matrix;
# 2. A ground truth sparse component S w/outliers along the diagonal; and
# 3. A dense Gaussian noise component
data <- sim_data()
# Best practice is to conduct a grid search with grid_search_cv() function,
# but we skip that here for brevity.
pcp_model <- rrmc(data$D, r = 3, eta = 0.35)
data.frame(
"Observed_relative_error" = norm(data$L - data$D, "F") / norm(data$L, "F"),
"PCA_error" = norm(data$L - proj_rank_r(data$D, r = 3), "F") / norm(data$L, "F"),
"PCP_L_error" = norm(data$L - pcp_model$L, "F") / norm(data$L, "F"),
"PCP_S_error" = norm(data$S - pcp_model$S, "F") / norm(data$S, "F")
)