declare_latent | R Documentation |
Latent variables are unobserved variables which, for given model parameters, are random. Since they are unobserved, they cannot appear in data nor used to infer parameters. However, they can be predicted if their joint distribution with the data is learned from the reference table. Thus for inference about latent variables, these should be returned along summary statistics by the simulation function generating the samples for the reference table, but they should be declared as such so that later inference steps can distinguish them from both parameters and summary statistics. The declare_latent
function is used for that purpose.
The pplatent
function can be used to point-predict latent values, by their inferred mean or median given the (projected) summary statistics and fitted parameter values.
The latint
function provides prediction intervals for the latent variable, accounting for uncertainty in parameter estimates, using a bootstrap method (see Details).
declare_latent(reftable, latentVars)
pplatent(object, type="mean",
newDP=NULL,
sumstats= t(get_from(object,"stat.obs")),
pars=t(object$MSL$MSLE),
size=1000L, ...)
latint(object, nsim=199L, levels=c(0.025,0.975),
sumstats= t(get_from(object,"stat.obs")),
Simulate, control.Simulate=get_from(object,"control.Simulate"),
bootSamples=NULL,
...)
reftable |
A reference table of simulation as returned by |
latentVars |
A vector of names of variables to be treated as latent variables. |
object |
An object of class |
type |
Character: the only handled non-default value is |
newDP , sumstats , pars |
Matrices of data and/or (projected) summary statistics, with one column for each variable. |
levels |
Numeric vector: one-sided confidence levels (cumulative probabilities at which quantiles of the predictive distribution will be returned). |
nsim |
Integer: number of bootstrap replicates. |
size |
Integer: number of draws for estimating the median. |
Simulate |
May be used to provide the simulation function if it is not stored in the |
control.Simulate |
A list of arguments of the |
bootSamples |
A data frame. The bootstrap samples may be provided by this argument, otherwise they will be automatically simulated by the function provided by the |
... |
For |
The latint
function aims to provide intervals for the latent variable V
, with controlled coverage for given parameter values. The parametric bootstrap method of Lawless & Fredette (2005) can be adapted to the present problem. First, new data D^*
and new latent values V^*
are jointly simulated, given the summary-ML estimates (this uses the sample-generating process simulator, but these simulations are not added to the reference table). Second, their original method requires the evaluation for each new (D^*,V^*)
of F_V(V=V^*|D^*;\theta(D^*))
, the value of the cumulative distribution function F_V
of V
evaluated at V=V^*
values, given D^*
and given new parameter estimates \theta(D^*)
.
It also requires the quantile function of V
for the original data and parameter estimates. latint
uses the fit of the multivariate gaussian mixture model to the reference table, stored in the fit object, for fast parameter refitting and for fast estimation of these functions on each bootstrap sample (D^*,V^*)
.
latint(., levels=c(0.5))
will return the median of the predictive distribution,
conceptually distinct from the plug-in prediction by pplatent(slik_j, type="median")
Quantile computations in pplatent
and latint
are approximate and might be modified in some future version, but were sufficient to obtain reasonable results in simulations.
declare_latent
returns the input reftable
with modified attributes. pplatent
returns a vector of predicted latent values.
pplatent
returns a single numeric value or a vector.
latint
returns a list of matrices. Each matrix has with one column per element of levels
, and one row per row of sumstats
. There is one such matrix for each latent variable.
Lawless J. F., Fredette M. (2005) Frequentist prediction intervals and predictive distributions. Biometrika 92: 529–542, <doi:10.1093/biomet/92.3.529>
## Not run:
##### A toy example motivated by some inference problem for genomic data #####
# A model with three parameters logNe, Vs and Es is fitted.
# Data from 100 loci are here summarized by three genome-wide summary statistics
# (slogNe, sVs and sEs), and one locus-specific statistic that will provide
# information about a locus-specific latent variable.
## Simulation function
genomloc <- function(logNe=parvec["logNe"],Es=parvec["Es"],Vs=parvec["Vs"],
latent=TRUE, # returns the latent value by default
parvec) {
slogNe <- rnorm(1,logNe, sd=0.3)
genom_s <- rgamma(99, shape=Es/Vs,scale=Vs) # all loci except the focal one
sEs <- mean(genom_s)
sVs <- var(genom_s)
latentv <- rgamma(1, shape=Es/Vs,scale=Vs) # locus-specific latent variable to predict
sloc <- rnorm(1, mean=latentv-sEs,sd=latentv/3) # locus-specific statistic
resu <- list(slogNe=slogNe,sEs=sEs,sVs=sVs, sloc=sloc)
if (latent) resu$latentv <- latentv
unlist(resu)
}
#
## simulated data, standing for the actual data to be analyzed:
set.seed(123)
Sobs <- genomloc(logNe=4,Es=0.05, Vs=0.1,latent=FALSE) ## no latent value
#
workflow_design <- get_workflow_design(npar=3L, n_proj_stats=4L, n_latent=1L)
parsp <- init_reftable(lower=c(logNe=2,Es=0.001,Vs=0.001),
upper=c(logNe=6,Es=0.2,Vs=0.2),
nUnique=workflow_design$init_reft_size)
simuls <- add_reftable(Simulate=genomloc, parsTable=parsp)
simuls <- declare_latent(simuls,"latentv")
## Projections are not necessary here since the number of statistics is minimal,
# but will be discussed later.
{ ############ Without projections
{ ## Usual workflow for estimation:
densv <- infer_SLik_joint(simuls,stat.obs=Sobs)
slik_j <- MSL(densv) ## find the maximum of the log-likelihood surface
slik_j <- refine(slik_j,maxit=2,update_projectors=TRUE)
# plot1Dprof(slik_j) ## 1D profiles show parameter inference is OK
}
{ ## inference about latent values:
pplatent(slik_j)
pplatent(slik_j, type="median")
latint(slik_j, nsim=999, levels=c(0.025,0.5,0.975))
}
{ ## Assessing prediction of latent variable:
# Builds testing set:
test_simuls <- t(replicate(1000, genomloc(logNe=4,Es=0.05, Vs=0.1)))
test_data <- test_simuls[,-5]
# Point prediction:
pred <- pplatent(slik_j, sumstats = test_data)
plot(test_simuls[,"latentv"], pred); abline(0,1) # prediction vs. true latent values
}
}
{ ########## Beyond standard use of projections for estimation of parameter values,
# projections can also be used when several individual-level statistics inform about
# the latent variable, to reduce them to a single summary statistic.
# Projection will then be needed at the prediction step too.
{ # projection with latent variable as response:
platent <- (project("latentv", data=simuls, stats=c("slogNe","sEs","sVs","sloc")))
# (This example only serves to show the syntax since no dimention reduction occurs)
dprojectors <- list(SLOC=platent,slogNe=NULL,sEs=NULL, sVs=NULL)
# => As soon as one projection is used, The 'projectors' argument must include
# all projectors used for the inference, whether for parameters or for latent variables.
# NULL projectors should then be declared for raw statistics retained
# in the projected reference table.
# Apply projections on simulated statistics and 'data':
projSimuls <- project(simuls,projectors=dprojectors,verbose=FALSE)
projSobs <- project(Sobs,projectors=dprojectors)
}
{ ## Estimation:
ddensv <- infer_SLik_joint(projSimuls,stat.obs=projSobs)
dslik_j <- MSL(ddensv) ## find the maximum of the log-likelihood surface
dslik_j <- refine(dslik_j,maxit=2,update_projectors=TRUE)
# plot1Dprof(dslik_j)
}
{ ## Assessing prediction of latent variable: do not forget to project!
test_simuls <- t(replicate(1000, genomloc(logNe=4,Es=0.05, Vs=0.1)))
test_data <- test_simuls[,-5] # removing column of latent variable
ptest_data <- project(test_data,projectors=dprojectors,verbose=FALSE) # Here!
pred <- pplatent(dslik_j, sumstats = ptest_data)
plot(test_simuls[,"latentv"], pred); abline(0,1)
}
}
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.