knitr::opts_chunk$set(collapse = TRUE, comment = "#") knitr::opts_chunk$set(dev = "png", dev.args = list(type = "cairo-png"))
This tutorial demonstrates the core functionality of the lgpr
package. We
demonstrate how to visualize data, create and fit and additive GP model,
study the results, assess covariate relevances and visualize the inferred
covariate effects.
require("lgpr") require("rstan") require("ggplot2")
set.seed(1932) sim <- simulate_data(N = 12, t_data = seq(12, 96, by = 12), covariates = c( 0,2), # covariate types (2 = cat) lengthscales = c(16,24,1,16), # true effect lengthscales relevances = c(0,1,1,1), # true covariate relevances names = c("diseaseAge", "sex"), t_jitter = 0, # jitter in time points snr = 3) # signal-to-noise ratio dat <- sim@data dat$y <- 100*(dat$y + 5) a <- as.numeric(dat$id) dat$id <- as.factor(formatC(a, width = 2, flag = "0")) dat$group <- as.factor(as.numeric(!is.nan(dat$diseaseAge))) dat$sex <- as.factor(c("Male", "Female")[as.numeric(dat$sex)]) dat$group <- as.factor(c("Control", "Case")[as.numeric(dat$group)]) plot_sim(sim) plot_sim(sim, comp_idx = 2)
In this tutorial we use simulated testdata_002
, which is included in the
lgpr
package. It consists of 12 individuals and 8 time points for each. The
plot_data()
function can be used to visualize the data in many ways.
``````r plot_data(testdata_002, facet_by = "id", color_by = "sex") + xlab('Age (months)')
Coloring according to the disease-related age (`diseaseAge`) shows that individuals 01-06 are cases and 07-12 are controls. The observed disease onset is at around 60 months for each case individual. ``````r plot_data(testdata_002, facet_by = "id", color_by = "diseaseAge") + scale_color_gradient2() + xlab('Age (months)')
The vignette "Mathematical description of lgpr models"
describes the statistical models and how they can be customized in lgpr
.
Here we have code examples.
The class lgpmodel
represents an additive Gaussian process model. Such models
can be created in lgpr
using the function create_model()
, and they
can be fit by calling sample_model()
. The easiest way, however, is probably
to use the lgp()
function which wraps both create_model()
and
sample_model()
and has all the same arguments. The complete
information about these arguments is in the documentation, but here we review
the most common ones, i.e. formula
, data
and prior
.
In this section we focus on defining a model with
create_model()
, and the same syntax applies to lgp()
.
In R, you can see the more help about the mentioned (and any other) functions
using ?
, as in ?lgp
.
The data
argument to lgp()
or create_model()
should be a data.frame
where continuous variables have a numeric type and categorical variables
are factors. Our test data is already in this format.
``````r str(testdata_002)
If you don't know how to create a `data.frame` for your own data, consult for example [this](https://bookdown.org/ndphillips/YaRrr/matricesdataframes.html) or [this](https://www.tutorialspoint.com/r/r_data_frames.htm). Functions in `lgpr` that can help in adding new variables to a data frame are `add_factor()` and `add_dis_age()`. ## Specifying the model formula The `formula` argument to `lgp()` or `create_model()` specifies the response variable, model components and covariates. ### Common formula syntax Here we create a model with the continuous variable `age` and categorical variables `sex` and subject `id` as predictors for `y`. ```r model <- create_model(y ~ age + age|sex + age|id, testdata_002, verbose = FALSE) print(model)
The formula y ~ age + age|sex + age|id
has the same format as for example in
the lme4
package, which uses linear mixed effect models. In lgpr
, this
formula creates a model with three components:
If we wanted effects 2. and 3. to not depend on age, we could write the formula
as just y ~ age + sex + id
.
The above formula was actually automatically converted to the more specific
format y ~ gp(age) + gp(age)*zs(sex) + gp(age)*zs(id)
. The lgpr
package has
this advanced syntax, because the common formula syntax of R is not
expressive enough for all kinds of components that our models can have.
Formulae specified using the advanced syntax consist of terms which can combine
the following expressions through addition and multiplication:
gp(x)
specifies a standard GP with exponentiated quadratic kernelgp_ns(x)
specifies a GP with a nonstationary kernelgp_vm(x)
specifies a GP with a variance-masked kernelzs(z)
specifies a GP with zero-sum kernelcateg(z)
specifies a GP with categorical kernelunc(z)
have uncertainty in their
continuous covariate, so that each level of z
introduces one uncertainty
parameterhet(z)
are have heterogeneity in the effect of their
continuous covariate, so that each level of z
introduces one level-specific
magnitude parameterAbove, x
was used to denote any continuous and z
any categorical variable.
Each term can contain at most one continuous variable. Components which contain
gp(x)
, gp_ns(x)
, or gp_vm(x)
and have NaN
or NA
in the data of x
are automatically multiplied by a mask for the missing values.
Here we define the formula using the advanced syntax.
model <- create_model(y ~ gp(age) + zs(id)*gp(age) + zs(sex)*gp(age) + gp_vm(diseaseAge) + zs(group), testdata_002)
We received a warning because we used gp_vm()
without specifying a prior.
This is because the default prior only makes sense if the covariate
(diseaseAge
) is expressed in months and the measurement time interval
is several years. We will get back to this in Section 2.5
print(model)
As can be seen in the above output, the model has five magnitude parameters
(alpha
), one for each component, and three lengthscale parameters (ell
),
since the zs(group)
component does not have a lengthscale parameter. The
other parameters are the input warping steepness parameter (wrp
) for the
nonstationary gp_vm(diseaseAge)
component, and the Gaussian noise magnitude
parameter sigma
.
Note: the continuous covariate age
and the response variable y
are
normalized to have zero mean and unit variance, and the inference of the
lengthscale, magnitude, and noise parameters is done on that scale.
The prior
argument to lgp()
or create_model()
specifies the prior
distribution for kernel hyperparameters and other model parameters. In above
model, we did not speficy prior
, so default priors were
used. Custom priors can be given as a named list, like here:
prior <- list( alpha = normal(mu = 0, sigma = 1), # gaussian for magnitudes <alpha> ell = igam(shape = 5, scale = 5) # inverse gamma for lengthscales <ell> ) model <- create_model(y ~ age + age|sex + age|id, data = testdata_002, prior = prior) param_summary(model)
Specifying the prior
argument as a named list gave the same prior for
all parameters matching that name. If we wanted separate priors for example
for each lengthscale parameter, we could specify ell
itself as a list
with length 3.
#To check whether a given prior makes sense, we can sample from the prior #predictive distribution prior_fit <- lgp(distance ~ age + age|Sex + age|Subject, dat, sample_f = TRUE, prior_only = TRUE, iter = 1000, refresh = 200, chains = 1) #and compare the distribution of drawn "data" to the real data #See more about predictive checks [here](https://cran.r-project.org/web/packages/bayesplot/vignettes/graphical-ppcs.html). print(prior_fit) ppc(prior_fit, dat)
It is not recommended to use default priors blindly. Rather, priors should
be specified according to the knowledge about the problem at hand, as in any
Bayesian analysis. In lgpr
this is especially important when
sample_f = TRUE
.
In this case the response variable is not
normalized, so the scale on which the data varies must be taken into
account when defining priors of the signal magnitude parameters
\code{alpha} and possible noise parameters (sigma
, phi
,
gamma
). Also it should be checked if c_hat
is set in a
sensible way.gp_ns(x)
or gp_vm(x)
expression in its formula. In this case the corresponding covariate
x
is not normalized, and the prior for the input warping steepness
parameter wrp
must be set according to the expected width of the
window in which the nonstationary effect of x
occurs. By default,
the width of this window is about 36, which has been set assuming that
the unit of x
is months.The disease-related age diseaseAge
is not normalized, and we need to take
its range into account when setting the prior for the wrp
parameter. Here
we define a log-normal prior, take draws from this prior and visualize the
input warping function using the drawn steepness values.
num_draws <- 300 mu_wrp <- -0.7 sigma_wrp <- 0.3 r <- range(testdata_002$diseaseAge, na.rm = TRUE) # Define prior for lgp() my_prior <- list(wrp = log_normal(mu_wrp, sigma_wrp)) # Prior draws wrp_draws <- stats::rlnorm(num_draws, mu_wrp, sigma_wrp) # Plot corresponding input warping functions # - this function is not exported from lgpr so we need to use triple colons ::: # - alpha here is line opacity, not a GP parameter lgpr:::plot_inputwarp(wrp_draws, seq(r[1], r[2], by = 1), alpha = 0.1)
Higher wrp
values mean more rapid changes in the nonstationary component and
lower values mean slower changes. We see that our prior seems to allow changes
in the input warping function only in the interval $[-20, 20]$. This means that
all possible variation in the diseaseAge
component occurs at most
$\pm 20$ months before/after the disease effect time/onset.
A model created using create_model()
could be fitted using the
sample_model()
function. However, in this section we show
how to do both model creation and model fitting using the lgp()
function.
Here we define a model with four components and fit it. The gp(age)
component
is a shared age effect, zs(sex)*gp(age)
is a sex-specific deviation from it,
gp_vm(diseaseAge)
is a non-stationary variance-masked disease effect, and
zs(group)
is a static offset component between the two groups (Case/Control).
fit <- lgp(y ~ gp(age) + zs(id)*gp(age) + zs(sex)*gp(age) + gp_vm(diseaseAge) + zs(group), data = testdata_002, prior = my_prior, # defined earlier iter = 2000, chains = 4, refresh = 500)
The lgp()
function can be given any arguments that should be passed to
rstan::sampling().
We could use for example cores = 4
to parallelize the chains or
control = list(adapt_delta = 0.95)
to set the target acceptance rate.
Printing the fit object gives info about the posterior distribution of the parameters, MCMC settings and convergence diagnostics.
print(fit)
Distribution of the parameter draws can be visualized.
plot_draws(fit, type = 'dens')
The slot fit@stan_fit
is a stanfit
object and can therefore be studied more
as is done
here
sf <- fit@stan_fit sampler_params <- rstan::get_sampler_params(sf, inc_warmup = FALSE) myfun <- function(x) mean(x[, "accept_stat__"]) mean_accept_stat_by_chain <- sapply(sampler_params, myfun) print(mean_accept_stat_by_chain)
Here we compute the average relevance of each component. We see that id
and
group
have a very low relevance.
Relevance <- relevances(fit, reduce = mean) data.frame(Relevance)
Components can be selected using a threshold for the proportion of total
explained variance. We see that id
and group
are not selected.
select(fit, threshold = 0.95)
In order to avoid having to set a fixed threshold
for selection, we can
integrate over a threshold density on the [0,1] interval, and obtain
probabilistic selection. Here we define the density to be
stats::dbeta(x, 20, 2)
, which has most mass between $0.9$ and $1.0$.
threshold_density <- function(x) {stats::dbeta(x, 20, 2)} s <- select.integrate(fit, p = threshold_density) print(s$expected)
The above table shows for each component the expectation value of being selected.
We visualize the predictive distribution of the model, using posterior
mean hyperparameters (reduce = mean
). The predictive mean ($\pm 2 \times$
standard deviation) are computed at 120 points for each of the 12 individuals
(1440 total prediction points created using new_x()
) and visualized against
the data.
t <- seq(1, 120, by = 1) x_pred <- new_x(testdata_002, t, x_ns = "diseaseAge") p <- pred(fit, x_pred, reduce = mean) # use posterior mean kernel parameters plot_pred(fit, pred = p)
Also the posterior distribution of each component can be visualized
plot_components(fit, pred = p, color_by = c(NA, NA, "sex", "group", "group", "group"), ylim = c(-3,2), ncol = 2)
Note: the plot_pred()
function scales the predictions back to the
original data scale, whereas the plot_components()
function plots the
inferred functions on the (normalized) inference scale.
Note: the plot_pred()
and plot_components()
functions can be used
without the x
and pred
arguments, too, in which case computing
out-of-sample predictions is not be needed. The predictive and component
posteriors are then shown only at the data points.
plot_pred(fit) plot_components(fit, color_by = c(NA, NA, "sex", "group", "group", "group"), ylim = c(-3,2), ncol = 2)
Below is the original environment that was used to run this tutorial.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.