knitr::opts_chunk$set(collapse = TRUE, comment = "#")
knitr::opts_chunk$set(dev = "png", dev.args = list(type = "cairo-png"))

This tutorial demonstrates the core functionality of the lgpr package. We demonstrate how to visualize data, create and fit and additive GP model, study the results, assess covariate relevances and visualize the inferred covariate effects.

require("lgpr")
require("rstan")
require("ggplot2")
set.seed(1932)
sim <- simulate_data(N            = 12,                     
                     t_data       = seq(12, 96, by = 12),
                     covariates   = c(    0,2), # covariate types (2 = cat)
                     lengthscales = c(16,24,1,16), # true effect lengthscales
                     relevances   = c(0,1,1,1), # true covariate relevances
                     names        = c("diseaseAge", "sex"),
                     t_jitter     = 0,  # jitter in time points
                     snr          = 3)    # signal-to-noise ratio
dat <- sim@data
dat$y <- 100*(dat$y + 5)
a <- as.numeric(dat$id)
dat$id <- as.factor(formatC(a, width = 2, flag = "0"))
dat$group <- as.factor(as.numeric(!is.nan(dat$diseaseAge)))
dat$sex <- as.factor(c("Male", "Female")[as.numeric(dat$sex)])
dat$group <- as.factor(c("Control", "Case")[as.numeric(dat$group)])
plot_sim(sim)
plot_sim(sim, comp_idx = 2)

Visualizing longitudinal data

In this tutorial we use simulated testdata_002, which is included in the lgpr package. It consists of 12 individuals and 8 time points for each. The plot_data() function can be used to visualize the data in many ways.

``````r plot_data(testdata_002, facet_by = "id", color_by = "sex") + xlab('Age (months)')

Coloring according to the disease-related age (`diseaseAge`) shows that
individuals 01-06 are cases and 07-12 are controls. The observed disease onset
is at around 60 months for each case individual.

``````r
plot_data(testdata_002, facet_by = "id", color_by = "diseaseAge") + scale_color_gradient2() + xlab('Age (months)')

Creating a model

The vignette "Mathematical description of lgpr models" describes the statistical models and how they can be customized in lgpr. Here we have code examples.

The class lgpmodel represents an additive Gaussian process model. Such models can be created in lgpr using the function create_model(), and they can be fit by calling sample_model(). The easiest way, however, is probably to use the lgp() function which wraps both create_model() and sample_model() and has all the same arguments. The complete information about these arguments is in the documentation, but here we review the most common ones, i.e. formula, data and prior.

In this section we focus on defining a model with create_model(), and the same syntax applies to lgp(). In R, you can see the more help about the mentioned (and any other) functions using ?, as in ?lgp.

Specifying the data

The data argument to lgp() or create_model() should be a data.frame where continuous variables have a numeric type and categorical variables are factors. Our test data is already in this format.

``````r str(testdata_002)

If you don't know how to create a `data.frame` for your own data,
consult for example 
[this](https://bookdown.org/ndphillips/YaRrr/matricesdataframes.html) or [this](https://www.tutorialspoint.com/r/r_data_frames.htm).
Functions in `lgpr` that can help in adding new variables to a data frame are
`add_factor()` and `add_dis_age()`.

## Specifying the model formula
The `formula` argument to `lgp()` or `create_model()` specifies
the response variable, model components and covariates.

### Common formula syntax
Here we create a model with the continuous variable `age` and categorical
variables `sex` and subject `id` as predictors for `y`.

```r
model <- create_model(y ~ age + age|sex + age|id, testdata_002, verbose = FALSE)
print(model)

The formula y ~ age + age|sex + age|id has the same format as for example in the lme4 package, which uses linear mixed effect models. In lgpr, this formula creates a model with three components:

  1. the shared effect of age
  2. the sex-specific deviation from the shared age effect
  3. the subject-specific deviation from the shared age effect

If we wanted effects 2. and 3. to not depend on age, we could write the formula as just y ~ age + sex + id.

Advanced formula syntax

The above formula was actually automatically converted to the more specific format y ~ gp(age) + gp(age)*zs(sex) + gp(age)*zs(id). The lgpr package has this advanced syntax, because the common formula syntax of R is not expressive enough for all kinds of components that our models can have. Formulae specified using the advanced syntax consist of terms which can combine the following expressions through addition and multiplication:

Above, x was used to denote any continuous and z any categorical variable. Each term can contain at most one continuous variable. Components which contain gp(x), gp_ns(x), or gp_vm(x) and have NaN or NA in the data of x are automatically multiplied by a mask for the missing values.

Here we define the formula using the advanced syntax.

model <- create_model(y ~ gp(age) + zs(id)*gp(age) + zs(sex)*gp(age) + gp_vm(diseaseAge) + zs(group),
                      testdata_002)

We received a warning because we used gp_vm() without specifying a prior. This is because the default prior only makes sense if the covariate (diseaseAge) is expressed in months and the measurement time interval is several years. We will get back to this in Section 2.5

print(model)

Model parameters

As can be seen in the above output, the model has five magnitude parameters (alpha), one for each component, and three lengthscale parameters (ell), since the zs(group) component does not have a lengthscale parameter. The other parameters are the input warping steepness parameter (wrp) for the nonstationary gp_vm(diseaseAge) component, and the Gaussian noise magnitude parameter sigma.

Note: the continuous covariate age and the response variable y are normalized to have zero mean and unit variance, and the inference of the lengthscale, magnitude, and noise parameters is done on that scale.

Specifying parameter priors

The prior argument to lgp() or create_model() specifies the prior distribution for kernel hyperparameters and other model parameters. In above model, we did not speficy prior, so default priors were used. Custom priors can be given as a named list, like here:

prior <- list(
  alpha = normal(mu = 0, sigma = 1), # gaussian for magnitudes <alpha>
  ell = igam(shape = 5, scale = 5) # inverse gamma for lengthscales <ell>
)

model <- create_model(y ~ age + age|sex + age|id,
                      data = testdata_002, 
                      prior = prior)
param_summary(model)

Specifying the prior argument as a named list gave the same prior for all parameters matching that name. If we wanted separate priors for example for each lengthscale parameter, we could specify ell itself as a list with length 3.

#To check whether a given prior makes sense, we can sample from the prior
#predictive distribution
prior_fit <- lgp(distance ~ age + age|Sex + age|Subject, dat,
           sample_f = TRUE,
           prior_only = TRUE,
           iter = 1000,
           refresh = 200,
           chains = 1)

#and compare the distribution of drawn "data" to the real data
#See more about predictive checks [here](https://cran.r-project.org/web/packages/bayesplot/vignettes/graphical-ppcs.html).
print(prior_fit)
ppc(prior_fit, dat)

When to not use default priors

It is not recommended to use default priors blindly. Rather, priors should be specified according to the knowledge about the problem at hand, as in any Bayesian analysis. In lgpr this is especially important when

Prior for the input warping steepness

The disease-related age diseaseAge is not normalized, and we need to take its range into account when setting the prior for the wrp parameter. Here we define a log-normal prior, take draws from this prior and visualize the input warping function using the drawn steepness values.

num_draws <- 300
mu_wrp <- -0.7
sigma_wrp <- 0.3
r <- range(testdata_002$diseaseAge, na.rm = TRUE)

# Define prior for lgp()
my_prior <- list(wrp = log_normal(mu_wrp, sigma_wrp))

# Prior draws
wrp_draws <- stats::rlnorm(num_draws, mu_wrp, sigma_wrp)

# Plot corresponding input warping functions
# - this function is not exported from lgpr so we need to use triple colons :::
# - alpha here is line opacity, not a GP parameter
lgpr:::plot_inputwarp(wrp_draws, seq(r[1], r[2], by = 1), alpha = 0.1)

Higher wrp values mean more rapid changes in the nonstationary component and lower values mean slower changes. We see that our prior seems to allow changes in the input warping function only in the interval $[-20, 20]$. This means that all possible variation in the diseaseAge component occurs at most $\pm 20$ months before/after the disease effect time/onset.

Fitting a model

A model created using create_model() could be fitted using the sample_model() function. However, in this section we show how to do both model creation and model fitting using the lgp() function.

Sampling the posterior

Here we define a model with four components and fit it. The gp(age) component is a shared age effect, zs(sex)*gp(age) is a sex-specific deviation from it, gp_vm(diseaseAge) is a non-stationary variance-masked disease effect, and zs(group) is a static offset component between the two groups (Case/Control).

fit <- lgp(y ~ gp(age) + zs(id)*gp(age) + zs(sex)*gp(age) + gp_vm(diseaseAge) + zs(group),
           data     = testdata_002,
           prior    = my_prior, # defined earlier
           iter     = 2000,
           chains   = 4,
           refresh  = 500)

The lgp() function can be given any arguments that should be passed to rstan::sampling(). We could use for example cores = 4 to parallelize the chains or control = list(adapt_delta = 0.95) to set the target acceptance rate.

Studying the posterior

Printing the fit object gives info about the posterior distribution of the parameters, MCMC settings and convergence diagnostics.

print(fit)

Distribution of the parameter draws can be visualized.

plot_draws(fit, type = 'dens')

The slot fit@stan_fit is a stanfit object and can therefore be studied more as is done here

sf <- fit@stan_fit
sampler_params <- rstan::get_sampler_params(sf, inc_warmup = FALSE)
myfun <- function(x) mean(x[, "accept_stat__"])
mean_accept_stat_by_chain <- sapply(sampler_params, myfun)
print(mean_accept_stat_by_chain)

Assessing component relevances

Here we compute the average relevance of each component. We see that id and group have a very low relevance.

Relevance <- relevances(fit, reduce = mean)
data.frame(Relevance)

Components can be selected using a threshold for the proportion of total explained variance. We see that id and group are not selected.

select(fit, threshold = 0.95)

In order to avoid having to set a fixed threshold for selection, we can integrate over a threshold density on the [0,1] interval, and obtain probabilistic selection. Here we define the density to be stats::dbeta(x, 20, 2), which has most mass between $0.9$ and $1.0$.

threshold_density <- function(x) {stats::dbeta(x, 20, 2)}
s <- select.integrate(fit, p = threshold_density)
print(s$expected)

The above table shows for each component the expectation value of being selected.

Out-of-sample predictions

We visualize the predictive distribution of the model, using posterior mean hyperparameters (reduce = mean). The predictive mean ($\pm 2 \times$ standard deviation) are computed at 120 points for each of the 12 individuals (1440 total prediction points created using new_x()) and visualized against the data.

t <- seq(1, 120, by = 1)
x_pred <- new_x(testdata_002, t, x_ns = "diseaseAge")
p <- pred(fit, x_pred, reduce = mean) # use posterior mean kernel parameters
plot_pred(fit, pred = p)

Visualizing the inferred covariate effects

Also the posterior distribution of each component can be visualized

plot_components(fit,
                pred = p,
                color_by = c(NA, NA, "sex", "group", "group", "group"),
                ylim = c(-3,2),
                ncol = 2)

Additional notes

Note: the plot_pred() function scales the predictions back to the original data scale, whereas the plot_components() function plots the inferred functions on the (normalized) inference scale.

Note: the plot_pred() and plot_components() functions can be used without the x and pred arguments, too, in which case computing out-of-sample predictions is not be needed. The predictive and component posteriors are then shown only at the data points.

plot_pred(fit)
plot_components(fit,                 
                color_by = c(NA, NA, "sex", "group", "group", "group"),
                ylim = c(-3,2),
                ncol = 2)

Computing environment

Below is the original environment that was used to run this tutorial.

sessionInfo()


jtimonen/lgpr documentation built on Oct. 12, 2023, 11:13 p.m.