sim_mHMM: Simulate data using a multilevel hidden Markov model


sim_mHMM    R Documentation

Simulate data using a multilevel hidden Markov model

Description

sim_mHMM simulates data for multiple subjects, for which the data have either categorical or continuous (i.e., normally distributed) observations that follow a hidden Markov model (HMM) with a multilevel structure. The multilevel structure implies that each subject is allowed to have its own set of parameters, and that the parameters at the subject level (level 1) are tied together by a population distribution at level 2 for each of the corresponding parameters. The shape of the population distribution for each of the parameters is a normal distribution. In addition to (natural and/or unexplained) heterogeneity between subjects, the subjects' parameters can also depend on a covariate.

Usage

sim_mHMM(
  n_t,
  n,
  data_distr = "categorical",
  gen,
  gamma,
  emiss_distr,
  start_state = NULL,
  xx_vec = NULL,
  beta = NULL,
  var_gamma = 0.1,
  var_emiss = NULL,
  return_ind_par = FALSE,
  m,
  n_dep,
  q_emiss
)

Arguments

n_t

Numeric vector with length 1 denoting the length of the observed sequence to be simulated for each subject. To only simulate subject specific transition probability matrices gamma and emission distributions (and no data), set n_t to 0.

n

Numeric vector with length 1 denoting the number of subjects for which data is simulated.

data_distr

String vector with length 1 describing the observation type of the data. Currently supported are 'categorical' and 'continuous'. Note that for multivariate data, all dependent variables are assumed to be of the same observation type. The default is data_distr = 'categorical'.

gen

List containing the following elements denoting the general model properties:

  • m: numeric vector with length 1 denoting the number of hidden states

  • n_dep: numeric vector with length 1 denoting the number of dependent variables

  • q_emiss: only to be specified if the data represents categorical data. Numeric vector with length n_dep denoting the number of observed categories for the categorical emission distribution for each of the dependent variables.

gamma

A matrix with m rows and m columns containing the average population transition probability matrix used for simulating the data. That is, the probability to switch from hidden state i (row i) to hidden state j (column j).

emiss_distr

A list with n_dep elements containing the average population emission distribution(s) of the observations given the hidden states for each of the dependent variables. If data_distr = 'categorical', each element is a matrix with m rows and q_emiss[k] columns for each of the k in n_dep emission distribution(s). That is, the probability of observing category q (column q) in state i (row i). If data_distr = 'continuous', each element is a matrix with m rows and 2 columns; the first column denoting the mean of state i (row i) and the second column denoting the standard deviation of state i (row i) of the Normal distribution.

start_state

Optional numeric vector with length 1 denoting in which state the simulated state sequence should start. If left unspecified, the simulated state for time point 1 is sampled from the initial state distribution (which is derived from the transition probability matrix gamma).

xx_vec

List of 1 + n_dep vectors containing the covariate(s) to predict the transition probability matrix gamma and/or (specific) emission distribution(s) emiss_distr using the regression parameters specified in beta (see below). The first element in the list xx_vec is used to predict the transition matrix. Subsequent elements in the list are used to predict the emission distribution of (each of) the dependent variable(s). This means that the covariate used to predict gamma and emiss_distr can either be the same covariate, different covariates, or a covariate for certain elements and none for the other. At this point, it is only possible to use one covariate for both gamma and emiss_distr. For all elements in the list, the number of observations in the vectors should be equal to the number of subjects to be simulated n. If xx_vec is omitted completely, xx_vec defaults to NULL, resembling no covariates at all. Specific elements in the list can also be left empty (i.e., set to NULL) to signify that either the transition probability matrix or (one of) the emission distribution(s) is not predicted by covariates.

beta

List of 1 + n_dep matrices containing the regression parameters to predict gamma and/or emiss_distr in combination with xx_vec using (Multinomial logistic) regression. The first matrix is used to predict the transition probability matrix gamma. The subsequent matrices are used to predict the emission distribution(s) emiss_distr of the dependent variable(s). For gamma and categorical emission distributions, one regression parameter is specified for each element in gamma and emiss_distr, with the following exception. The first element in each row of gamma and/or emiss_distr is used as reference category in the Multinomial logistic regression. As such, no regression parameters can be specified for these elements. Hence, the first element in the list beta, used to predict gamma, consists of a matrix with the number of rows equal to m and the number of columns equal to m - 1. For categorical emission distributions, each subsequent element in the list beta, used to predict emiss_distr, consists of a matrix with the number of rows equal to m and the number of columns equal to q_emiss[k] - 1 for each of the k in n_dep emission distribution(s). See details for more information. For continuous emission distributions, each subsequent element in the list beta consists of a matrix with the number of rows equal to m and 1 column.

Note that if beta is specified, xx_vec has to be specified as well. If beta is omitted completely, beta defaults to NULL, meaning that gamma and emiss_distr are not predicted using covariates. One of the elements in the list can also be left empty (i.e., set to NULL) to signify that either the transition probability matrix or a specific emission distribution is not predicted by covariates.
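As a rough illustration of the shapes described for beta and xx_vec, the following base R sketch builds both lists for a hypothetical model with m = 3 states and one categorical dependent variable with 4 categories, in which only gamma is predicted by a covariate. The numeric values are arbitrary and serve only to show the dimensions.

```r
m       <- 3
n_dep   <- 1
q_emiss <- 4
n       <- 10

# One list slot for gamma plus one per dependent variable;
# vector("list", k) creates k slots that are NULL by default
beta   <- vector("list", 1 + n_dep)
xx_vec <- vector("list", 1 + n_dep)

# Regression parameters for gamma: m rows and m - 1 columns,
# because the first column of gamma is the reference category
beta[[1]] <- matrix(c( 0.5, 1.0,
                      -0.5, 0.5,
                       0.0, 1.0), nrow = m, ncol = m - 1, byrow = TRUE)

# One covariate value per subject, here a dichotomous group indicator
xx_vec[[1]] <- c(rep(0, n / 2), rep(1, n / 2))

# The second slot stays NULL: the emission distribution is not predicted
dim(beta[[1]])        # 3 2
length(xx_vec[[1]])   # 10
is.null(beta[[2]])    # TRUE
```

If the emission distribution were predicted as well, beta[[2]] would under the same rules be a matrix with m rows and q_emiss - 1 = 3 columns.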

var_gamma

A numeric vector with length 1 denoting the amount of variance between subjects in the transition probability matrix. Note that this value corresponds to the variance of the parameters of the Multinomial distribution (i.e., the intercepts of the regression equation of the Multinomial distribution used to sample the transition probability matrix), see details below. In addition, only one variance value can be specified for the complete transition probability matrix, hence the variance is assumed fixed across all components. The default equals 0.1, which corresponds to little variation between subjects. If one wants to simulate data from exactly the same HMM for all subjects, var_gamma should be set to 0. Note that if data for only 1 subject is simulated (i.e., n = 1), var_gamma is set to 0.

var_emiss

A numeric vector with length n_dep denoting the amount of variance between subjects in the emission distribution(s). For categorical data, this value corresponds to the variance of the parameters of the Multinomial distribution (i.e., the intercepts of the regression equation of the Multinomial distribution used to sample the components of the emission distribution), see details below. For continuous data, this value corresponds to the variance in the mean of the emission distribution(s) across subjects. Note that only one variance value can be specified for each emission distribution, hence the variance is assumed fixed across states (and, for the categorical distribution, categories within a state) within an emission distribution. The default equals 0.1, which corresponds to little variation between subjects given categorical observations. If one wants to simulate data from exactly the same HMM for all subjects, var_emiss should be set to a vector of 0's. Note that if data for only 1 subject is simulated (i.e., n = 1), var_emiss is set to a vector of 0's.
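For continuous data, the role of var_emiss can be pictured with a small base R sketch. It assumes, for illustration only, that each subject's state-dependent mean is drawn from a normal distribution centred on the population mean with variance var_emiss; the variable names are hypothetical and the code is not taken from the package.

```r
set.seed(1)
n         <- 5                  # number of subjects
pop_mean  <- c(50, 100, 150)    # population mean for each hidden state
var_emiss <- 4                  # between-subject variance of the means

# One row per subject, one column per state; rnorm() expects a standard
# deviation, so the variance is converted with sqrt()
subj_means <- t(replicate(n, rnorm(length(pop_mean), mean = pop_mean,
                                   sd = sqrt(var_emiss))))
dim(subj_means)   # 5 3

# With a variance of 0 every subject gets exactly the population means
same_means <- t(replicate(n, rnorm(length(pop_mean), mean = pop_mean, sd = 0)))
```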

return_ind_par

A logical scalar indicating whether the subject specific transition probability matrix gamma and emission probability matrix emiss_distr should be returned by the function (return_ind_par = TRUE) or not (return_ind_par = FALSE). The default is return_ind_par = FALSE.

m

The argument m is deprecated; please specify using the input parameter gen.

n_dep

The argument n_dep is deprecated; please specify using the input parameter gen.

q_emiss

The argument q_emiss is deprecated; please specify using the input parameter gen (only to be specified when simulating categorical data).

Details

In simulating the data, having a multilevel structure means that the parameters for each subject are sampled from the population level distribution of the corresponding parameter. The user specifies the population distribution for each parameter: the average population transition probability matrix and its variance, and the average population emission distribution and its variance. For now, the variance of the mean population parameters is assumed fixed for all components of the transition probability matrix and for all components of the emission distribution.
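The idea of sampling subject parameters from a population distribution can be sketched in base R for the transition probability matrix. The helper sample_subject_gamma below is hypothetical and is not the package's implementation; it assumes that the population probabilities are converted to Multinomial logit intercepts with the first column as reference, that normally distributed deviations with variance var_gamma are added, and that the result is transformed back to probabilities.

```r
set.seed(42)
gamma <- matrix(c(0.8, 0.1, 0.1,
                  0.2, 0.7, 0.1,
                  0.2, 0.2, 0.6), ncol = 3, byrow = TRUE)
var_gamma <- 0.1

# Hypothetical helper: draw one subject specific matrix around `gamma`
sample_subject_gamma <- function(gamma, var_gamma) {
  m <- nrow(gamma)
  # Population intercepts on the logit scale, first column as reference
  int_pop <- log(gamma[, -1, drop = FALSE] / gamma[, 1])
  # Subject deviations, using the same variance for every component
  int_subj <- int_pop +
    matrix(rnorm(m * (m - 1), mean = 0, sd = sqrt(var_gamma)), nrow = m)
  # Back-transform each row to probabilities
  expo <- cbind(1, exp(int_subj))
  expo / rowSums(expo)
}

gamma_subj <- sample_subject_gamma(gamma, var_gamma)
rowSums(gamma_subj)   # each row sums to 1
```

Under these assumptions, setting var_gamma to 0 returns the population matrix itself, which matches the documented behaviour that a variance of 0 gives all subjects the same HMM.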

One can simulate multivariate data. That is, the hidden states depend on more than 1 observed variable simultaneously. The distributions of multiple dependent variables for multivariate data are assumed to be independent, and all distributions for one dataset have to be of the same type (either categorical or continuous).
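To make the independence assumption concrete, the following base R sketch simulates one subject with two continuous dependent variables. It is a simplified illustration rather than the package's code: the chain is started in state 1, and given the state each dependent variable is drawn from its own normal distribution.

```r
set.seed(123)
n_t   <- 100
m     <- 3
gamma <- matrix(c(0.8, 0.1, 0.1,
                  0.2, 0.7, 0.1,
                  0.2, 0.2, 0.6), ncol = m, byrow = TRUE)

# Per dependent variable: column 1 = mean, column 2 = standard deviation
emiss_distr <- list(matrix(c( 50, 10,
                             100, 10,
                             150, 10), nrow = m, byrow = TRUE),
                    matrix(c( 5, 2,
                             10, 5,
                             20, 3), nrow = m, byrow = TRUE))

# Hidden state sequence: first-order Markov chain, started in state 1
states <- integer(n_t)
states[1] <- 1L
for (tp in 2:n_t) {
  states[tp] <- sample.int(m, size = 1, prob = gamma[states[tp - 1], ])
}

# Observations: each dependent variable is drawn independently given the state
obs <- sapply(emiss_distr, function(e) {
  rnorm(n_t, mean = e[states, 1], sd = e[states, 2])
})
dim(obs)   # 100 2
```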

Note: the (subject specific) initial state distributions (i.e., the probability of each of the states at the first time point) needed to simulate the data are obtained from the stationary distributions of the subject specific transition probability matrices gamma.
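For reference, the stationary distribution of a transition probability matrix can be computed in base R by solving a linear system. The sketch below uses a standard textbook approach and is not necessarily how the package computes it internally.

```r
gamma <- matrix(c(0.8, 0.1, 0.1,
                  0.2, 0.7, 0.1,
                  0.2, 0.2, 0.6), ncol = 3, byrow = TRUE)
m <- nrow(gamma)

# Solve delta %*% (I - gamma + U) = 1, where U is an m x m matrix of ones;
# adding the scalar 1 to a matrix adds it to every element
delta <- solve(t(diag(m) - gamma + 1), rep(1, m))
delta             # 0.5 0.3 0.2

# Stationarity check: multiplying by gamma returns delta again
delta %*% gamma
```

For this matrix the chain spends, in the long run, half of the time in state one, which is therefore also the most likely state at the first time point.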

beta: As the first element in each row of gamma is used as reference category in the Multinomial logistic regression, the first matrix in the list beta used to predict transition probability matrix gamma has a number of rows equal to m and a number of columns equal to m - 1. The first element in the first row corresponds to the probability of switching from state one to state two. The second element in the first row corresponds to the probability of switching from state one to state three, and so on. The last element in the first row corresponds to the probability of switching from state one to the last state. The same principle holds for the second matrix in the list beta used to predict categorical emission distribution(s) emiss_distr: the first element in the first row corresponds to the probability of observing category two in state one. The second element in the first row corresponds to the probability of observing category three in state one, and so on. The last element in the first row corresponds to the probability of observing the last category in state one.
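The layout of the first beta matrix can be illustrated with a small base R sketch. The helper predict_gamma is hypothetical and only assumes a standard Multinomial logistic regression with the first column of gamma as reference category; it ignores between-subject variance and is not the package's implementation.

```r
gamma <- matrix(c(0.8, 0.1, 0.1,
                  0.2, 0.7, 0.1,
                  0.2, 0.2, 0.6), ncol = 3, byrow = TRUE)

# m rows and m - 1 columns: element [i, j] belongs to the transition
# from state i to state j + 1
beta_gamma <- matrix(c( 0.5, 1.0,
                       -0.5, 0.5,
                        0.0, 1.0), ncol = 2, byrow = TRUE)

# Hypothetical helper: transition matrix implied for covariate value x
predict_gamma <- function(gamma, beta_gamma, x) {
  # Intercepts on the logit scale, first column as reference
  int  <- log(gamma[, -1, drop = FALSE] / gamma[, 1])
  expo <- cbind(1, exp(int + beta_gamma * x))
  expo / rowSums(expo)
}

predict_gamma(gamma, beta_gamma, x = 0)   # reproduces gamma
predict_gamma(gamma, beta_gamma, x = 1)   # rows shift according to beta_gamma
```

With the positive values in the first row of beta_gamma, a covariate value of 1 lowers the probability of staying in state one relative to a covariate value of 0.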

Value

The following components are returned by the function sim_mHMM:

states

A matrix containing the simulated hidden state sequences, with one row per hidden state per subject. The first column indicates subject id number. The second column contains the simulated hidden state sequence, consecutively for all subjects. Hence, the id number is repeated over the rows (with the number of repeats equal to the length of the simulated hidden state sequence T for each subject).

obs

A matrix containing the simulated observed outputs, with one row per simulated observation per subject. The first column indicates subject id number. The second column contains the simulated observation sequence, consecutively for all subjects. Hence, the id number is repeated over rows (with the number of repeats equal to the length of the simulated observation sequence T for each subject).

gamma

A list containing n elements with the simulated subject specific transition probability matrices gamma. Only returned if return_ind_par is set to TRUE.

emiss_distr

A list containing n elements with the simulated subject specific emission probability matrices emiss_distr. Only returned if return_ind_par is set to TRUE.
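Because states and obs stack all subjects in one matrix, it is often convenient to split them by subject id. The base R sketch below uses a small mock matrix that mimics the documented layout (id in the first column, sequence in the second); the column names are placeholders and not necessarily those returned by sim_mHMM.

```r
# Mock object with 3 subjects and 4 time points each
n   <- 3
n_t <- 4
states <- cbind(subj  = rep(1:n, each = n_t),
                state = c(1, 1, 2, 2,
                          3, 3, 3, 1,
                          2, 2, 2, 2))

# Split the state sequence by subject id (first column)
state_list <- split(states[, 2], states[, 1])
state_list[["2"]]    # 3 3 3 1

# Number of time points per subject
table(states[, 1])
```

The same pattern applies to obs, where for multivariate data the columns after the id column hold the dependent variables.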

See Also

mHMM for analyzing multilevel hidden Markov data.

Examples

# simulating data for 10 subjects, each with 100 categorical observations
n_t     <- 100
n       <- 10
m       <- 3
n_dep   <- 1
q_emiss <- 4
gamma   <- matrix(c(0.8, 0.1, 0.1,
                    0.2, 0.7, 0.1,
                    0.2, 0.2, 0.6), ncol = m, byrow = TRUE)
emiss_distr <- list(matrix(c(0.5, 0.5, 0.0, 0.0,
                             0.1, 0.1, 0.8, 0.0,
                             0.0, 0.0, 0.1, 0.9), nrow = m, ncol = q_emiss, byrow = TRUE))
data1 <- sim_mHMM(n_t = n_t, n = n, gen = list(m = m, n_dep = n_dep, q_emiss = q_emiss),
                  gamma = gamma, emiss_distr = emiss_distr, var_gamma = 1, var_emiss = 1)
head(data1$obs)
head(data1$states)

# including a covariate to predict (only) the transition probability matrix gamma
beta      <- rep(list(NULL), 2)
beta[[1]] <- matrix(c(0.5, 1.0,
                     -0.5, 0.5,
                      0.0, 1.0), byrow = TRUE, ncol = 2)
xx_vec      <- rep(list(NULL),2)
xx_vec[[1]] <-  c(rep(0,5), rep(1,5))
data2 <- sim_mHMM(n_t = n_t, n = n, gen = list(m = m, n_dep = n_dep, q_emiss = q_emiss),
                  gamma = gamma, emiss_distr = emiss_distr, beta = beta, xx_vec = xx_vec,
                  var_gamma = 1, var_emiss = 1)


# simulating subject specific transition probability matrices and emission distributions only
n_t <- 0
n <- 5
m <- 3
n_dep   <- 1
q_emiss <- 4
gamma <- matrix(c(0.8, 0.1, 0.1,
                  0.2, 0.7, 0.1,
                  0.2, 0.2, 0.6), ncol = m, byrow = TRUE)
emiss_distr <- list(matrix(c(0.5, 0.5, 0.0, 0.0,
                             0.1, 0.1, 0.8, 0.0,
                             0.0, 0.0, 0.1, 0.9), nrow = m, ncol = q_emiss, byrow = TRUE))
data3 <- sim_mHMM(n_t = n_t, n = n, gen = list(m = m, n_dep = n_dep, q_emiss = q_emiss),
                  gamma = gamma, emiss_distr = emiss_distr, var_gamma = 1, var_emiss = 1)
data3

data4 <- sim_mHMM(n_t = n_t, n = n, gen = list(m = m, n_dep = n_dep, q_emiss = q_emiss),
                  gamma = gamma, emiss_distr = emiss_distr, var_gamma = .5, var_emiss = .5)
data4

# simulating multivariate continuous data
n_t     <- 100
n       <- 10
m       <- 3
n_dep   <- 2

gamma   <- matrix(c(0.8, 0.1, 0.1,
                    0.2, 0.7, 0.1,
                    0.2, 0.2, 0.6), ncol = m, byrow = TRUE)

emiss_distr <- list(matrix(c( 50, 10,
                              100, 10,
                              150, 10), nrow = m, byrow = TRUE),
                    matrix(c(5, 2,
                             10, 5,
                             20, 3), nrow = m, byrow = TRUE))

data_cont <- sim_mHMM(n_t = n_t, n = n, data_distr = 'continuous',
                      gen = list(m = m, n_dep = n_dep),
                      gamma = gamma, emiss_distr = emiss_distr,
                      var_gamma = .5, var_emiss = c(.5, 0.01))

head(data_cont$states)
head(data_cont$obs)

mHMMbayes documentation built on Oct. 2, 2023, 5:06 p.m.