struct_em: Learn the structure and the parameters of a Gaussian mixture graphical model with incomplete data

Description
This function learns the structure and the parameters of a Gaussian mixture graphical model with incomplete data using the structural EM algorithm. At each iteration, the parametric EM algorithm is performed to complete the data and update the parameters (E step). The completed data are then used to update the structure (M step), and so on. Each iteration is guaranteed to increase the scoring function until convergence to a local maximum (Koller and Friedman, 2009). In practice, due to the sampling process inherent in particle-based inference, it may happen that the monotonic increase no longer occurs when approaching the local maximum, resulting in an earlier termination of the algorithm.
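The overall loop can be summarised with the following schematic R sketch. It is a conceptual illustration only, not the package's implementation: the e_step, m_step and score_fun arguments are hypothetical placeholders for the roles played in struct_em by parametric EM inference and structure learning (see param_em and struct_learn in the See Also section).

# Conceptual sketch of the structural EM loop, NOT the gmgm internals.
# e_step, m_step and score_fun are hypothetical placeholder functions
# supplied by the caller.
structural_em_sketch <- function(model, data, e_step, m_step, score_fun,
                                 max_iter_sem = 5) {
  score_old <- -Inf
  for (iter in seq_len(max_iter_sem)) {
    completed <- e_step(model, data)       # complete the data, update parameters
    model <- m_step(model, completed)      # update the structure
    score_new <- score_fun(model, completed)
    # Each iteration should increase the score; with particle-based
    # inference the increase may stop early near a local maximum.
    if (score_new <= score_old) {
      break
    }
    score_old <- score_new
  }
  model
}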
Usage

struct_em(
  gmgm,
  data,
  nodes = structure(gmgm)$nodes,
  arcs_cand = tibble(lag = 0),
  col_seq = NULL,
  score = "bic",
  n_part = 1000,
  max_part_sim = 1e+06,
  min_ess = 1,
  max_iter_sem = 5,
  max_iter_pem = 5,
  verbose = FALSE,
  ...
)
Arguments

gmgm
An object of class gmbn or gmdbn.
data
A data frame containing the data used for learning. Its columns must explicitly be named after the nodes of gmgm and the columns of col_seq.
nodes
A character vector containing the nodes whose local conditional models are learned (by default all the nodes of gmgm).
arcs_cand
A data frame containing the candidate arcs for addition or removal (by default all possible non-temporal arcs). The column from identifies the start node of each arc, the column to its end node, and the column lag the time lag between them (a small illustrative sketch follows the argument descriptions below).
col_seq
A character vector containing the column names of data that describe the observation sequence. If NULL (the default), all the observations belong to a single sequence.
score
A character string ("aic", "bic" or "loglik") corresponding to the scoring function.
n_part
A positive integer corresponding to the number of particles generated for each observation (if gmgm is a gmbn object) or observation sequence (if gmgm is a gmdbn object) during inference.
max_part_sim
An integer greater than or equal to n_part corresponding to the maximum number of particles simulated at the same time during inference (used to prevent memory overflow).
min_ess
A numeric value in [0, 1] corresponding to the minimum ESS (expressed as a proportion of n_part) under which the particles are resampled during inference.
max_iter_sem
A non-negative integer corresponding to the maximum number of iterations of the structural EM algorithm.
max_iter_pem
A non-negative integer corresponding to the maximum number of iterations of the parametric EM algorithm.
verbose
A logical value indicating whether iterations in progress are displayed.
...
Additional arguments passed on to the functions used internally for inference and learning (for example max_comp in the examples below).
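To make the nodes and arcs_cand arguments concrete, here is a small illustrative sketch (not taken from the package documentation). It reuses gmbn_1, data_1 and the max_comp argument exactly as defined in the Examples section below; the candidate arcs and the restriction to the FAT and WAIST nodes are purely hypothetical choices.

# Illustrative only: reuses gmbn_1 and data_1 from the Examples section.
# Candidate arcs are given as a data frame with columns from and to
# (and optionally lag for temporal arcs).
arcs_cand_small <- data.frame(from = c("AGE", "HEIGHT", "WEIGHT"),
                              to = c("FAT", "WAIST", "WAIST"))

# Learn only the local models of FAT and WAIST: the nodes argument
# restricts the structural search to these two nodes.
res_small <- struct_em(gmbn_1, data_1, nodes = c("FAT", "WAIST"),
                       arcs_cand = arcs_cand_small, verbose = TRUE,
                       max_comp = 3)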
Value

A list with elements:
gmgm
The final gmbn or gmdbn object.
data
A data frame (tibble) containing the complete data used to learn the final gmbn or gmdbn object.
seq_score
A numeric matrix containing the sequence of scores measured after the E and M steps of each iteration.
References

Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. The MIT Press.
See Also

param_em, param_learn, struct_learn
Examples

# Example 1: learn a Gaussian mixture Bayesian network (gmbn) from
# incomplete body measurement data.
set.seed(0)
data(data_body)
data_1 <- data_body
data_1$GENDER[sample.int(2148, 430)] <- NA
data_1$AGE[sample.int(2148, 430)] <- NA
data_1$HEIGHT[sample.int(2148, 430)] <- NA
data_1$WEIGHT[sample.int(2148, 430)] <- NA
data_1$FAT[sample.int(2148, 430)] <- NA
data_1$WAIST[sample.int(2148, 430)] <- NA
data_1$GLYCO[sample.int(2148, 430)] <- NA
gmbn_1 <- add_nodes(NULL,
                    c("AGE", "FAT", "GENDER", "GLYCO", "HEIGHT", "WAIST",
                      "WEIGHT"))
arcs_cand_1 <- data.frame(from = c("AGE", "GENDER", "HEIGHT", "WEIGHT", NA,
                                   "AGE", "GENDER", "AGE", "FAT", "GENDER",
                                   "HEIGHT", "WEIGHT", "AGE", "GENDER",
                                   "HEIGHT"),
                          to = c("FAT", "FAT", "FAT", "FAT", "GLYCO",
                                 "HEIGHT", "HEIGHT", "WAIST", "WAIST",
                                 "WAIST", "WAIST", "WAIST", "WEIGHT",
                                 "WEIGHT", "WEIGHT"))
res_learn_1 <- struct_em(gmbn_1, data_1, arcs_cand = arcs_cand_1,
                         verbose = TRUE, max_comp = 3)

# Example 2: learn a Gaussian mixture dynamic Bayesian network (gmdbn) from
# incomplete air quality data.
set.seed(0)
data(data_air)
data_2 <- data_air
data_2$NO2[sample.int(7680, 1536)] <- NA
data_2$O3[sample.int(7680, 1536)] <- NA
data_2$TEMP[sample.int(7680, 1536)] <- NA
data_2$WIND[sample.int(7680, 1536)] <- NA
gmdbn_1 <- gmdbn(b_2 = add_nodes(NULL, c("NO2", "O3", "TEMP", "WIND")),
                 b_13 = add_nodes(NULL, c("NO2", "O3", "TEMP", "WIND")))
arcs_cand_2 <- data.frame(from = c("NO2", "NO2", "NO2", "O3", "TEMP", "TEMP",
                                   "WIND", "WIND"),
                          to = c("NO2", "O3", "O3", "O3", NA, NA, NA, NA),
                          lag = c(1, 0, 1, 1, 0, 1, 0, 1))
res_learn_2 <- struct_em(gmdbn_1, data_2, arcs_cand = arcs_cand_2,
                         col_seq = "DATE", verbose = TRUE, max_comp = 3)
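The returned list can then be inspected as follows. This is a usage sketch building on res_learn_1 from the first example above and assumes only the elements documented in the Value section.

# Inspect the elements documented in the Value section, using res_learn_1
# from the first example above.
res_learn_1$gmgm        # final gmbn object
head(res_learn_1$data)  # completed data (tibble) used to learn the final model
res_learn_1$seq_score   # scores measured after the E and M steps of each iteration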