#' Maximum Likelihood Estimation (MLE) of a semi-Markov chain
#'
#' @description Maximum Likelihood Estimation of a semi-Markov chain starting
#' from one or several sequences. This estimation can be parametric or
#' non-parametric, non-censored, censored at the beginning and/or at the end
#' of the sequence, with one or several trajectories. Several parametric
#' distributions are considered (Uniform, Geometric, Poisson, Discrete
#' Weibull and Negative Binomial).
#'
#' @details This function estimates a semi-Markov model in parametric or
#' non-parametric case, taking into account the type of sojourn time and the
#' censoring described in references. The non-parametric estimation concerns
#' sojourn time distributions defined by the user. For the parametric
#' estimation, several discrete distributions are considered (see below).
#'
#' The difference between the Markov model and the semi-Markov model concerns
#' the modeling of the sojourn time. With a Markov chain, the sojourn time
#' distribution is modeled by a Geometric distribution (in discrete time).
#' With a semi-Markov chain, the sojourn time can be any arbitrary distribution.
#' In this package, the available distribution for a semi-Markov model are :
#' \itemize{
#' \item Uniform: \eqn{f(x) = \frac{1}{n}} for \eqn{1 \le x \le n}. \eqn{n} is the parameter;
#' \item Geometric: \eqn{f(x) = \theta (1-\theta)^{x - 1}} for \eqn{x = 1, 2,\ldots,n}, \eqn{0 < \theta < 1}, \eqn{\theta} is the probability of success.
#' \eqn{\theta} is the parameter;
#' \item Poisson: \eqn{f(x) = \frac{\lambda^x exp(-\lambda)}{x!}} for \eqn{x = 0, 1, 2,\ldots,n}, with \eqn{\lambda > 0}.
#' \eqn{\lambda} is the parameter;
#' \item Discrete Weibull of type 1: \eqn{f(x)=q^{(x-1)^{\beta}}-q^{x^{\beta}}}, \eqn{x = 1, 2,\ldots,n},
#' with \eqn{0 < q < 1}, the first parameter and \eqn{\beta > 0} the second parameter.
#' \eqn{(q, \beta)} are the parameters;
#' \item Negative binomial: \eqn{f(x)=\frac{\Gamma(x+\alpha)}{\Gamma(\alpha) x!} p^{\alpha} (1 - p)^x},
#' for \eqn{x = 0, 1, 2,\ldots,n}, \eqn{\Gamma} is the Gamma function,
#' \eqn{\alpha} is the parameter of overdispersion and \eqn{p} is the
#' probability of success, \eqn{0 < p < 1}. \eqn{(\alpha, p)} are the parameters;
#' \item Non-parametric.
#' }
#'
#'
#' We define :
#' \itemize{
#' \item the semi-Markov kernel \eqn{q_{ij}(k) = P( J_{m+1} = j, T_{m+1} - T_{m} = k | J_{m} = i )};
#' \item the transition matrix \eqn{(p_{trans}(i,j))_{i,j} \in states} of the
#' embedded Markov chain \eqn{J = (J_m)_m}, \eqn{p_{trans}(i,j) = P( J_{m+1} = j | J_m = i )};
#' \item the initial distribution \eqn{\mu_i = P(J_1 = i) = P(Z_1 = i)}, \eqn{i \in 1, 2, \dots, s};
#' \item the conditional sojourn time distributions \eqn{(f_{ij}(k))_{i,j} \in states,\ k \in N ,\ f_{ij}(k) = P(T_{m+1} - T_m = k | J_m = i, J_{m+1} = j )},
#' \eqn{f} is specified by the argument `param` in the parametric case
#' and by `distr` in the non-parametric case.
#' }
#'
#' The maximum likelihood estimation of the transition matrix of the embedded
#' Markov chain is \eqn{\widehat{p_{trans}}(i,j) = \frac{N_{ij}}{N_{i.}}}.
#'
#' Five methods are proposed for the estimation of the initial distribution :
#' \describe{
#' \item{Maximum Likelihood Estimator: }{The Maximum Likelihood Estimator
#' for the initial distribution. The formula is:
#' \eqn{\widehat{\mu_i} = \frac{Nstart_i}{L}}, where \eqn{Nstart_i} is
#' the number of occurences of the word \eqn{i} (of length \eqn{k}) at
#' the beginning of each sequence and \eqn{L} is the number of sequences.
#' This estimator is reliable when the number of sequences \eqn{L} is high.}
#' \item{Limit (stationary) distribution: }{The limit (stationary)
#' distribution of the semi-Markov chain is used as a surrogate of the
#' initial distribution.}
#' \item{Frequencies of each state: }{The initial distribution is replaced
#' by taking the frequencies of each state in the sequences.}
#' \item{Uniform distribution: }{The initial probability of each state is
#' equal to \eqn{1 / s}, with \eqn{s}, the number of states.}
#' }
#'
#' Note that \eqn{q_{ij}(k) = p_{trans}(i,j) \ f_{ij}(k)} in the general case
#' (depending on the present state and on the next state). For particular cases,
#' we replace \eqn{f_{ij}(k)} by \eqn{f_{i.}(k)} (depending on the present
#' state \eqn{i}), \eqn{f_{.j}(k)} (depending on the next state \eqn{j}) and
#' \eqn{f_{..}(k)} (depending neither on the present state nor on the next
#' state).
#'
#' In this package we can choose different types of sojourn time.
#' Four options are available for the sojourn times:
#' \itemize{
#' \item depending on the present state and on the next state (`fij`);
#' \item depending only on the present state (`fi`);
#' \item depending only on the next state (`fj`);
#' \item depending neither on the current, nor on the next state (`f`).
#' }
#'
#' If `type.sojourn = "fij"`, `distr` is a matrix of dimension \eqn{(s, s)}
#' (e.g., if the 1st element of the 2nd column is `"pois"`, that is to say we
#' go from the first state to the second state following a Poisson distribution).
#' If `type.sojourn = "fi"` or `"fj"`, `distr` must be a vector (e.g., if the
#' first element of vector is `"geom"`, that is to say we go from (or to) the
#' first state to (or from) any state according to a Geometric distribution).
#' If `type.sojourn = "f"`, `distr` must be one of `"unif"`, `"geom"`, `"pois"`,
#' `"dweibull"`, `"nbinom"` (e.g., if `distr` is equal to `"nbinom"`, that is
#' to say that the sojourn time when going from one state to another state
#' follows a Negative Binomial distribution).
#' For the non-parametric case, `distr` is equal to `"nonparametric"` whatever
#' type of sojourn time given.
#'
#' If the sequence is censored at the beginning and/or at the end, `cens.beg`
#' must be equal to `TRUE` and/or `cens.end` must be equal to `TRUE`.
#' All the sequences must be censored in the same way.
#'
#' @param sequences A list of vectors representing the sequences.
#' @param states Vector of state space (of length s).
#' @param type.sojourn Type of sojourn time (for more explanations, see Details).
#' @param distr By default `"nonparametric"` for the non-parametric estimation
#' case.
#'
#' If the parametric estimation case is desired, `distr` should be:
#' \itemize{
#' \item A matrix of distributions of dimension \eqn{(s, s)} if `type.sojourn = "fij"`;
#' \item A vector of distributions of length \eqn{s} if `type.sojourn = "fi"` or `"fj"`;
#' \item A distribution if `type.sojourn = "f"`.
#' }
#'
#' The distributions to be used in `distr` must be one of `"unif"`, `"geom"`,
#' `"pois"`, `"dweibull"`, `"nbinom"`.
#' @param init.estim Optional. `init.estim` gives the method used to estimate
#' the initial distribution. The following methods are proposed:
#' \itemize{
#' \item `init.estim = "mle"`: the classical Maximum Likelihood Estimator
#' is used to estimate the initial distribution `init`;
#' \item `init.estim = "limit"`: the initial distribution is replaced by
#' the limit (stationary) distribution of the semi-Markov chain;
#' \item `init.estim = "freq"`: the initial distribution is replaced by
#' the frequencies of each state in the sequences;
#' \item `init.estim = "unif"`: the initial probability of each state is
#' equal to \eqn{1 / s}, with \eqn{s} the number of states.
#' }
#' @param cens.beg Optional. A logical value indicating whether or not
#' sequences are censored at the beginning.
#' @param cens.end Optional. A logical value indicating whether or not
#' sequences are censored at the end.
#' @return Returns an object of S3 class `smmfit` (inheriting from the S3
#' class `smm` and [smmnonparametric] class if `distr = "nonparametric"`
#' or [smmparametric] otherwise). The S3 class `smmfit` contains:
#' \itemize{
#' \item All the attributes of the S3 class [smmnonparametric] or
#' [smmparametric] depending on the type of estimation;
#' \item An attribute `M` which is an integer giving the total length of
#' the set of sequences `sequences` (sum of all the lengths of the list
#' `sequences`);
#' \item An attribute `logLik` which is a numeric value giving the value
#' of the log-likelihood of the specified semi-Markov model based on the
#' `sequences`;
#' \item An attribute `sequences` which is equal to the parameter
#' `sequences` of the function `fitsmm` (i.e. a list of sequences used to
#' estimate the Markov model).
#' }
#'
#' @seealso [smmnonparametric], [smmparametric], [simulate.smm],
#' [simulate.smmfit], [plot.smm], [plot.smmfit]
#'
#' @references
#' V. S. Barbu, N. Limnios. (2008). Semi-Markov Chains and Hidden Semi-Markov
#' Models Toward Applications - Their Use in Reliability and DNA Analysis.
#' New York: Lecture Notes in Statistics, vol. 191, Springer.
#'
#' @export
#'
#' @examples
#' states <- c("a", "c", "g", "t")
#' s <- length(states)
#'
#' # Creation of the initial distribution
#' vect.init <- c(1 / 4, 1 / 4, 1 / 4, 1 / 4)
#'
#' # Creation of the transition matrix
#' pij <- matrix(c(0, 0.2, 0.5, 0.3,
#' 0.2, 0, 0.3, 0.5,
#' 0.3, 0.5, 0, 0.2,
#' 0.4, 0.2, 0.4, 0),
#' ncol = s, byrow = TRUE)
#'
#' # Creation of the distribution matrix
#'
#' distr.matrix <- matrix(c(NA, "pois", "geom", "nbinom",
#' "geom", NA, "pois", "dweibull",
#' "pois", "pois", NA, "geom",
#' "pois", "geom", "geom", NA),
#' nrow = s, ncol = s, byrow = TRUE)
#'
#' # Creation of an array containing the parameters
#' param1.matrix <- matrix(c(NA, 2, 0.4, 4,
#' 0.7, NA, 5, 0.6,
#' 2, 3, NA, 0.6,
#' 4, 0.3, 0.4, NA),
#' nrow = s, ncol = s, byrow = TRUE)
#'
#' param2.matrix <- matrix(c(NA, NA, NA, 0.6,
#' NA, NA, NA, 0.8,
#' NA, NA, NA, NA,
#' NA, NA, NA, NA),
#' nrow = s, ncol = s, byrow = TRUE)
#'
#' param.array <- array(c(param1.matrix, param2.matrix), c(s, s, 2))
#'
#' # Specify the semi-Markov model
#' semimarkov <- smmparametric(states = states, init = vect.init, ptrans = pij,
#' type.sojourn = "fij", distr = distr.matrix,
#' param = param.array)
#'
#' seqs <- simulate(object = semimarkov, nsim = c(1000, 10000, 2000), seed = 100)
#'
#' # Estimation of simulated sequences
#' est <- fitsmm(sequences = seqs, states = states, type.sojourn = "fij",
#' distr = distr.matrix)
#'
#' @testexamples
#' expect_equal(length(seqs), 3)
#' expect_true(all(sapply(seqs, length) >= c(1000, 10000, 2000)))
#' expect_equal(seqs[[1]][995:1000], c("t","t","t","t","c","t"))
#'
fitsmm <- function(sequences, states, type.sojourn = c("fij", "fi", "fj", "f"),
distr = "nonparametric", init.estim = "mle", cens.beg = FALSE, cens.end = FALSE) {
#############################
# Checking parameters sequences and states
#############################
if (!(is.list(sequences) & all(sapply(sequences, class) %in% c("character", "numeric")))) {
stop("The parameter 'sequences' should be a list of vectors")
}
if (!all(unique(unlist(sequences)) %in% states)) {
stop("Some states in the list of observed sequences 'sequences' are not in the state space 'states'")
}
#############################
# Checking parameter type.sojourn
#############################
type.sojourn <- match.arg(type.sojourn)
#############################
# Checking parameters distr
#############################
s <- length(states) # State space size
if (!all(is.vector(distr), length(distr) == 1, distr == "nonparametric")) {
distrib.vec <- c("unif", "geom", "pois", "dweibull", "nbinom", NA)
if (!all(distr %in% distrib.vec)) {
stop("The specified distributions must be either ", paste(distrib.vec, collapse = ", "),
".\n Incorrect distribution(s) found in 'distr': ", paste(as.character(distr)[!(distr %in% distrib.vec)], collapse = ", "))
}
if (type.sojourn == "fij") {
if (!(is.matrix(distr) & dim(distr)[1] == s & dim(distr)[2] == s)) {
stop("'distr' must be a matrix of dimension (s, s) since 'type.sojourn == \"fij\"'")
}
if (!(all(is.na(diag(distr))))) {
stop("All the diagonal elements of 'distr' must be equal to NA since transitions to the same state are not allowed")
}
}
if (type.sojourn == "fi" | type.sojourn == "fj") {
if (!(is.vector(distr) & length(distr) == s)) {
stop("'distr' must be a vector of length s since 'type.sojourn == \"fi\"' or 'type.sojourn == \"fj\"'")
}
if (anyNA(distr)) {
stop("'distr' cannot contain non specified distributions")
}
}
if (type.sojourn == "f") {
if (!(is.vector(distr) & length(distr) == 1)) {
stop("'distr' must be one distribution since 'type.sojourn == \"f\"'")
}
if (is.na(distr)) {
stop("'distr' must be either ", paste(distrib.vec[-length(distrib.vec)], collapse = ", "))
}
}
}
#############################
# Checking parameter init.estim
#############################
# init.estim <- match.arg(init.estim)
processes <- processesSemiMarkov(sequences = sequences, states = states, verbose = TRUE)
if (all(length(distr) == 1, distr == "nonparametric")) {
if (!(cens.beg)) {
if (processes$L > 1) {# If more than one sequence, use the estimation based on a couple Markov chain
smm <- .fit.nonparam.couplemarkovchain(processes = processes, states = states, type.sojourn = type.sojourn, init.estim = init.estim, cens.end = cens.end)
} else {
if (!cens.end) {
smm <- .fit.nonparam.nocensoring(processes = processes, type.sojourn = type.sojourn, init.estim = init.estim, cens.beg = cens.beg)
} else {
smm <- .fit.nonparam.couplemarkovchain(processes = processes, states = states, type.sojourn = type.sojourn, init.estim = init.estim, cens.end = cens.end)
}
}
} else {
if (!cens.end) {
stop("fitsmm not implemented in the case 'distr = \"nonparametric\"', 'cens.beg = TRUE', 'cens.end = FALSE'")
} else {
stop("fitsmm not implemented in the case 'distr = \"nonparametric\"', 'cens.beg = TRUE', 'cens.end = TRUE'")
}
}
} else {
smm <- .fit.param(processes = processes, states = states, type.sojourn = type.sojourn, distr = distr, init.estim = init.estim, cens.end = cens.end, cens.beg = cens.beg)
}
if (any(smm$init == 0)) {
message("The probabilities of the initial state(s) \"",
paste0(names(which(smm$init == 0)), collapse = "\", \""),
"\" are 0.")
}
logLik <- .logLik(x = smm, processes = processes)
estimate <- smmfit(smm = smm, M = processes$M, logLik = logLik, sequences = sequences)
return(estimate)
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.