In queelius/masked.data: Estimating Reliability of Components in Series from Masked Data with Relaxed Candidate Set Models

knitr::opts_chunk$set(echo = TRUE)

library(algebraic.mle)
library(dplyr)
library(ggplot2)
library(reshape2)
library(knitr)
library(md.tools)
library(md.series.system)
library(utils)
library(tidyverse)
library(gridExtra)

Introduction

In the real world, systems are quite complex:

They are not series systems.
The components are not independent.
The lifetimes of the systems (in the population) is not precisely modeled by any known probability distribution.
It may not be easy to characterize a system as either in a failed state or a non-failed state, and failure may only be transient.
The components may depend on many other unobserved factors.

With all of these caveats in mind, we model the data as coming from a series system as described previously, and other factors, like ambient temperature, are either negligible (on the distribution of component lifetimes) or are more or less constant and so we model the component lifetimes under those conditions. Then, the process of parametrically modeling the observed data takes the following form:

Visualize the data, e.g., plot a histogram of the data.
Guess which parametric distribution (for the components) might fit the observed data for the system lifetime.
Use a statistical test for goodness-of-fit.
Repeat steps 2 and 3 if the measure of goodness of fit is not satisfactory.

Steps 1 and 3 are trivial to do, but step 2 may be very difficult, particularly in our case since a histogram of the data is probably not that informative. Why? There are two reasons:

The distribution of the system is a function of the distribution of the components. The system distribution probably does not even have a name.
The histogram is of the system lifetime data, but the distributions we guess are for the components.

Simulation study

It's good to focus on a few key performance measures and explore the behavior of your estimator. Including the comparison of the estimated variance-covariance matrix to the "true" variance-covariance matrix is also a valuable addition. Here's a brief outline of how to implement these measures in your simulation study:

Bias, Variance, and MSE: For each simulated dataset, obtain the MLE estimates of the parameters, and compute the bias, variance, and MSE as follows:

a. Bias: Calculate the difference between the MLE estimates and the true parameter values for each dataset. Then compute the average difference across all datasets.

b. Variance: Compute the variance of the MLE estimates across all datasets.

c. MSE: Calculate the squared difference between the MLE estimates and the true parameter values for each dataset. Then compute the average squared difference across all datasets.

Confidence Interval Coverage: For each simulated dataset, construct confidence intervals for the parameters using the estimated variance-covariance matrix (e.g., using the Fisher Information Matrix, FIM). Calculate the proportion of datasets for which the true parameter values fall within the constructed confidence intervals.

Sensitivity Analysis by Model Specification: Analyze the robustness of your estimator by changing the true parameter values in your simulations. Observe how the performance measures (bias, variance, MSE, confidence interval coverage) are affected by these changes.

Visualizations: Create plots of the performance measures against factors such as sample size, censoring percentage, and true parameter values. This will help you understand how the estimator's performance changes under different conditions.

Implementing these performance measures and analyses in your simulation study will provide a comprehensive understanding of your MLE estimator's behavior and accuracy in estimating the parameters of lifetime components in a series system with competing risks. It will also help you identify potential limitations and sources of error, which can be valuable for future research and improvement.

# this is the actual code i ran to generate the data set for
# `exp_experiment_3.csv`. this was done before i generalized the approach.
# i'm keeping all of the code the same since this is actually what was run
# to generate the data set, but it should be the same as what i posted
# below in block `run-exp-experiment-3`

generate_data <- function(n, theta, tau, p) {
    m <- length(theta)
    comp_times <- matrix(nrow=n,ncol=m)
    for (j in 1:m)
        comp_times[,j] <- rexp(n,theta[j])
    comp_times <- md_encode_matrix(comp_times,"t")

    comp_times %>%
        md_series_lifetime_right_censoring(tau) %>%
        md_bernoulli_cand_C1_C2_C3(p) %>%
        md_cand_sampler()
}

exp_experiment_3_gen <- function(
    csv_filename,
    R = 1000,
    p = .333,
    q = .25,
    sample_sizes = c(
        5, 10, 15, 20, 25, 30, 35, 40,
        45, 50, 60, 70, 80, 90, 100, 250, 500),
    append = TRUE,
    use_aneal_start = TRUE) {

    set.seed(7231) # set seed for reproducibility
    theta <- c(1,  # component 1 failure rate
            1.1,   # 2
            0.975, # 3
            1.125, # 4
            1.1,   # 5
            1.0,   # 6
            1.05)  # 7

    m <- length(theta)
    tau <- -(1/sum(theta))*log(q)

    if (!append) {
        cnames <- c("R", "p", "tau", "N", paste0("bias",1:m), "mse",
            paste0("se",1:m), paste0("se_asym",1:m), "mse_asym", "mse_asym_hat",
            paste0("coverage",1:m))

        # Write column names first
        write.table(t(cnames), file = csv_filename,
            sep = ",", col.names = FALSE,
            row.names = FALSE, append = FALSE, quote = FALSE)
    }

    for (i in 1:length(sample_sizes)) {
        N <- sample_sizes[i]
        cat("Starting simulations for sample size", N, "\n")
        mles <- matrix(nrow = R, ncol = m)

        # For storing the lower and upper bounds of the confidence intervals
        CI_lwr <- matrix(nrow = R, ncol = m)
        CI_upr <- matrix(nrow = R, ncol = m)

        j <- 1L
        repeat {
            data <- generate_data(N, theta, tau, p)
            theta.hat <- custom_solver(
                data = data,
                theta = theta,
                extra_info = paste0("Replicate(", j, ")"),
                annealing = use_aneal_start)
            if (is.null(theta.hat)) {
                next
            }
            if (j %% 10 == 0) {
                cat("Sample size", N, " | Replicate ", j,
                    " | MLE ", point(theta.hat), "\n")
            }
            mles[j, ] <- point(theta.hat)

            CI <- confint(theta.hat)
            if (any(is.nan(CI)))
            {
                print("NaN in CI")
                print(summary(theta.hat))
            }
            CI_lwr[j, ] <- CI[,1]
            CI_upr[j, ] <- CI[,2]

            j <- j + 1L
            if (j > R) {
                break
            }
        }

        # compute asymptotics
        theta.mle <- NULL
        while (is.null(theta.mle)) {
            data <- generate_data(N, theta, tau, p)
            theta.mle <- custom_solver(
                data = data,
                theta = theta,
                extra_info = "asymptotics",
                annealing = use_aneal_start)
        }
        SE.asym <- se(theta.mle)
        MSE.asym <- mse(theta.hat, theta)
        MSE.asym.hat <- mse(theta.mle)

        # Calculate bias, MSE, and SE for each parameter
        bias <- colMeans(mles) - theta
        MSE <- sum(colMeans((mles - theta)^2))
        sigma <- cov(mles) * ((N - 1) / N) # MLE of variance-covariance matrix
        SE <- sqrt(diag(sigma))

        # Compute coverage probabilities
        coverage <- colMeans((CI_lwr <= theta) & (theta <= CI_upr))

        datum <- c(R, p, tau, N, bias, MSE, SE, SE.asym,
            MSE.asym, MSE.asym.hat, coverage)

        write.table(t(datum), file = csv_filename, sep = ",", col.names = FALSE,
            row.names = FALSE, append = TRUE)
    }
}

custom_solver <- function(data, theta, extra_info = NULL, annealing = TRUE) {
    ll <- md_loglike_exp_series_C1_C2_C3(data)
    ll.grad <- md_score_exp_series_C1_C2_C3(data)
    fish <- md_fim_exp_series_C1_C2_C3(data)
    ll.hess <- function(x) -fish(x)
    theta.hat <- NULL
    start <- NULL
    m <- length(theta)

    tryCatch({
        start <- list(par = theta)
        if (annealing) {
            start <- sim_anneal(par = theta, fn = ll, control =
                list(fnscale = -1, maxit = 100000L, trace = FALSE,
                    t_init = 20, 1e-3, alpha = 0.95,
                    it_per_temp = 50L,
                    neigh = function(par, temp, ...) {
                        tt <- min(temp, 1)
                        par + rnorm(m, 0, tt)
                    },
                    sup = function(x) all(x > 0)))
        }
        res_optim <- optim(
            par = start$par,
            fn = ll,
            gr = ll.grad,
            method = "L-BFGS-B",
            lower = 1e-30,
            hessian = FALSE,
            control = list(fnscale = -1, maxit = 1000L))
        theta.hat <- mle_numerical(res_optim,
            options = list(hessian = ll.hess(res_optim$par)))
    }, error = function(e) {
        cat("Sample size", nrow(data), " | Anneal:",
            start$par, " | ", e$message)
        if (!is.null(extra_info)) {
            cat(" | ", extra_info)
        }
        cat("\n")
    })
    theta.hat
}

exp_experiment_3_gen(
    csv_filename = "exp_experiment_3.csv",
    #R = 1000, p = .333, q = .25,
    append = FALSE,
    use_aneal_start = TRUE)

The purpose of this data set is to analyze the sensivity of the exponential series system (7 components, all roughly the same failure rate) with respect to a sample size.

Here is how we generated this data set (we do not evaluate this code, since we already generated the data set and saved it to a file):

source("exp_experiment_gen.R")
exp_experiment_gen(
    R = 1000,
    bernoulli_probs = .333,
    quants = .25,
    sample_sizes = c(5, 10, 15, 20, 25, 30, 35, 40,
        45, 50, 60, 70, 80, 90, 100, 250, 500, 1000, 2000),
    theta = c(1, 1.1, 0.975, 1.125, 1.1, 1.0, 1.05),
    seed = 7231,
    use_aneal_start = TRUE,
    append = FALSE,
    csv_filename = "exp_experiment_3.csv")

We read the data set from the file with:

df <- read.csv("exp_experiment_3.csv")

kable(df[1:5, 1:10])
kable(df[1:5, 11:20])
kable(df[1:5, 21:ncol(df)])

Coverage Probability

df_long <- df %>% pivot_longer(
    cols = starts_with("coverage"),
    names_to = "Coverage",
    values_to = "Value")

# Convert the "Coverage" column to a factor for better plotting
df_long$Coverage <- as.factor(df_long$Coverage)

# Plot, with a solid line for the asymptotic coverage probability
# for 0.95% confidence level
ggplot(df_long, aes(x = N, y = Value, color = Coverage)) +
  geom_point() +
  geom_line() +
  geom_hline(yintercept = 0.95, linetype = "dashed") +
  labs(x = "Sample Size (N)", y = "Coverage Probability",
    title = "Coverage Probability vs Sample Size", color = "Coverage")

Analysis

Analysis here.

Bias

############# BIAS #############
df_long <- df %>% pivot_longer(
    cols = starts_with("bias"),
    names_to = "Bias",
    values_to = "Value")

# Convert the "Bias" column to a factor for better plotting
df_long$Bias <- as.factor(df_long$Bias)

# Plot, with a solid blue line for the asymptotic zero bias
ggplot(df_long, aes(x = N, y = Value, color = Bias)) +
  geom_point() +
  geom_line() +
  geom_hline(yintercept = 0, color = "blue", linetype = "dashed") +
  labs(x = "Sample Size (N)", y = "Bias",
       title = "Bias vs Sample Size", color = "Bias")

Analysis

Analysis here.

Mean Squared Error

cutoff <- 70
df_small <- df %>% filter(N <= cutoff)
df_large <- df %>% filter(N > cutoff)

p1 <- ggplot(df_small) +
  geom_point(aes(x = N, y = mse)) +
  geom_line(aes(x = N, y = mse,
                color = "Simulation")) +
  geom_line(aes(x = N, y = mse_asym,
                color = "Asymptotic"),
            linetype = "dashed") +
  labs(x = "Small Sample Size", y = "Mean Squared Error",
       title = "Mean Squared Error vs Small Sample Size")

p2 <- ggplot(df_large) +
  geom_point(aes(x = N, y = mse)) +
  geom_line(aes(x = N, y = mse,
                color = "Simulation")) +
  geom_line(aes(x = N, y = mse_asym,
                color = "Asymptotic"),
            linetype = "dashed") +
  labs(x = "Large Sample Size", y = "Mean Squared Error",
       title = "Mean Squared Error vs Large Sample Size")
library(gridExtra)
grid.arrange(p1, p2, ncol = 2)

Analysis

Analysis here.

Standard Error

df_long <- df %>% pivot_longer(
    # regex to match column names "se<digit>"
    cols = matches("se[0-9]"),
    names_to = "SE",
    values_to = "Value")

# Convert the "SE" column to a factor for better plotting
df_long$SE <- as.factor(df_long$SE)

# Combine the two plots using facet
df_long$Group <- ifelse(df_long$N <= 150, "N < 150", "N >= 150")
ggplot(df_long, aes(x = N, y = Value, color = SE)) +
    geom_point() +
    geom_line() +
    facet_wrap(~ Group, scales = "free") +
    labs(x = "Sample Size (N)", y = "SE",
        title = "SE vs Sample Size", color = "SE")

Analysis

Analysis here.

Ratio of SE to Asymptotic SE

cutoff <- 70

df_ratio <- df %>% mutate(
    Ratio1 = se_asym1/se1,
    Ratio2 = se_asym2/se2,
    Ratio3 = se_asym3/se3,
    Ratio4 = se_asym4/se4,
    Ratio5 = se_asym5/se5,
    Ratio6 = se_asym6/se6,
    Ratio7 = se_asym7/se7) %>%
    select(matches("Ratio[0-9]"), N) %>%
    mutate(mean_ratio = rowMeans(select(., matches("Ratio[0-9]"))))
df_ratio_small <- df_ratio %>% filter(N <= cutoff)
df_ratio_large <- df_ratio %>% filter(N > cutoff)

p3 <- ggplot(df_ratio_small, aes(x = N, y = mean_ratio)) +
  geom_point() +
  geom_line() +
  geom_hline(yintercept = 1, linetype = "dashed") +
  ylim(0, 7) +
  labs(x = "Sample Size (N)", y = "Mean Ratio of Asymptotic to Simulated SE",
       title = "Mean Ratio of Asymptotic to Simulated SE vs Small Sample Size")

p4 <- ggplot(df_ratio_large, aes(x = N, y = mean_ratio)) +
  geom_point() +
  geom_line() +
  geom_hline(yintercept = 1, linetype = "dashed") +
  ylim(0, 7) +
  labs(x = "Sample Size (N)", y = "Mean Ratio of Asymptotic to Simulated SE",
       title = "Mean Ratio of Asymptotic to Simulated SE vs Large Sample Size")

df_ratio_long <- df_ratio %>%
    pivot_longer(
        cols = matches("Ratio[0-9]"),
        names_to = "Ratio",
        values_to = "Value"
    )


# Split df_ratio_long into two dataframes by sample size
df_ratio_long_small <- df_ratio_long %>% filter(N < cutoff)
df_ratio_long_large <- df_ratio_long %>% filter(N >= cutoff)

# Create the two plots
p1 <- ggplot(df_ratio_long_small, aes(x = N, y = Value, color = Ratio)) +
  geom_point() +
  geom_line() +
  labs(x = "Small Sample Size",
       y = "Ratio of Asymptotic to Simulated SE",
       title = "Ratio of Asymptotic to Monte Carlo SE vs Small Samples",
       color = "Ratio")

p2 <- ggplot(df_ratio_long_large, aes(x = N, y = Value, color = Ratio)) +
  geom_point() +
  geom_line() +
  labs(x = "Large Sample Size",
       y = "Ratio of Asymptotic to Simulated SE",
       title = "Ratio of Asymptotic to Monte Carlo SE vs Large Samples",
       color = "Ratio")

# Arrange the plots side by side
grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2)

Analysis

Analysis here.

queelius/masked.data documentation built on Jan. 28, 2025, 4:23 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

queelius/masked.data
Estimating Reliability of Components in Series from Masked Data with Relaxed Candidate Set Models

In queelius/masked.data: Estimating Reliability of Components in Series from Masked Data with Relaxed Candidate Set Models

Introduction

Simulation study

Coverage Probability

Analysis

Bias

Analysis

Mean Squared Error

Analysis

Standard Error

Analysis

Ratio of SE to Asymptotic SE

Analysis

R Package Documentation

Browse R Packages

We want your feedback!

queelius/masked.data Estimating Reliability of Components in Series from Masked Data with Relaxed Candidate Set Models

In queelius/masked.data: Estimating Reliability of Components in Series from Masked Data with Relaxed Candidate Set Models

Introduction

Simulation study

Coverage Probability

Analysis

Bias

Analysis

Mean Squared Error

Analysis

Standard Error

Analysis

Ratio of SE to Asymptotic SE

Analysis

R Package Documentation

Browse R Packages

We want your feedback!

queelius/masked.data
Estimating Reliability of Components in Series from Masked Data with Relaxed Candidate Set Models