knitr::opts_chunk$set(echo = TRUE)
library(algebraic.mle) library(dplyr) library(ggplot2) library(reshape2) library(knitr) library(md.tools) library(md.series.system) library(utils) library(tidyverse) library(gridExtra)
In the real world, systems are quite complex:
They are not series systems.
The components are not independent.
The lifetimes of the systems (in the population) is not precisely modeled by any known probability distribution.
It may not be easy to characterize a system as either in a failed state or a non-failed state, and failure may only be transient.
The components may depend on many other unobserved factors.
With all of these caveats in mind, we model the data as coming from a series system as described previously, and other factors, like ambient temperature, are either negligible (on the distribution of component lifetimes) or are more or less constant and so we model the component lifetimes under those conditions. Then, the process of parametrically modeling the observed data takes the following form:
Visualize the data, e.g., plot a histogram of the data.
Guess which parametric distribution (for the components) might fit the observed data for the system lifetime.
Use a statistical test for goodness-of-fit.
Repeat steps 2 and 3 if the measure of goodness of fit is not satisfactory.
Steps 1 and 3 are trivial to do, but step 2 may be very difficult, particularly in our case since a histogram of the data is probably not that informative. Why? There are two reasons:
The distribution of the system is a function of the distribution of the components. The system distribution probably does not even have a name.
The histogram is of the system lifetime data, but the distributions we guess are for the components.
It's good to focus on a few key performance measures and explore the behavior of your estimator. Including the comparison of the estimated variance-covariance matrix to the "true" variance-covariance matrix is also a valuable addition. Here's a brief outline of how to implement these measures in your simulation study:
Bias, Variance, and MSE: For each simulated dataset, obtain the MLE estimates of the parameters, and compute the bias, variance, and MSE as follows:
a. Bias: Calculate the difference between the MLE estimates and the true parameter values for each dataset. Then compute the average difference across all datasets.
b. Variance: Compute the variance of the MLE estimates across all datasets.
c. MSE: Calculate the squared difference between the MLE estimates and the true parameter values for each dataset. Then compute the average squared difference across all datasets.
Confidence Interval Coverage: For each simulated dataset, construct confidence intervals for the parameters using the estimated variance-covariance matrix (e.g., using the Fisher Information Matrix, FIM). Calculate the proportion of datasets for which the true parameter values fall within the constructed confidence intervals.
Sensitivity Analysis by Model Specification: Analyze the robustness of your estimator by changing the true parameter values in your simulations. Observe how the performance measures (bias, variance, MSE, confidence interval coverage) are affected by these changes.
Visualizations: Create plots of the performance measures against factors such as sample size, censoring percentage, and true parameter values. This will help you understand how the estimator's performance changes under different conditions.
Implementing these performance measures and analyses in your simulation study will provide a comprehensive understanding of your MLE estimator's behavior and accuracy in estimating the parameters of lifetime components in a series system with competing risks. It will also help you identify potential limitations and sources of error, which can be valuable for future research and improvement.
# this is the actual code i ran to generate the data set for # `exp_experiment_3.csv`. this was done before i generalized the approach. # i'm keeping all of the code the same since this is actually what was run # to generate the data set, but it should be the same as what i posted # below in block `run-exp-experiment-3` generate_data <- function(n, theta, tau, p) { m <- length(theta) comp_times <- matrix(nrow=n,ncol=m) for (j in 1:m) comp_times[,j] <- rexp(n,theta[j]) comp_times <- md_encode_matrix(comp_times,"t") comp_times %>% md_series_lifetime_right_censoring(tau) %>% md_bernoulli_cand_C1_C2_C3(p) %>% md_cand_sampler() } exp_experiment_3_gen <- function( csv_filename, R = 1000, p = .333, q = .25, sample_sizes = c( 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 250, 500), append = TRUE, use_aneal_start = TRUE) { set.seed(7231) # set seed for reproducibility theta <- c(1, # component 1 failure rate 1.1, # 2 0.975, # 3 1.125, # 4 1.1, # 5 1.0, # 6 1.05) # 7 m <- length(theta) tau <- -(1/sum(theta))*log(q) if (!append) { cnames <- c("R", "p", "tau", "N", paste0("bias",1:m), "mse", paste0("se",1:m), paste0("se_asym",1:m), "mse_asym", "mse_asym_hat", paste0("coverage",1:m)) # Write column names first write.table(t(cnames), file = csv_filename, sep = ",", col.names = FALSE, row.names = FALSE, append = FALSE, quote = FALSE) } for (i in 1:length(sample_sizes)) { N <- sample_sizes[i] cat("Starting simulations for sample size", N, "\n") mles <- matrix(nrow = R, ncol = m) # For storing the lower and upper bounds of the confidence intervals CI_lwr <- matrix(nrow = R, ncol = m) CI_upr <- matrix(nrow = R, ncol = m) j <- 1L repeat { data <- generate_data(N, theta, tau, p) theta.hat <- custom_solver( data = data, theta = theta, extra_info = paste0("Replicate(", j, ")"), annealing = use_aneal_start) if (is.null(theta.hat)) { next } if (j %% 10 == 0) { cat("Sample size", N, " | Replicate ", j, " | MLE ", point(theta.hat), "\n") } mles[j, ] <- point(theta.hat) CI <- confint(theta.hat) if (any(is.nan(CI))) { print("NaN in CI") print(summary(theta.hat)) } CI_lwr[j, ] <- CI[,1] CI_upr[j, ] <- CI[,2] j <- j + 1L if (j > R) { break } } # compute asymptotics theta.mle <- NULL while (is.null(theta.mle)) { data <- generate_data(N, theta, tau, p) theta.mle <- custom_solver( data = data, theta = theta, extra_info = "asymptotics", annealing = use_aneal_start) } SE.asym <- se(theta.mle) MSE.asym <- mse(theta.hat, theta) MSE.asym.hat <- mse(theta.mle) # Calculate bias, MSE, and SE for each parameter bias <- colMeans(mles) - theta MSE <- sum(colMeans((mles - theta)^2)) sigma <- cov(mles) * ((N - 1) / N) # MLE of variance-covariance matrix SE <- sqrt(diag(sigma)) # Compute coverage probabilities coverage <- colMeans((CI_lwr <= theta) & (theta <= CI_upr)) datum <- c(R, p, tau, N, bias, MSE, SE, SE.asym, MSE.asym, MSE.asym.hat, coverage) write.table(t(datum), file = csv_filename, sep = ",", col.names = FALSE, row.names = FALSE, append = TRUE) } } custom_solver <- function(data, theta, extra_info = NULL, annealing = TRUE) { ll <- md_loglike_exp_series_C1_C2_C3(data) ll.grad <- md_score_exp_series_C1_C2_C3(data) fish <- md_fim_exp_series_C1_C2_C3(data) ll.hess <- function(x) -fish(x) theta.hat <- NULL start <- NULL m <- length(theta) tryCatch({ start <- list(par = theta) if (annealing) { start <- sim_anneal(par = theta, fn = ll, control = list(fnscale = -1, maxit = 100000L, trace = FALSE, t_init = 20, 1e-3, alpha = 0.95, it_per_temp = 50L, neigh = function(par, temp, ...) { tt <- min(temp, 1) par + rnorm(m, 0, tt) }, sup = function(x) all(x > 0))) } res_optim <- optim( par = start$par, fn = ll, gr = ll.grad, method = "L-BFGS-B", lower = 1e-30, hessian = FALSE, control = list(fnscale = -1, maxit = 1000L)) theta.hat <- mle_numerical(res_optim, options = list(hessian = ll.hess(res_optim$par))) }, error = function(e) { cat("Sample size", nrow(data), " | Anneal:", start$par, " | ", e$message) if (!is.null(extra_info)) { cat(" | ", extra_info) } cat("\n") }) theta.hat } exp_experiment_3_gen( csv_filename = "exp_experiment_3.csv", #R = 1000, p = .333, q = .25, append = FALSE, use_aneal_start = TRUE)
The purpose of this data set is to analyze the sensivity of the exponential series system (7 components, all roughly the same failure rate) with respect to a sample size.
Here is how we generated this data set (we do not evaluate this code, since we already generated the data set and saved it to a file):
source("exp_experiment_gen.R") exp_experiment_gen( R = 1000, bernoulli_probs = .333, quants = .25, sample_sizes = c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 250, 500, 1000, 2000), theta = c(1, 1.1, 0.975, 1.125, 1.1, 1.0, 1.05), seed = 7231, use_aneal_start = TRUE, append = FALSE, csv_filename = "exp_experiment_3.csv")
We read the data set from the file with:
df <- read.csv("exp_experiment_3.csv") kable(df[1:5, 1:10]) kable(df[1:5, 11:20]) kable(df[1:5, 21:ncol(df)])
df_long <- df %>% pivot_longer( cols = starts_with("coverage"), names_to = "Coverage", values_to = "Value") # Convert the "Coverage" column to a factor for better plotting df_long$Coverage <- as.factor(df_long$Coverage) # Plot, with a solid line for the asymptotic coverage probability # for 0.95% confidence level ggplot(df_long, aes(x = N, y = Value, color = Coverage)) + geom_point() + geom_line() + geom_hline(yintercept = 0.95, linetype = "dashed") + labs(x = "Sample Size (N)", y = "Coverage Probability", title = "Coverage Probability vs Sample Size", color = "Coverage")
Analysis here.
############# BIAS ############# df_long <- df %>% pivot_longer( cols = starts_with("bias"), names_to = "Bias", values_to = "Value") # Convert the "Bias" column to a factor for better plotting df_long$Bias <- as.factor(df_long$Bias) # Plot, with a solid blue line for the asymptotic zero bias ggplot(df_long, aes(x = N, y = Value, color = Bias)) + geom_point() + geom_line() + geom_hline(yintercept = 0, color = "blue", linetype = "dashed") + labs(x = "Sample Size (N)", y = "Bias", title = "Bias vs Sample Size", color = "Bias")
Analysis here.
cutoff <- 70 df_small <- df %>% filter(N <= cutoff) df_large <- df %>% filter(N > cutoff) p1 <- ggplot(df_small) + geom_point(aes(x = N, y = mse)) + geom_line(aes(x = N, y = mse, color = "Simulation")) + geom_line(aes(x = N, y = mse_asym, color = "Asymptotic"), linetype = "dashed") + labs(x = "Small Sample Size", y = "Mean Squared Error", title = "Mean Squared Error vs Small Sample Size") p2 <- ggplot(df_large) + geom_point(aes(x = N, y = mse)) + geom_line(aes(x = N, y = mse, color = "Simulation")) + geom_line(aes(x = N, y = mse_asym, color = "Asymptotic"), linetype = "dashed") + labs(x = "Large Sample Size", y = "Mean Squared Error", title = "Mean Squared Error vs Large Sample Size") library(gridExtra) grid.arrange(p1, p2, ncol = 2)
Analysis here.
df_long <- df %>% pivot_longer( # regex to match column names "se<digit>" cols = matches("se[0-9]"), names_to = "SE", values_to = "Value") # Convert the "SE" column to a factor for better plotting df_long$SE <- as.factor(df_long$SE) # Combine the two plots using facet df_long$Group <- ifelse(df_long$N <= 150, "N < 150", "N >= 150") ggplot(df_long, aes(x = N, y = Value, color = SE)) + geom_point() + geom_line() + facet_wrap(~ Group, scales = "free") + labs(x = "Sample Size (N)", y = "SE", title = "SE vs Sample Size", color = "SE")
Analysis here.
cutoff <- 70 df_ratio <- df %>% mutate( Ratio1 = se_asym1/se1, Ratio2 = se_asym2/se2, Ratio3 = se_asym3/se3, Ratio4 = se_asym4/se4, Ratio5 = se_asym5/se5, Ratio6 = se_asym6/se6, Ratio7 = se_asym7/se7) %>% select(matches("Ratio[0-9]"), N) %>% mutate(mean_ratio = rowMeans(select(., matches("Ratio[0-9]")))) df_ratio_small <- df_ratio %>% filter(N <= cutoff) df_ratio_large <- df_ratio %>% filter(N > cutoff) p3 <- ggplot(df_ratio_small, aes(x = N, y = mean_ratio)) + geom_point() + geom_line() + geom_hline(yintercept = 1, linetype = "dashed") + ylim(0, 7) + labs(x = "Sample Size (N)", y = "Mean Ratio of Asymptotic to Simulated SE", title = "Mean Ratio of Asymptotic to Simulated SE vs Small Sample Size") p4 <- ggplot(df_ratio_large, aes(x = N, y = mean_ratio)) + geom_point() + geom_line() + geom_hline(yintercept = 1, linetype = "dashed") + ylim(0, 7) + labs(x = "Sample Size (N)", y = "Mean Ratio of Asymptotic to Simulated SE", title = "Mean Ratio of Asymptotic to Simulated SE vs Large Sample Size") df_ratio_long <- df_ratio %>% pivot_longer( cols = matches("Ratio[0-9]"), names_to = "Ratio", values_to = "Value" ) # Split df_ratio_long into two dataframes by sample size df_ratio_long_small <- df_ratio_long %>% filter(N < cutoff) df_ratio_long_large <- df_ratio_long %>% filter(N >= cutoff) # Create the two plots p1 <- ggplot(df_ratio_long_small, aes(x = N, y = Value, color = Ratio)) + geom_point() + geom_line() + labs(x = "Small Sample Size", y = "Ratio of Asymptotic to Simulated SE", title = "Ratio of Asymptotic to Monte Carlo SE vs Small Samples", color = "Ratio") p2 <- ggplot(df_ratio_long_large, aes(x = N, y = Value, color = Ratio)) + geom_point() + geom_line() + labs(x = "Large Sample Size", y = "Ratio of Asymptotic to Simulated SE", title = "Ratio of Asymptotic to Monte Carlo SE vs Large Samples", color = "Ratio") # Arrange the plots side by side grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2)
Analysis here.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.