```r
knitr::opts_chunk$set(
  collapse = TRUE,
  screenshot.force = FALSE,
  comment = "#>"
)

library(weibulltools)
```

This document presents non-parametric methods for estimating the failure probabilities of complete data (failures) while taking (multiple) right-censored units into account. A unit can be a single component, an assembly, or an entire system.

Further, the estimation results are presented in distribution-specific probability plots.

If the lifetime (or any other damage-equivalent quantity such as distance or
load cycles) of a unit is considered to be a continuous random variable *T*,
then the probability that a unit has failed by time *t* is given by its
*CDF (cumulative distribution function)* *F(t)*.

$$ P(T\leq t) = F(t) $$
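For instance, if *T* is assumed to follow a Weibull distribution with purely illustrative parameter values (not taken from any dataset in this document), this probability can be computed directly in R:

```r
# Probability that a unit has failed by t = 100 (hypothetical Weibull parameters):
pweibull(q = 100, shape = 1.5, scale = 200)
```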

In order to obtain an estimate of the *CDF* for each observation $t_1, t_2, ..., t_n$,
two approaches are possible. Using a parametric lifetime distribution requires that
the underlying assumptions hold for the sample data. If the distribution-specific
assumptions are correct, the model parameters can be estimated and the *CDF* can be
computed. But if the assumptions do not hold, interpretations and derived conclusions
are not reliable.

A more general approach for the calculation of cumulative failure probabilities is
to use non-parametric statistical estimators $\hat{F}(t_1), \hat{F}(t_2), ..., \hat{F}(t_n)$.
In contrast to a parametric distribution, no general distributional assumptions are required.
For non-parametric estimators, an ordered sample of size $n$ is needed. Starting
at $1$, the ranks $i \in \{1, 2, ..., n\}$ are assigned to the sample values sorted
in ascending order. Since there is a known relationship between ranks and their
corresponding ranking probabilities, a *CDF* can be determined.

But rank distributions are systematically skewed, and thus the median
value, instead of the expected value $E\left[F\left(t_i\right)\right] = \frac{i}{n + 1}$,
is used for the estimation [^note1]. This skewness is visualized in *Figure 1*.

[^note1]: Kapur, K. C.; Lamberson, L. R.: *Reliability in Engineering Design*,
*New York: Wiley*, 1977, pp. 297-301

```r
library(dplyr)   # data manipulation
library(ggplot2) # visualization

x <- seq(0, 1, length.out = 100) # CDF values
n <- 10                          # sample size
i <- c(1, 3, 5, 7, 9)            # ranks
r <- n - i + 1                   # inverse ranking

df_dens <- expand.grid(cdf = x, i = i) %>%
  mutate(n = n, r = n - i + 1, pdf = dbeta(x = cdf, shape1 = i, shape2 = r))

densplot <- ggplot(data = df_dens, aes(x = cdf, y = pdf, colour = as.factor(i))) +
  geom_line() +
  scale_colour_discrete(guide = guide_legend(title = "i")) +
  theme_bw() +
  labs(x = "Failure Probability", y = "Density")

densplot
```

In practice, a simplification for the calculation of the median value, also called
median rank, is made. The formula of *Benard's* approximation is given by

$$\hat{F}(t_i) \approx \frac{i - 0.3}{n + 0.4} $$

and is described in *The Plotting of Observations on Probability Paper* [^note2].

[^note2]: Benard, A.; Bos-Levenbach, E. C.: *The Plotting of Observations on Probability Paper*,
*Statistica Neerlandica 7 (3)*, 1953, pp. 163-173

However, this equation only provides valid estimates of the failure probabilities if
all units in the sample are defective (`estimate_cdf(methods = "mr", ...)`).
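As a short illustration (not part of the original analysis; the sample size is chosen arbitrarily), Benard's approximation can be compared to the exact median ranks, i.e. the medians of the rank distributions $Beta(i, n - i + 1)$ shown in *Figure 1*:

```r
# Benard's approximation versus exact median ranks for an assumed sample size:
n <- 10                                               # sample size
i <- seq_len(n)                                       # ranks of the sorted failures
benard <- (i - 0.3) / (n + 0.4)                       # Benard's approximation
exact  <- qbeta(0.5, shape1 = i, shape2 = n - i + 1)  # exact median ranks
round(cbind(i, benard, exact), 4)
```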

In field data analysis, however, the sample mainly consists of intact units, and
only a small fraction of the units has failed. Units that show no damage at the time
of the analysis and have not yet reached the operating time or mileage of units that
have already failed are potential candidates for future failures. Since these units
may still fail during a specific time span, such as the guarantee period, the *CDF*
must be adjusted upwards to account for these potential candidates.

A commonly used method for correcting the probabilities of (multiple) right-censored
data is *Johnson's* method (`estimate_cdf(methods = "johnson", ...)`). With this method,
all units that are included in the period under consideration are sorted in ascending order
of their operating time (or any other damage-equivalent quantity). If there are units
that have not failed before the *i*-th failure, an adjusted rank for the *i*-th failure
is formed. This correction takes the potential candidates into account and increases
the rank number. In consequence, a higher rank leads to a higher failure probability.
This can be seen in *Figure 1*.

The rank adjustment is determined with:

$$j_i = j_{i-1} + x_i \cdot I_i, \;\; \text{with} \;\; j_0 = 0$$

Here, $j_{i-1}$ is the adjusted rank of the previous failure, $x_i$ is the number of defectives at $t_i$ and $I_i$ is the increment that corrects the considered rank by the potential candidates.

$$I_i=\frac{(n+1)-j_{i-1}}{1+(n-n_i)}$$

The sample size is $n$ and $n_i$ is the number of units that have a lower $t$ than
the *i*-th unit. Once the adjusted ranks are calculated, the *CDF* can be estimated
according to *Benard's* approximation.
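The following sketch applies these two formulas step by step to a small, purely hypothetical sample. It is only meant to make the recursion tangible; in practice `estimate_cdf(methods = "johnson", ...)` performs this computation for you.

```r
# Johnson's rank adjustment on hypothetical data (1 = failure, 0 = right-censored):
t_obs  <- c(30, 50, 60, 80, 100)   # sorted lifetime characteristic
status <- c(1, 0, 1, 0, 1)
n <- length(t_obs)

j_prev <- 0                        # j_0 = 0
f_hat  <- rep(NA_real_, n)

for (k in seq_len(n)) {
  if (status[k] == 1) {
    n_k <- k - 1                                 # units with a lower t than the k-th unit
    I_k <- ((n + 1) - j_prev) / (1 + (n - n_k))  # increment I_i
    j_k <- j_prev + 1 * I_k                      # adjusted rank (x_i = 1 defective at t_i)
    f_hat[k] <- (j_k - 0.3) / (n + 0.4)          # Benard's approximation with adjusted rank
    j_prev <- j_k
  }
}

data.frame(t_obs, status, f_hat)
```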

Other methods in `weibulltools` that can also handle (multiple) right-censored data
are the *Kaplan-Meier* estimator (`estimate_cdf(methods = "kaplan", ...)`) and the
*Nelson-Aalen* estimator (`estimate_cdf(methods = "nelson", ...)`).
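Both estimators are requested through the same interface. Below is a minimal sketch with hypothetical toy data; the `shock` data introduced later in this document works in exactly the same way.

```r
# Hypothetical toy data; status = 0 marks right-censored units:
toy_data <- data.frame(
  time   = c(30, 50, 60, 80, 100),
  status = c(1, 0, 1, 0, 1)
)

toy_tbl <- reliability_data(data = toy_data, x = time, status = status)

estimate_cdf(toy_tbl, methods = c("kaplan", "nelson"))
```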

After computing the failure probabilities, a method called *probability plotting* can be
applied. It is a graphical *goodness-of-fit* technique used to assess whether an assumed
distribution is appropriate for modeling the sample data.

The axes of a probability plot are transformed in such a way that the *CDF* of a
specified model is represented by a straight line. If the plotted points
(`plot_prob()`) lie on an approximately straight line, it can be concluded that the
chosen distribution is adequate.

The two-parameter Weibull distribution can be parameterized with parameters $\mu$
and $\sigma$ such that the *CDF* is characterized by the following equation:

$$F(t)=\Phi_{SEV}\left(\frac{\log(t) - \mu}{\sigma}\right)$$

The advantage of this representation is that the Weibull is part of the
(log-)location-scale family. A linearized representation of this *CDF* is:

$$\Phi^{-1}_{SEV}\left[F(t)\right]=\frac{1}{\sigma} \cdot \log(t) - \frac{\mu}{\sigma}$$

This leads to the following transformations regarding the axes:

- Abscissa: $x = \log(t)$
- Ordinate: $y = \Phi^{-1}_{SEV}\left[F(t)\right]$, which is the quantile function of the SEV (*smallest extreme value*) distribution and can be written as $\log\left\{-\log\left[1-F(t)\right]\right\}$.
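A quick numeric sketch (parameter values are assumed purely for illustration) confirms that the transformed points fall on a straight line with slope $\frac{1}{\sigma}$ and intercept $-\frac{\mu}{\sigma}$:

```r
# Linearization check with assumed parameters mu and sigma:
mu    <- 5.3
sigma <- 0.67

t_seq <- seq(50, 500, length.out = 25)
F_t   <- 1 - exp(-exp((log(t_seq) - mu) / sigma))  # Weibull CDF via the SEV distribution

x <- log(t_seq)           # abscissa
y <- log(-log(1 - F_t))   # ordinate

coef(lm(y ~ x))           # slope ~ 1 / sigma, intercept ~ -mu / sigma
```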

Another version of the Weibull *CDF* with parameters $\eta$ and $\beta$ results in
a *CDF* that is defined by the following equation:

$$F(t)=1-\exp\left[ -\left(\frac{t}{\eta}\right)^{\beta}\right]$$

Then a linearized version of the CDF is:

$$ \log\left\{-\log\left[1-F(t)\right]\right\} = \beta \cdot \log(t) - \beta \cdot \log(\eta)$$

Transformations regarding the axes are:

- Abscissa: $x = \log(t)$
- Ordinate: $y = \log\left\{-\log\left[1-F(t)\right]\right\}$.

It can be easily seen that the parameters can be converted into each other. The corresponding equations are:

$$\beta = \frac{1}{\sigma}$$

and

$$\eta = \exp\left(\mu\right).$$
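A short numeric check (with assumed parameter values) shows that both parameterizations indeed describe the same *CDF*:

```r
# Conversion check with assumed parameters eta and beta:
eta  <- 200
beta <- 1.5

mu    <- log(eta)
sigma <- 1 / beta

t_seq <- c(50, 100, 200, 400)
cdf_eta_beta <- 1 - exp(-(t_seq / eta)^beta)              # eta/beta parameterization
cdf_mu_sigma <- 1 - exp(-exp((log(t_seq) - mu) / sigma))  # mu/sigma (SEV) parameterization

all.equal(cdf_eta_beta, cdf_mu_sigma)
```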

To apply the introduced methods of non-parametric failure probability estimation
and probability plotting, the `shock` dataset is used. In this dataset, kilometer-dependent
problems that have occurred on shock absorbers are reported. In addition to failed
items, the dataset also contains non-defective (*censored*) observations. The data
can be found in *Statistical Methods for Reliability Data* [^note3].

[^note3]: Meeker, W. Q.; Escobar, L. A.: *Statistical Methods for Reliability Data*,
*New York, Wiley series in probability and statistics*, 1998, p. 630

For consistent handling of the data, `weibulltools` introduces the function
`reliability_data()`, which converts the original dataset into a `wt_reliability_data`
object. This formatted object makes it easy to apply the presented methods.

```r
shock_tbl <- reliability_data(data = shock, x = distance, status = status)

shock_tbl
```

First, we are interested in how censored observations influence the estimation of
failure probabilities in comparison to the case where only failed units are considered.
To deal with both survived and failed units, we use the `weibulltools` function
`estimate_cdf()` with `methods = "johnson"`, whereas `methods = "mr"` only considers failures.

```r
# Estimate CDF with both methods:
cdf_tbl <- estimate_cdf(shock_tbl, methods = c("mr", "johnson"))

# First case where only failed units are taken into account:
cdf_tbl_mr <- cdf_tbl %>%
  filter(cdf_estimation_method == "mr")

cdf_tbl_mr

# Second case where both survived and failed units are considered:
cdf_tbl_john <- cdf_tbl %>%
  filter(cdf_estimation_method == "johnson")

cdf_tbl_john
```

If we compare both outputs, we can see that the survivors reduce the probabilities.
This is just what was expected, since undamaged units with an equal or longer
lifetime characteristic *x* let us gain confidence in the product.

The estimated probabilities should now be presented in a probability plot.
With the `weibulltools` function `plot_prob()`, probability plots for several lifetime
distributions can be constructed, and estimates of multiple methods can be displayed at once.

```r
# Weibull grid for estimated probabilities:
weibull_grid <- plot_prob(
  cdf_tbl,
  distribution = "weibull",
  title_main = "Weibull Probability Plot",
  title_x = "Mileage in km",
  title_y = "Probability of Failure in %",
  title_trace = "Method",
  plot_method = "ggplot2"
)

weibull_grid
```

*Figure 3* shows that the consideration of survivors (orange points, *Method: johnson*)
decreases the failure probability in comparison to the sole evaluation of failed
items (green points, *Method: mr*).

Finally, we want to use a log-normal probability plot to visualize the estimated failure probabilities.

```r
# Log-normal grid for estimated probabilities:
lognorm_grid <- plot_prob(
  cdf_tbl,
  distribution = "lognormal",
  title_main = "Log-normal Probability Plot",
  title_x = "Mileage in km",
  title_y = "Probability of Failure in %",
  title_trace = "Method",
  plot_method = "ggplot2"
)

lognorm_grid
```

On the basis of *Figure 3* and *Figure 4*, we can subjectively assess the goodness
of fit of the Weibull and the log-normal distribution. It can be seen that in both grids
the plotted points roughly fall on a straight line. Hence, one can say that the Weibull
as well as the log-normal distribution are good model candidates for the `shock` data.
