knitr::opts_chunk$set( collapse = TRUE, screenshot.force = FALSE, comment = "#>" ) library(weibulltools) # set.seed() for reproducibility of random sampled id's set.seed(2905)

This document presents non-parametric methods for estimating the failure probabilities of units and their presentation in interactive visualizations. A unit can be a single component, an assembly or an entire system.

If the lifetime of a unit is considered to be a continuous random variable *T*,
then the probability that a unit has failed by a certain point in time or a distance
*t* is defined by its CDF *(cumulative distribution function)* *F(t)*.
$$ P(T\leq t) = F(t) $$

In order to obtain an estimate of the cumulative failure probability for each observation
$t_1, t_2, ..., t_n$ two approaches are possible.

Using a parametric lifetime distribution requires that the underlying assumptions
for the sample data are valid. If the distribution-specific assumptions are correct,
the model parameters can be estimated and the CDF is computable. But if the required
conditions could not be met, interpretations and derived conclusions are not reliable.

A more general approach for the calculation of the cumulative failure probability
is to use non-parametric statistical estimators
$\hat{F}(t_1), \hat{F}(t_2), ..., \hat{F}(t_n)$. In comparison to a parametric
distribution no general assumptions must be held. For non-parametric estimators,
an ordered sample of size n is needed. Starting at 1, the ranks
$i \in {1, 2, ..., n }$ are assigned to the ascending sorted sample values. Since
there is a known relationship between ranks and corresponding ranking probabilities
a CDF can be calculated.

But rank distributions are systematically skewed distributions and thus the median value
instead of the expected value $E\left[F\left(t_i\right)\right] = \frac{i}{n + 1}$
is used for estimation [^note1]. This skewness is visualized in *Figure 1*.

library(tidyverse) # using dplyr manipulation functions and ggplot2 x <- seq(0, 1, length.out = 100) # CDF n <- 10 # sample size i <- c(1, 3, 5, 7, 9) # ranks r <- n - i + 1 # inverse ranking df_dens <- expand.grid(cdf = x, i = i) %>% mutate(n = n, r = n - i + 1, pdf = dbeta(x = x, shape1 = i, shape2 = r)) densplot <- ggplot(data = df_dens, aes(x = cdf, y = pdf, colour = as.factor(i))) + geom_line() + scale_colour_discrete(guide = guide_legend(title = "i")) + theme_bw() + labs(x = "Failure Probability", y = "Density") densplot

[^note1]: Kapur, K. C.; Lamberson, L. R.: *Reliability in Engineering Design*,
*New York: Wiley*, 1977, pp. 297-301

In practice, a simplification for the calculation of the median value,
also called median rank, is made. The formula of *Benard's Approximation* is given by
$$\hat{F}(t_i) \approx \frac{i - 0,3}{n + 0,4} $$
and is described in _The Plotting of Observations on Probability Paper _ [^note2].

[^note2]: Benard, A.; Bos-Levenbach, E. C.: *The Plotting of Observations on Probability Paper*,
*Statistica Neerlandica 7 (3)*, 1953, pp. 163-173

However, this equation only provides valid estimates for failure probabilities if
all units in the sample are defectives (`mr_method()`

).

In field data analysis, however, the sample mainly consists of intact units and
only a small fraction of units failed. Units that have no damage at the point of
analysis and also have not reached the operating time or mileage of units that have
already failed, are potential candidates for future failures.

As these, for example, still are likely to fail during a specific time span, like
the guarantee period, the failure probability must be adjusted upwards by these potential candidates.

A commonly used method for correcting probabilities of (multiple) right censored data
is Johnson's method (`johnson_method()`

). By this method, all units that fall into the period
looked at are sorted in an ascending order of their operating time or mileage. If there are units
that have not failed before the *i*-th failure, an adjusted rank for the *i*-th failure
is formed. This correction takes the potential candidates into account and increases
the rank number. In consequence, a higher rank leads to a higher failure probability.
This can be seen in *Figure 1*.

The rank adjustment (`calculate_ranks()`

) is calculated as follows:
$$j_i = j_{i-1} + x_i \cdot I_i, \;\; with \;\; j_0 = 0$$

Here, $j_ {i-1}$ is the adjusted rank of the previous failure, $x_i$ is the number of defectives at time/distance $t_i$ and $I_i$ is the increment that corrects the rank by the candidates. $$I_i=\frac{(n+1)-j_{i-1}}{1+(n-n_i)}$$

The sample size is $n$ and $n_i$ is the number of units that have a lower operating
time/mileage than the *i*-th unit. Once the adjusted ranks are calculated, the
failure probabilities can be estimated according to *Benard's Approximation*.

Other methods in `weibulltools`

that can also handle (multiple) right censored data
are the Kaplan-Meier estimator (`kaplan_method()`

) and the Nelson-Aalen estimator
(`nelson_method()`

).

After computing failure probabilities a method called *Probability Plotting*
is applicable. It is a graphical *goodness of fit* technique that is used in
assessing whether an assumed distribution is appropriate to model the sample data.

The axes of a probability plot are transformed in such a way that the CDF of
a specified model is represented through a straight line (`plot_layout()`

). If the
plotted points (`plot_prob()`

) fall on an approximately straight line it can be said
that the chosen distribution is adequate.

The two-parameter Weibull distribution can be parameterized with $\eta$ and $\beta$
such that the CDF is characterized by the following equation:

$$F(t)=1-\exp\left[ -\left(\frac{t}{\eta}\right)^{\beta}\right]$$
Then a linearized version of the CDF is:
$$ \log\left[-\log(1-F(t))\right] = \beta \cdot \log(t) - \beta \cdot \log(\eta)$$
This leads to the following transformations regarding the axes:

- Abscissa: $x = \log(t)$
- Ordinate: $y = \log\left[-\log(1-F(t))\right]$.

Another version of the Weibull CDF such that the distribution is part of the
log-location-scale family with parameters $\mu$ and $\sigma$ is:

$$F(t)=\Phi_{SEV}\left(\frac{\log(t) - \mu}{\sigma}\right)$$
A linearized representation of this CDF is:
$$\Phi^{-1}_{SEV}\left(F(t)\right)=\frac{1}{\sigma} \cdot \log(t) - \frac{\mu}{\sigma}$$
This leads to the following transformations regarding the axes:

- Abscissa: $x = \log(t)$
- Ordinate: $y = \Phi^{-1}
*{SEV}\left(F(t)\right)$, which is the quantile function of the SEV (_smallest extreme value*) distribution.

It can be easily seen that the parameters can be converted into each other. The corresponding equations are:

$$\beta = \frac{1}{\sigma} \;\; and $$

$$\eta = \exp\left(\mu\right).$$

To apply the introduced methods of non-parametric failure probability estimation
and probability plotting the `shock`

data taken from `SPREDA`

package is used.
In this dataset kilometer-dependent problems that have occurred on shock absorbers are
reported. In addition to failed items the dataset also contains non-defectives,
so called *censored* observations.

The data can be found in *Statistical Methods for Reliability Data* [^note3].

[^note3]: Meeker, W. Q.; Escobar, L. A.: *Statistical Methods for Reliability Data*,
*New York, Wiley series in probability and statistics*, 1998, p. 630

library(SPREDA) # for dataset shock data(shock) # generate random ids for units: shock$id <- sample(c(letters, LETTERS), size = nrow(shock), replace = FALSE) # using tibble for better print: as_tibble(shock) # Comparison of failure modes: ggplot(data = shock, aes(x = Mode, y = Distance)) + geom_boxplot() + theme_bw()

`weibulltools`

For reasons of simplicity we will ignore the differences between the failure
modes *Mode1* and *Mode2* which are shown in *Figure 2*. Thus, we will act as
there is only one mechanism of damage.

First, we are interested in how censored observations influence the estimation of
failure probabilities in comparison to the case where only failed units are considered.
In the latter case we will use the function `mr_method()`

. To deal with survived
and failed units we will use function `johnson_method()`

.

# First case where only failed units are taken into account: df_mr <- mr_method(id = shock$id[shock$Censor == 1], x = shock$Distance[shock$Censor == 1], event = shock$Censor[shock$Censor == 1]) knitr::kable(df_mr, format = "html", row.names = FALSE, align = "c", caption = "Table 1: Failure probabilities using failed items.") # Second case where both, survived and failed units are considered: df_john <- johnson_method(id = shock$id, x = shock$Distance, event = shock$Censor) knitr::kable(df_john, format = "html", row.names = FALSE, align = "c", caption = "Table 2: Failure probabilities using all items.")

If we compare *Table 1* and *Table 2* we can see that survivors decrease probabilities.
But this is just that what was expected since undamaged units with longer or equal
operation times (here mileage) let us gain confidence in the product.

`weibulltools`

The next step is to visualize the estimated probabilities in a probability plot.
With function `plot_prob()`

we can construct plots for several lifetime distributions.
Here we want to use a Weibull grid in which the estimates, given in *Table 1* and
*Table 2*, are plotted. With `plot_prob()`

we can visualize the estimates of one
table (for example *Table 2*). To get the estimates of the other table (here *Table 1*)
in the same graph, we have to add an additional trace (`add_trace()`

function of
`plotly`

package). As a result the obtained estimates can be compared graphically.

# Weibull grid for probabilities calculated with Johnson: weibull_grid <- plot_prob(x = df_john$characteristic, y = df_john$prob, event = df_john$status, id = df_john$id, distribution = "weibull", title_main = "Weibull Probability Plot", title_x = "Mileage in km", title_y = "Probability of Failure in %", title_trace = "Failures (Johnson)") library(plotly) # Using add_trace() # Adding a trace so that estimated probabilities of mr_method can be plotted in # the same graph: # Arguments inside add_trace: # y: Must be transformed such that quantiles of smallest extreme value distribution are plotted. # x: Since distribution in plot_prob is "weibull" the x axis is already on log scale. # Thus x can be plugged in on natural scale. weibull_grid_both <- weibull_grid %>% add_trace(data = df_mr, type = "scatter", mode = "markers", x = ~characteristic, y = ~SPREDA::qsev(prob), name = "Failures (MR)", color = I("#006400"), hoverinfo = "text", text = ~paste("ID:", id, paste("<br>", paste0("Mileage", ":")), characteristic, paste("<br>", paste0("Probability", ":")), round(prob, digits = 5))) weibull_grid_both

*Figure 3* shows that the consideration of survivors (blue points, *Failures (Johnson)*)
decreases the failure probability in comparison to the sole evaluation of failed
items (green points, *Failures (MR)*).

Finally, we want to use a Log-normal probability plot to visualize the estimated
failure probabilities given in *Table 2*.

# Log-Normal grid for probabilities calculated with Johnson: lognorm_grid <- plot_prob(x = df_john$characteristic, y = df_john$prob, event = df_john$status, id = df_john$id, distribution = "lognormal", title_main = "Log-Normal Probability Plot", title_x = "Mileage in km", title_y = "Probability of Failure in %", title_trace = "Defect Shock Absorbers") lognorm_grid

On the basis of *Figure 3* and *Figure 4* we can subjectivly assess the goodness of fit of
Weibull and Log-normal. It can be seen that in both grids, the plotted points
roughly fall on a straight line. Hence one can say that the Weibull
as well as the Log-normal are good model candidates for the `shock`

data.

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.