In statwonk/reliability:

Cloud computing has empowered developers to create very complex inter-connected systems with unique network topologies. Software engineers cope by automating configuration, standarizing operating procedures, validating system changes, soft-deployments ("canaries"), easy roll-back to known software versions, and fire drills (read more).

Absent from the discussion appears to be statistical techniques for understanding and improving system reliability. This software library seeks to create a bridge between statistical and engineering techniques.

Simple two component series

Consider the simplest setup, a web server and a database,

[ Web server @ 0.90 reliability ] ------> [ Database @ 0.90 reliability ]

the reliability of this system can be deconstructed into three terms,

= P(web server failure) + P(database failure) - [ P(web server failure) * P(database failure) ]

The likelihood of system failure for a two component series is the addition of each component's likelihood of failure e.g. P(web server failure) minus the likelihood that both components fail. The intuition is that it only takes one component failure to cause system failure, so we must subtract the unnecessary probability that both fail.

Using library(reliability), here's the calculation step-by-step:

library(magrittr)
library(reliability)

# just one node
node(name = "web server", reliability = 0.90)

# now showing probability of failure
node(name = "database",  reliability = 0.90) %>%
  prob_of_failure()

Our two-component system,

node(name = "web server", reliability = 0.90) %>%
  node(name = "database",  0.90)

Putting it together,

node(name = "web server", reliability = 0.90) %>%
  node(name = "database",  0.90) %>%
  prob_of_failure()

Customize for your cluster

Similar to Legos, reliability lets you build complex systems out of simple components (node). For example, we can add an arbitrary number of components to a series like,

node(name = "load balancer", reliability = 0.999) %>%
  node(name = "web server",  0.99) %>%
  node(name = "caching layer",  0.95) %>%
  node(name = "database",  0.9999) %>%
  prob_of_failure() %>% scales::percent()