source(here::here("inst/book/setup.R"))

Toolbox

R and RStudio

Many tools exist to do Data Analysis and Statistics with different degrees of power and difficulty such as:

Choosing the right set of tools for Data Science is often not a very scientific task. Mostly is a matter of what is available and what our colleagues, customers or suppliers use. As with everything it is important to remain open to evaluate new tools and approaches and even to be able to combine them.

In this book we've chosen to provide all examples in R which is a free software environment for statistical computing and graphics. Besides taste and personal preference R brings a significant number of specific advantages in the field of Industrial Data Science:

R allows for reproducible research

This because the algorithms and functions defined to make the calculations can be inspected and all results can be fully reproduced and audited. This is known as reproducible research and is a critical aspect in all areas where a proof is needed such as in equipment validation and product quality reporting.

R functions and tools can be audited and improved

Being an open source language, all R libraries and packages added to the basic environment can be audited, adapted and improved. This is very important because when we enter into details every industry has a slight different way of doing things, different naming conventions, different coeficients and so on.

R is extensible

R is compatible with most other software on the market and is an excellent "glue" tool allowing for example for data loading from excel files, producing reports in pdf and even building complete dashboards in the form of web pages.

Large documentation is available on installing and learning R, starting with the official sites R-project and RStudio.

Packages

All industry specific tools applied throughout this book are available in the form of packages of the programming language R. As with all open source code, they're all available for download with freedom to modification and at no cost.

The amount of packages available is extremely large and growing very fast. When selecting new packages it is recommended to check the latest package update. Packages that have had no improvements since more than a couple of years should be questioned. The field evolves rapidly and compatibility and other issues can become painful. Two ways of obtaining statistics on package history are metacran and RStudio package manager.

Additionally an original package named {industRial} has been developed as a companion package for this book.

Installation {#installation}

The companion package to this book can downloaded from github with the following command:

devtools::install_github("J-Ramalho/industRial")

The list below identifies which are the remaining packages required to run all the coded of the book examples. Note that it is not required to download them all at once.

The column Download precises if a package is downloaded automatically when the {industRial} package is downloaded (imports) or if it needs be downloaded manually by the reader (suggests). In technical terms this differentiation corresponds to the DESCRIPTION file and allows for a progressive installation of the required software.

installed_packages <- bind_rows(
  installed.packages(lib.loc = .Library) %>% as_tibble(),
  installed.packages(lib.loc = .libPaths()[1]) %>% 
    as_tibble()
)

rpackages <- read_csv("rpackages.csv")

recommended_packages <- rpackages %>%
  left_join(installed_packages) %>%
  filter(!is.na(Download)) %>%
  select(Download, Package, Domain, Version, Depends) %>%
  arrange(Download, Domain)

recommended_packages %>%
  kable()

In the book text we don't see the loading instructions for the installed packages over and over again everytime an example is given to avoid repetition (e.g. running library(dplyr) before each code chunk). Be sure to load at minimum the packages below before trying any example:

ds_pkgs <- c("tidyverse", "scales", "janitor", "knitr", "stats", "industRial",
  "viridis", "broom", "patchwork")
purrr::map(ds_pkgs, library, character.only = TRUE)

Beware of the common issue of function masking. This happens more often in R when compared to python. As we tend to load all the sets of functions from each package we end up with conflicting function names. In the scope of this text it is mostly the function filter() from {dplyr} which conflicts with the function with the same name from {stats}. We tackle this with the simple technique of adding filter <- dplyr::filter in the beginning of our script to precise which function we want to give priority and we pre-append the package name to all calls of the other function such as stats::filter. For more sophisticated ways to handle this issue we suggest the package {import}.

Highlights

We're highlighting now some specific packages that are used in the book and that bring powerful features in analysis of data from R&D and Operations. Wherever they are required in the book we loaded them explicitly in the text to help tracking where the specific functions come from.

six sigma

{SixSigma} is a complete and robust R package. It provides many well tested functions in the area of quality and process improvement. We're presenting a full example with the gage r&R function in our MSA Case Study. As many other industrial packages, the {SixSigma} package is from before the {tidyverse} era and its plots have been not been developed with {ggplot2}. This makes integration in newer approaches harder. The data output is still nevertheless fully exploitable and very useful. The package is part of an excellent book with the same name published by @Cano2012.

qcc

The Quality Control Charts package, {qcc} is another very complete and solid package. It offers a very large range of statistical process control charts and capability analysis. Several examples of the control charts are available in its vignette: qcc vignette that we develop further in our SPC Case Studies.

qicharts2 {#qicharts2}

We recommend {qichart2} specifically for the good pareto plots. The package also provides statistical process control charts which are based on {ggplot2} and can serve as an easier alternative to the {qcc} package. As many niche packages we need to be aware that the number of contributors is small meaning that it cannot be as thoroughly tested as community packages.

DoE.base

This package is one of the most complete and vast packages in Design of Experiments. {DoE.base} is a first of a large suite of packages on the topic, it has vast functionality and is very well documented. We do some exploration of the automatic generation of designs in the DOE case studies. Full documentation available under: DoE.base

agricolae

Agricolae is a long tested package in the domain of design of experiments. It has been developed for the domain of Agricultural Research but can be used elsewhere. We make a small use, specifically to obtain the function fisher LSD but believe the package has a wealth of functions and methodologies to be explored.

rsm

The Response Surface Methods {rsm} is the best option in our view to produce 3D plots from linear models. It has specific features to use directly the models removing all the work to produce the data and feed generic 3D plotting functions. The package is larger than this and contains many support functions in the domain of design of experiments.

car

The {car} package which stands for Companion for Applied Regression is also used in many occasions as it contains many useful functions to assess the performance of linear models and anova. This package is combined with a complete book by @Fox2019.

RcmdrMisc

This package by the same author of the {car} provides additional miscellaneous functions for statistical analysis. Although it is part of a point and click interface for R we value it for its functions and plots in the domain of linear regression.

broom

The mission of {broom} is to Convert statistical objects into tidy tibbles. This is quite usefull when we want to reuse the output of the statistical analysis such as the R-squared in data pipelines with {tidyverse} packages. It becomes specially handy to obtain printing quality outputs in {Rmarkdown} documents with tables rendered with {kable} and in {shiny} apps. Several examples are present throughtout our book and mostly in the tutorials.

skimr {#skimr}

This package comes from ropensci a strong and open community supported by large global organizations such as the NASA. {skimr} is an interesting and powerful alternative to the base summary() function. Two main features make it a strong candidate for regular utilization: the first is its tight integrated with {tidyverse} and {knitr} making it possible to integrate it in pipelines, filtering and so on and in Rmarkdown chunks with specific printing arguments; the second is its extensive customization capabilities allowing to add and remove indicators, data types and presentation formats and agregration levels.

stats

Many functions from the packages discussed before are built on the large and extremely well tested {stats} package. This package is installed directly with R and consolidates software code that has been improving and tested for decades. As an example the source code of the lm function has close to 1000 lines. The complete package has more than 400 functions that can be listed with library(help = "stats") or ls("package:stats")

Datasets {#datasets}

All datasets presented throughout the book are fully anonymized. Once the package is correctly installed it can be loaded in the R session as usual with the library() function.

devtools::install_github("J-Ramalho/industRial")
library(industRial)

The primary goal of {industRial} is to make easily available all the data sets from all case studies. We can easily look for a data set by typing industRial:: and tab. The complete list can also be obtained with the snippet below:

data(package = "industRial") %>%
  pluck("results") %>%
  as_tibble() %>%
  select(Item, Title) %>%
  kable()

Once the package is loaded the data objects become immediately available in memory and can directly be used in the examples presented or for further exploration. Lets confirm this invoking the first data set:

dial_control %>%
  head() %>%
  kable()

The dateset can be used and manipulated like any other dataset created in the session or loaded otherwise. For example it can be filtered and assigned to a new variable name:

dial_peter <- dial_control %>%
  filter(Operator == "Peter") 
dial_peter %>%
  head(2) %>%
  kable()

Functions {#functions}

Besides the data sets the {industRial} package also contains toy functions to plot Statistical Process Control (SPC) charts. The objective here is to showcase how to build such functions and their scope of application is limited to the book case studies. For complete and robust SPC functions we recommend using the {QCC} package also described below.

Additionally the package contains theme functions to print and customize the aesthetics of spc charts and other charts. These themes are built on top of the {ggplot2} by H.Wickham and {cowplot} package by Claus O.Wilke. The main objective is to give the reader a starting point for customization of charts in this domain.

A functions can conveniently be accessed on the console with industRial:: and then tab. The complete list of themes and functions can be seen with:

lsf.str("package:industRial") %>%
  unclass() %>%
  as_tibble() %>%
  kable()

For each function a help page is available and can be obtained the same way as any other R data sets, themes and functions with ?<object> (e.g. ?chart_xbar)

To go even deeper and get access to all the code, the original book Rmd files are also bundled in the package and can be seen in the book folder. A way to get the exact folder path is:

paste0(.libPaths()[1], "/industRial/book")

Tutorials {#tutorials}

A set of practical exercises on key concepts presented throughout this book is available either on the web or locally, instructions follow.


On the web

the tutorials in the list below are published on the shinyapp.io server and can be freely accessed with the links below:

+------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Topic/Link | Content | +==================================================================+======================================================================================================================================================================================================================================================================================+ | Pareto chart | This tutorial builds on the The dial polishing workshop case study from the Design for Six Sigma chapter, train building pareto charts using the {qichart2} package and explore how playing with different variables gives new insights into apparently simple data collections. | +------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | DOE Anova | This tutorial explores how the p value is calculated by playing with a dynamic anova chart. This exercise is based on the The e-bike frame hardening process of the DOE Interactions chapter. | +------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Response Surface | This tutorial tests 3D visualization skills by playing with 3D response surface plots and the related interaction plots using the battery_charging dataset and the {rsm} package. | +------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Process Capability | In this tutorial we can play with the process centering variability and see how this is translated in the process indicators "percentage out of spec" and Cpk. | +------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+



Locally

The same set of tutorials can also be run locally which can be convenient as they load faster. This also allows for further exploration as the original tutorial code becomes available. For downloading instructions refer to the packages (#installation) session.

Next load the packages:

library(industRial)
library(learnr)

and list the tutorials with:

learnr::available_tutorials(package = "industRial")

choose a tutorial and run it as follows:

learnr::run_tutorial(package = "industRial", "anova")

The original files are available in the package tutorials folder. Their names correspond to the tutorial names listed before so there is a simple way to open the desired file, e.g.:

rstudioapi::navigateToFile(
  paste0(.libPaths()[1], "/industRial/tutorials/anova/anova.Rmd")
  )


Try the industRial package in your browser

Any scripts or data that you put into this service are public.

industRial documentation built on June 11, 2021, 5:10 p.m.