knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 6 )
library(MATH4793MOSER)
This project reproduces results from example 5.11 in our textbook that seeks to demonstrate a case where one would need to do more analysis than just computing $T^2$ values to identify outlier observations in a data set.
The example data in question is related to certain measurements made on an automated process. There are 4 variables: voltage, current, feedspeed, and gas flow.
welder <- MATH4793MOSER::welder head(welder)
We must check the varaibles for approximate normality to be able to satisfy the conditions for the statistical tests and plots we will produce later. We will use univariate, bivariate, and chi-squared checks for normality. Other assumptions necessary are that there is not a time relation between the observations. We do not investigate that in-depth here, though the book mentions it is reasonable to assume that each observation is sufficiently independent over time.
The book suggests a logarithmic transformation to the fourth variable, gas flow. We will perform normality checks on the data set both before and after this transformation.
For each variable, we will create univariate QQ plots to assess normality. I will also use the Shapiro-Wilk test to supplement these QQ plots.
A univariate QQ-plot plots the sorted data for a given variable against the anticipated, evenly-spaced quantiles from a normal distribution. If the variable is close to normal, this plot will be approximately linear. The R package s20x's Shapiro-Wilk test recreates these QQ-plots and also shows a histogram of the varaible against a normal distribution. These QQ plots verify the accuracy of those created by my package.
voltageqq <- MATH4793MOSER::univariateqq(welder$voltage) plot(voltageqq$TheoreticalQuantiles,voltageqq$Data) currentqq <- MATH4793MOSER::univariateqq(welder$current) plot(currentqq$TheoreticalQuantiles,currentqq$Data) feedspeedqq <- MATH4793MOSER::univariateqq(welder$feedspeed) plot(feedspeedqq$TheoreticalQuantiles,feedspeedqq$Data) gasflowqq <- MATH4793MOSER::univariateqq(welder$gasflow) plot(gasflowqq$TheoreticalQuantiles,gasflowqq$Data) s20x::normcheck(welder$voltage) s20x::normcheck(welder$current) s20x::normcheck(welder$feedspeed) s20x::normcheck(welder$gasflow)
Based on the above output, it is appropriate to assume normality for each of the variables except gas flow. The textbook suggests a log transformation for gas flow which is consistent with these findings. Below, we will also perform bivariate normality and chi-squared multivariate normality check. It is reasonable to assume that bivariate checks involving gas flow will suggest non-normality and that the chi-squared QQ plot may also not reflect that the dataset is normal because of the gas flow variable.
I use my packages bivariate normality check to check bivariate normality for each pair of variables. This bivariate normality checks calculates the proportion of the observations falling within the anticipated 50% confidence region (based on the bivariate normal distribution) if the two variables were bivariate normal. Proportions straying from 0.5 do not support bivariate normality.
MATH4793MOSER::bivariatenormcheck(welder$voltage,welder$current) MATH4793MOSER::bivariatenormcheck(welder$voltage,welder$feedspeed) MATH4793MOSER::bivariatenormcheck(welder$voltage,welder$gasflow) MATH4793MOSER::bivariatenormcheck(welder$current,welder$feedspeed) MATH4793MOSER::bivariatenormcheck(welder$current,welder$gasflow) MATH4793MOSER::bivariatenormcheck(welder$feedspeed,welder$gasflow)
As expected, the bivariate normality checks involving gas flow have proportions that depart somewhat notably from the expected proportion of 0.5. Across the board, bivariate normality seems an appropriate assumption for the expected 50% contours.
We create a chi-squared QQ plot to assess normality aggregate across all four varaibles. In these plots, the quadratic statistical distance is calulcated for each variable, sorted, and plotted against theoretical chi-squared quantiles. Like the univariate QQ plots, approximate linearity suggests that the data set as a whole is comprised of normal variables.
chiqq <- chisquaredqq(welder) plot(chiqq$TheoQuants, chiqq$CompQuants, main = "Chi-Squared QQ Plot", xlab = "Theoretical Quantiles", ylab = "Computed Quantiles") ``` The chi-squared plot appears mostly linear, suggesting normality across the data set. #### Conclusions As mentioned, the normality checks applied (univariateqq, bivariate normal, and a chi-squared qq plot) suggest it would be reasonable to assume the varaibles are distributed normally. Only the gas flow variable may need further attention. It's univariate qq plot was the least linear and bivaraite nromal checks involving it departed the most notably from the expected 50% proportion of data. For these reasons, we will apply a natural log transformation to that variable specifically. ### After Suggested Log Transformation We now apply the suggested transformation to the data. #### Transforming the Data Here, we apply the logarithmic transformation to the gas flow variable and perform the same normality checks to assure that the transformation makes the gas flow variable more normal than it was prior. ```r welder$gasflow <- log(welder$gasflow) head(welder)
We recreate the same univariate QQ plots.
voltageqq <- MATH4793MOSER::univariateqq(welder$voltage) plot(voltageqq$TheoreticalQuantiles,voltageqq$Data) currentqq <- MATH4793MOSER::univariateqq(welder$current) plot(currentqq$TheoreticalQuantiles,currentqq$Data) feedspeedqq <- MATH4793MOSER::univariateqq(welder$feedspeed) plot(feedspeedqq$TheoreticalQuantiles,feedspeedqq$Data) gasflowqq <- MATH4793MOSER::univariateqq(welder$gasflow) plot(gasflowqq$TheoreticalQuantiles,gasflowqq$Data) s20x::normcheck(welder$voltage) s20x::normcheck(welder$current) s20x::normcheck(welder$feedspeed) s20x::normcheck(welder$gasflow)
Re-calculation of the bivariate normality checks with the transformed gas flow variable.
MATH4793MOSER::bivariatenormcheck(welder$voltage,welder$current) MATH4793MOSER::bivariatenormcheck(welder$voltage,welder$feedspeed) MATH4793MOSER::bivariatenormcheck(welder$voltage,welder$gasflow) MATH4793MOSER::bivariatenormcheck(welder$current,welder$feedspeed) MATH4793MOSER::bivariatenormcheck(welder$current,welder$gasflow) MATH4793MOSER::bivariatenormcheck(welder$feedspeed,welder$gasflow)
We recreate the chi-squared QQ plot with the transformed data.
chiqq <- chisquaredqq(welder) plot(chiqq$TheoQuants, chiqq$CompQuants, main = "Chi-Squared QQ Plot", xlab = "Theoretical Quantiles", ylab = "Computed Quantiles")
We can see that the logarithmic transformation did bring the gas flow varaible's distribution closer to normal (as seen in the univariate checks) and in making the chi-squared QQ plot for the entire data closer to straight. This implies the transformation does indeed help us better assume normality conditions.
An R function built in the package MATH4793MOSER recreates the T-squared chart from the example. This chart shows the T-squared statistic calculated for each observation and plots it against upper control limits defined by 95% and 99% confidence.
tsquaredchart(welder)$Plot
Like in the book, we can see no observation is considered an outlier at 99% confidence and only one at 95% confidence.
Here I recreate figure 5.10 using a function that can generate a generic confidence ellipse for a given confidence level.
fig5.10 <- confidenceellipse(welder$voltage,welder$gasflow,0.01) fig5.10$Plot
Like in the book, we see an outlier lying outside of the ellipse.
A final recreation of the univariate quality control chart for gas flow is created.
univariatexbarchart(welder$gasflow)$Plot
This chart suggests that there is indeed an outlier observation as there is one where the gas flow variable is 3 standard deviations greater than the mean. The aggregate of these three plots suggests that that observation should be considered an outlier and would not be detected by the T-squared investigation alone.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.