```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(MATH5793POMERANTZ)
```
The purpose of this project is to go through Exercise 8.10 from page 473 of Applied Multivariate Statistical Analysis, 6th Edition, by Johnson and Wichern.
There are four subparts to this problem. The functions that I made for this project, in accordance with the guidelines, are called bf_int() and pca_sigma(), and are already packaged.
To perform this analysis, I just have to load the data into the vignette and use the pca_sigma() function that I made.
```r
stocks <- read.table("/Users/leahpomerantz/Desktop/Math 5793/Data/T8-4.DAT")
stocks <- data.frame(stocks)
colnames(stocks) <- c("JPMorgan", "Citibank", "WellsFargo", "RoyalDutchShell", "ExxonMobil")
head(stocks)
```
Now that we've verified that we have the right data by looking at the first few entries, we can use the function:
```r
pca <- pca_sigma(stocks)
pca$sample.Cov
pca$PC.Sigma
```
The first matrix above is the sample covariance matrix for our stocks data, and the second gives the coefficients of the principal components. This means that we can write out the principal components using $\LaTeX$, as shown below:
$\widehat{Y_1} = \hat{e_1}'X = 0.2228228X_1 + 0.3072900X_2 + 0.15481030X_3 + 0.6389680X_4 + 0.65090441X_5$
$\widehat{Y_2} = \hat{e_2}'X = 0.6252260X_1 + 0.5703900X_2 + 0.34450492X_3 - 0.2479475X_4 - 0.32184779X_5$
$\widehat{Y_3} = \hat{e_3}'X = 0.3261122X_1 - 0.2495901X_2 - 0.03763929X_3 - 0.6424974X_4 + 0.64586064X_5$
$\widehat{Y_4} = \hat{e_4}'X = 0.6627590X_1 - 0.4140935X_2 - 0.49704993X_3 + 0.3088689X_4 - 0.21637575X_5$
$\widehat{Y_5} = \hat{e_5}'X = 0.1176595X_1 - 0.5886080X_2 + 0.78030428X_3 + 0.1484555X_4 - 0.09371777X_5$
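The internals of pca_sigma() are packaged and not shown here, but a function like it presumably boils down to an eigendecomposition of the sample covariance matrix, whose eigenvectors supply the $\hat{e}_i$ coefficient vectors above. A minimal sketch follows; since T8-4.DAT is not bundled with this vignette, the built-in EuStockMarkets data stands in for the stock data, so the numbers will differ from the exercise's:

```r
# Sketch of what a function like pca_sigma() plausibly computes.
# EuStockMarkets (built into R) stands in for T8-4.DAT; daily log
# returns of four European stock indices.
returns <- diff(log(EuStockMarkets))

S     <- cov(returns)    # sample covariance matrix
eig   <- eigen(S)        # eigenvalues and eigenvectors of S
coefs <- eig$vectors     # columns play the role of the e-hat_i vectors above
tsv   <- sum(diag(S))    # total sample variance

# Sanity check: the eigenvalues of S sum to its trace (the TSV).
all.equal(sum(eig$values), tsv)
```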
Our function above calculates the total sample variance (TSV) for us. All we need to do is divide the first three eigenvalues by the TSV, as done below:
```r
eigenS <- eigen(pca$sample.Cov)
lambda <- eigenS$values
l1per <- lambda[1] / pca$TSV
l2per <- lambda[2] / pca$TSV
l3per <- lambda[3] / pca$TSV
```
From the work done above, we can see that the first principal component explains `r l1per` of the total sample variance, the second principal component explains `r l2per`, and the third principal component explains `r l3per`. This means that the vast majority of the variation in the data is explained by the first principal component, with a decent amount still explained by the second and not very much by the third. I would also use a scree plot to look at all five principal components before making a definitive statement, but I think this is a good start to arguing that we can reduce the data to just the first two principal components.
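A scree plot like the one mentioned above is straightforward in base R. This is only a sketch: EuStockMarkets log returns again stand in for the stock data (which is not bundled here), so the curve's shape will differ from what the exercise's data would give:

```r
# Sketch of a scree plot: proportion of total sample variance
# explained by each principal component.
# EuStockMarkets stands in for the stock data from T8-4.DAT.
returns <- diff(log(EuStockMarkets))
lambda  <- eigen(cov(returns))$values   # eigenvalues, largest first
prop    <- lambda / sum(lambda)         # proportion of TSV per component

plot(prop, type = "b",
     xlab = "Principal component",
     ylab = "Proportion of total sample variance",
     main = "Scree plot")
```

A sharp "elbow" in this plot is the usual visual cue for how many components to retain.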
This part is again simply a matter of using the bf_int() function that I made:
```r
bf_int(stocks, q = 3, alpha = 0.1)
```
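Like pca_sigma(), bf_int() is packaged and its internals are not shown here. Assuming it implements the large-sample Bonferroni simultaneous confidence intervals for the eigenvalues discussed in Johnson and Wichern, $\hat{\lambda}_i / (1 + z_{\alpha/(2q)}\sqrt{2/n}) \le \lambda_i \le \hat{\lambda}_i / (1 - z_{\alpha/(2q)}\sqrt{2/n})$, a rough sketch might look like the following, with EuStockMarkets returns once more standing in for the stock data:

```r
# Sketch of large-sample Bonferroni intervals for the first q
# eigenvalues of the sample covariance matrix. This assumes bf_int()
# uses the standard Johnson & Wichern formula; EuStockMarkets stands
# in for the exercise's stock data.
returns <- diff(log(EuStockMarkets))
n      <- nrow(returns)
lambda <- eigen(cov(returns))$values

q     <- 3
alpha <- 0.1
z <- qnorm(1 - alpha / (2 * q))   # Bonferroni-adjusted normal quantile

lower <- lambda[1:q] / (1 + z * sqrt(2 / n))
upper <- lambda[1:q] / (1 - z * sqrt(2 / n))
cbind(lower, estimate = lambda[1:q], upper)
```

Each interval is asymmetric about its point estimate, which is characteristic of this large-sample eigenvalue interval.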
I do believe that the data can be summarized in fewer than five dimensions. From what we saw in (b) and (c) especially, there is a dramatic drop in explanatory power between the second principal component and the third. The purpose of this project is to answer the questions as given, but personally I would prefer to make a scree plot and get a better visual. That said, based on what we see in (b) alone, we have evidence that the principal components can reduce the data to only two dimensions. At the very least, we can certainly reduce it to three, given that those three components together explain 89.88% of the total sample variance.