In leahpom/MATH5793POMERANTZ: Math 5793 R functions

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(MATH5793POMERANTZ)

Purpose

The purpose of this project is to go through Exercise 8.10 from page 473 of Applied Multivariate Statistical Analysis, 6th Edition, by Johnson and Wichern.

Problem 8.10

There are four subparts to this problem. The functions that I made for this project, in accordance with the guidelines, are called bf_int() and pca_sigma(), and are already packaged.

(a)

To perform this analysis, I just have to load in the data into the vignette and use the pca_sigma() function that I made.

stocks <- read.table("/Users/leahpomerantz/Desktop/Math\ 5793/Data/T8-4.DAT")
stocks <- data.frame(stocks)
colnames(stocks) <- c("JPMorgan", "Citibank", "WellsFargo", "RoyalDutchShell", "ExxonMobil")
head(stocks)

Now that we've verified that we have the right data by looking at the first few entries, we can use the function:

pca <- pca_sigma(stocks)
pca$sample.Cov
pca$PC.Sigma

The first matrix above shows us the sample covariance matrix for our stocks data, and the second matrix shows us the coefficients of the principle components. This means that we can write out the principle components using $\LaTeX$, as shown below:

$\widehat{Y_1} = \hat{e_1}'X = 0.2228228X_1 + 0.3072900X_2 + 0.15481030X_3 + 0.6389680X_4 + 0.65090441X_5$

$\widehat{Y_2} = \hat{e_2}'X = 0.6252260X_1 + 0.5703900X_2 + 0.34450492X_3 - 0.2479475X_4 - 0.32184779X_5$

$\widehat{Y_3} = \hat{e_3}'X = 0.3261122X_1 - 0.2495901X_2 - 0.03763929X_3 - 0.6424974X_4 + 0.64586064X_5$

$\widehat{Y_4} = \hat{e_4}'X = 0.6627590X_1 - 0.4140935X_2 - 0.49704993X_3 + 0.3088689X_4 - 0.21637575X_5$

$\widehat{Y_5} = \hat{e_5}'X = 0.1176595X_1 - 0.5886080X_2 + 0.78030428X_3 + 0.1484555X_4 - 0.09371777X_5$

(b)

Our function above calculates the total sample variance (TSV) for us. All we need to do is divide the first three eigenvalues by the TSV, as done below:

eigenS <- eigen(pca$sample.Cov)
lambda <- eigenS$values
l1per <- lambda[1]/pca$TSV
l2per <- lambda[2]/pca$TSV
l3per <- lambda[3]/pca$TSV

From the work done above, we can see that the first principle component explains r l1per of total sample variance, the second principle component explains r l2per of the total sample variance, and the third principle component explains r l3per of the data. This means that the vast majority of the variation in the total variance in the data can be explained by the first principle component, with still a decent amount explained by the second, and then not very much explained by the third. I would also use a scree plot to look at all five principle components before making a definitive statement, but, I think that this is a good start to arguing that we can reduce the data to just the first two principle components.

(c)

This one is also simply a matter of using the function I made, called bf_int.

bf_int(stocks, q = 3, alpha = 0.1)

(d)

I do believe that it can be summarized in fewer than five dimensions. Especially from what we saw in (b) and (c), I think we see a dramatic decrease in the explanatory power of the principle components between the second principle component and the third. The purpose of this project is to answer the questions as given, but, personally, I would prefer to do a scree plot and get a better visual. But, that being said, I think that, based on what we see in (b) alone, we've got evidence that we can use the principle components to reduce the data to only two dimensions. At the very least, we can certainly reduce it to three, given that the three components together explain 89.88% of the data.

leahpom/MATH5793POMERANTZ documentation built on May 10, 2021, 9:52 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com