knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(practice)
library(tidyverse)
library(cluster)
library(factoextra)

Principal Component Analysis and K-Means Clustering with the Practice Package!

Intro to the Practice Package

We were interested in doing two analyses for a given data set: k-means clustering and principal component analysis. The Practice package has three functions to assist with conducting principal component analysis and k-means cluster analysis when given a dataset and additional arguments.

The Dataset

The dataset provided with this package is an NBA dataset which contains per-player attributes for the 2017-2018 season.

This dataset is combined with another dataset containing the corresponding salaries for each player.

load("~/repos/practice/data/NBA.rda")
head(NBA)
load("~/repos/practice/data/NBA_salary.rda")
head(NBA_salary)

Practicing K-means clustering and PCA with the Practice package!

Example 1: The iris dataset

Is it a tutorial without the iris dataset?

K-means clustering

K-means clustering is a method of grouping data into clusters based on each point's distance from a set of k cluster centers. Optimal groupings are found by iteratively setting cluster centers and minimizing the overall distance from data points to their nearest center.
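For intuition, here is a minimal sketch of k-means using base R's kmeans() on the numeric iris columns (an illustration only, not the package's own implementation):

# run k-means with k = 3 on the four numeric iris columns
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
km$centers                       # the fitted cluster centers
table(km$cluster, iris$Species)  # compare clusters to the true species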

{optimal_cluster}

Because k must be supplied before running k-means, the first function, {optimal_cluster}, uses three methods to help determine the optimal number of clusters for a given dataset. To use it, input the dataset you want to explore.

op <- optimal_cluster(iris)

The object created has three visualizations. The first is the 'elbow method', in which the optimal number of clusters is where the total within-cluster sum of squares stops sharply declining.

op[[1]]

In this case, it appears that 3 clusters is optimal.

The next graph uses the silhouette method, which determines the optimal number of clusters by finding the k with the maximal average silhouette width.

op[[2]]

In this case, it appears that 2 clusters is optimal.

Finally, the gap statistic determines the optimal clustering by measuring how the intra-cluster variation compares to what would be expected under a null reference distribution.

op[[3]]

The gap statistic also shows 2 clusters is optimal.
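Since factoextra is already loaded, comparable diagnostics can be produced directly with fviz_nbclust() (a sketch on the numeric iris columns; {optimal_cluster} may compute these differently):

# elbow, silhouette, and gap statistic plots via factoextra
fviz_nbclust(iris[, 1:4], kmeans, method = "wss")
fviz_nbclust(iris[, 1:4], kmeans, method = "silhouette")
fviz_nbclust(iris[, 1:4], kmeans, method = "gap_stat")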

{visualization}

With the information from the graphs in {optimal_cluster}, you can pass those numbers into {visualization} along with your data to visualize the groupings.

visualization(d = iris, n = 2:3)
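For comparison, factoextra's fviz_cluster() can draw similar groupings from a plain kmeans fit (a sketch; the exact plot produced by {visualization} may differ):

# visualize a k = 2 solution on the numeric iris columns
km2 <- kmeans(iris[, 1:4], centers = 2, nstart = 25)
fviz_cluster(km2, data = iris[, 1:4])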

Principal Component Analysis {PCA}

The function {PCA} outputs a variety of information. To use it, input your dataset and then a grouping variable you would like to group by. The output includes a graph displaying the amount of variation explained by each principal component, a summary of the PCA, a graph showing how each variable in the dataframe loads on PC1 and PC2, and finally a graph of each data point plotted on PC1 and PC2, grouped by the user-input variable.

p <- PCA(d = iris, group = iris$Species)

The function outputs 4 pieces of information. The first is a graph visualizing how much variation is explained by each principal component (PC). The second shows how each variable in the dataframe varies across PC1 and PC2.

p[[2]]

The third piece of information is a summary table, which shows the standard deviation and variance explained by each PC.

p[[4]]

Finally, the fourth piece of information shows all data points plotted on PC1 and PC2, grouped by a user-determined variable.
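The same quantities can be computed by hand with base R's prcomp(), which these outputs summarize (a sketch on the numeric iris columns; whether {PCA} scales the variables is an assumption here):

pr <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pr)         # standard deviation and variance explained per PC
pr$rotation[, 1:2]  # how each variable loads on PC1 and PC2
head(pr$x)          # PC scores for each observation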

Example 2: The NBA dataset

Now for the NBA dataset. Salary will be joined in shortly; first, player rating is recoded into a factor.

Player rating is a factor that might be interesting to group by after doing k-means clustering or PCA. To reduce the number of groups, players are binned into low, medium, and high ratings.

NBA_f <- NBA %>% mutate(Rating = case_when(Rating < 80 ~ 1,
                                    Rating >= 80 & Rating < 90 ~ 2,
                                    Rating >= 90 ~ 3))
NBA_f <- NBA_f %>% mutate(Rating = factor(Rating, levels = c(1, 2, 3),
                                        labels = c("low", "medium", "high")))

Now let's see how many clusters are optimal for the NBA dataset.

optimal_cluster(NBA_f)

The elbow method shows that 4 clusters is the optimal number, the silhouette method 3, and the gap statistic 4. So we will pass 1 through 4 into {visualization} to see the groupings.

visualization(NBA_f, n = 1:4)

Next, to compare how the clusters separate, we turn to PCA. We will see how the variability in the measured statistics falls into groups based on position, player rating, and salary.

PCA(NBA_f, group = NBA_f$Pos)
PCA(NBA_f, group = NBA_f$Rating)

Now looking at salary: there are roughly 50 players with no salary data. We first tidy the data, turn salary into a factor, and then run the PCA.

# keep only the columns needed from the salary data
NBA_salary2 <- NBA_salary %>% select(Name, salary)
NBA2 <- left_join(NBA[-4], NBA_salary2, by = "Name")
NBA2 <- na.omit(NBA2)
NBA_f <- left_join(NBA_f[-4], NBA_salary2, by = "Name")
# bin salary into low (< $1M), medium ($1M-$10M), and high (>= $10M)
NBA_f <- NBA_f %>% mutate(salary = case_when(salary < 1000000 ~ 1,
                                             salary >= 1000000 & salary < 10000000 ~ 2,
                                             salary >= 10000000 ~ 3))
NBA_f <- NBA_f %>% mutate(salary = factor(salary, levels = c(1, 2, 3),
                                          labels = c("low", "medium", "high")))
# drop players with no salary data
NBA_f <- na.omit(NBA_f)
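A quick check of how many players were dropped and how the salary bins fill out (optional):

nrow(NBA) - nrow(NBA_f)  # players with no salary data
table(NBA_f$salary)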

PCA(NBA_f, group = NBA_f$salary)

Using all the functions

It is sometimes useful to run PCA first and then look for clusters in the principal components. Here we use all three functions to group players based on their PCs.

# PCA then k-means clustering
PCANBA <- PCA(NBA_f, group = NBA_f$Pos)
NBA3 <- cbind(NBA_f, PCANBA[[2]][[5]])

optimal_cluster(NBA3[, -4:-14])
visualization(NBA3[, -4:-14], n = 1:4)

PCA interpretation

Let's see how we can collapse the many player stats variables in the NBA dataset into a few dimensions using a PCA. We can then compare the first few principal components to other variables like a player's rating, salary, and position.

We will take our NBA data frame and select the variables we want to collapse. For a PCA we want to choose the variables that describe a player's performance on the court, so we will select columns PTS:PF and run {PCA}, which uses the function prcomp() under the hood.

p <- NBA_f %>%
  select(PTS:PF) %>%
  PCA(group = NBA_f$Rating)

p[[6]]

Now we have loadings for each principal component. How much of the variation in player stats is captured by the first component? How much is captured in the first three? We will work with the first 4 components, which capture 83% of the variation.
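You can verify the cumulative variance directly from the prcomp object in the output (assuming, as the code below does, that p[[5]] is the prcomp fit):

pr <- p[[5]]
cumsum(pr$sdev^2) / sum(pr$sdev^2)  # cumulative proportion of variance per PC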

Let's plot the loadings of each variable for the first 4 components.

vars <- rownames(p[[5]]$rotation)  # variable names from the rotation matrix
head(vars)
p[[5]]$rotation[, 1:4] %>%
  as_tibble() %>%
  mutate(variable = vars) %>% # add a column naming each player stat
  gather("PC", "loading", PC1:PC4) %>%
  ggplot(aes(x = variable, y = loading)) +
  geom_col() +
  theme_minimal() +
  facet_wrap(~PC) +
  coord_flip() +
  ggtitle("The loadings of each variable for the first 4 PCs")

What do you notice about how the different variables load onto each principal component? Which variables come out strongest in each component? Each position in basketball fulfills slightly different roles. Centers are tall and stay close to the basket. Point guards are fast and lead plays. Which principal component might you expect point guards to load highly on? And which one would centers load highly on?

It seems that PC1 separates the best players from the rest. Players that load highly on PC1 are good at stealing the ball, rebounding, scoring points, assists, and making free throws. Aside from high turnovers, players high on PC1 don't have any significant trade-offs in performance. Let's look at PC2. Players that score highly on PC2 make few rebounds, few personal fouls, few offensive rebounds, and few blocks, but they make a lot of 3 point shots. Players that score highly on PC3 steal the ball a lot, have higher turnovers, and have many assists, but have few blocks and 3 point shots. Finally, players high on PC4 rarely steal the ball or commit personal fouls, and make and attempt few 3 point shots, but make a lot of free throws.

Let's see if PC1 actually captures the best players. If it does, it should correlate with Rating and/or salary. We can use PCA here to test whether the NBA's valuation of the players (Rating, salary, minutes played per game) matches their in-game performance.

We'll first add the PCs to our NBA dataframe.

nba_with_loadings <- cbind(NBA2[1:4], p[[6]])
nba_with_loadings <- cbind(nba_with_loadings, NBA2[17])

Now let's see if the PCs correlate with Rating to see how much each contributes to overall skill.

nba_with_loadings %>% ggplot(aes(x = PC1, y = Rating)) + geom_point() + geom_smooth()

fit <- lm(Rating ~ PC1, data = nba_with_loadings)
summary(fit)


nba_with_loadings %>% ggplot(aes(x = PC2, y = Rating)) + geom_point() + geom_smooth()

fit <- lm(Rating ~ PC2, data = nba_with_loadings)
summary(fit)


nba_with_loadings %>% ggplot(aes(x = PC3, y = Rating)) + geom_point() + geom_smooth()

fit <- lm(Rating ~ PC3, data = nba_with_loadings)
summary(fit)


nba_with_loadings %>% ggplot(aes(x = PC4, y = Rating)) + geom_point() + geom_smooth()

fit <- lm(Rating ~ PC4, data = nba_with_loadings)
summary(fit)
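To compare the four fits at a glance, the R-squared values can be collected in one pass (a compact alternative, not part of the original workflow):

# R-squared of Rating ~ PC for each of the first four components
sapply(paste0("PC", 1:4), function(pc) {
  summary(lm(reformulate(pc, "Rating"), data = nba_with_loadings))$r.squared
})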

PC1 correlates very highly with Rating. The other PCs explain some of the remaining variation in Rating, but only to a small degree. Let's see if PC1 correlates with other metrics of how much players are valued (salary, minutes per game).

nba_with_loadings %>% ggplot(aes(x = PC1, y = `MIN.G`)) + geom_point() + geom_smooth()

fit <- lm(`MIN.G` ~ PC1, data = nba_with_loadings)
summary(fit)

PC1 predicts minutes per game. How about salary?

nba_with_loadings %>% ggplot(aes(x = PC1, y = salary)) + geom_point() + geom_smooth()

fit <- lm(salary ~ PC1, data = nba_with_loadings)
summary(fit)

We can also use a PCA to identify how categorical variables differ in multidimensional space. Each position plays slightly differently than the others, with centers and point guards likely the most different. Can we use the PCA data to see how players of each position vary?

We'll use our NBA dataframe with the 4 PCs to calculate the mean of each PC for each position.

df_summ <- nba_with_loadings %>%
  group_by(Pos) %>%
  summarise(pca1_avg = mean(PC1), pca2_avg = mean(PC2),
            pca3_avg = mean(PC3), pca4_avg = mean(PC4))
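An equivalent, more compact version uses dplyr's across(), though it keeps the PC1-PC4 column names rather than the pca*_avg names used above:

df_summ2 <- nba_with_loadings %>%
  group_by(Pos) %>%
  summarise(across(PC1:PC4, mean))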

Now we can visualize how different positions vary across components. What does the plot below suggest?

nba_with_loadings %>%
  ggplot(aes(x = PC1, y = PC2, color = Pos)) +
  geom_point(data = df_summ, aes(x = pca1_avg, y = pca2_avg, color = Pos), size = 5) +
  theme_minimal() +
  stat_ellipse() +
  ggtitle("Plot of PCA 1 vs. 2 loadings for NBA players ",
          subtitle = "Large points are the centroids for each position") 

It seems that PC1 and PC2 don't separate players very well based on the position they play. We can see that the center position scores low on PC2, which (by our plots above) suggests centers attempt and make few 3 point shots and make many blocks and rebounds. The lack of separation across PC1 isn't too surprising given that PC1 seems to capture the best players (who may be found in any position). The other PCs involve trade-offs among variables and are more likely to separate by position. Let's see.

nba_with_loadings %>%
  ggplot(aes(x = PC2, y = PC3, color = Pos)) +
  geom_point(data = df_summ, aes(x = pca2_avg, y = pca3_avg, color = Pos), size = 5) +
  theme_minimal() +
  stat_ellipse() +
  ggtitle("Plot of PCA 2 vs. 3 loadings for NBA players ",
          subtitle = "Large points are the centroids for each position") 

Here we can see better separation, again with centers separating from the other positions the most. The ellipses are 95% data ellipses for each position. Centers score low on PC2 and PC3, which indicates they make few assists and steals but a lot of blocks and rebounds relative to other positions. Point guards score high on PC2 and PC3, which suggests they take and make a lot of 3 point shots, assists, and steals, but make few blocks and rebounds. Let's look at PC3 and PC4.

nba_with_loadings %>%
  ggplot(aes(x = PC3, y = PC4, color = Pos)) +
  geom_point(data = df_summ, aes(x = pca3_avg, y = pca4_avg, color = Pos), size = 5) +
  theme_minimal() +
  stat_ellipse() +
  ggtitle("Plot of PCA 1 vs. 3 loadings for NBA players ",
          subtitle = "Large points are the centroids for each position") 

We can visualize and analyze how each position varies along each PC.

nba_with_loadings %>% ggplot(aes(x = Pos, y = PC1, color = Pos)) + geom_point() + geom_boxplot()
fit <- lm(PC1 ~ Pos, data = nba_with_loadings)
summary(fit)
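For consistency with the ANOVAs below, the same comparison can be run with aov(), which gives an equivalent F-test:

fit <- aov(PC1 ~ Pos, data = nba_with_loadings)
summary(fit)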

No difference among positions on PC1. Not surprising, because the best players of each position are skilled across the board. How about the other PCs?

nba_with_loadings %>% ggplot(aes(x = Pos, y = PC2, color = Pos)) + geom_point() + geom_boxplot()
fit <- aov(PC2 ~ Pos, data = nba_with_loadings)
summary(fit)
TukeyHSD(fit)

Players in different positions vary significantly across PC2. A post-hoc Tukey test shows this is largely driven by differences between centers and the other positions. Players who score highly on PC2 attempt and make many 3 pointers and assists but have few blocks and rebounds. So we can see here that the center position is associated with few three pointers and many blocks and rebounds, while the other positions make few blocks and rebounds but take more 3 pointers. Total points loads strongly on PC1 but not on any other component. PC2 shows how players in different positions play differently but still score the same number of points (PC1 scores); centers, for example, score lots of points after rebounds. Let's look at PC3.

nba_with_loadings %>% ggplot(aes(x = Pos, y = PC3, color = Pos)) + geom_point() + geom_boxplot()
fit <- aov(PC3 ~ Pos, data = nba_with_loadings)
summary(fit)
TukeyHSD(fit)

Players in different positions also vary across PC3, with the pattern largely driven by point guards differing from the other positions. Players who score high on PC3 have relatively high steals and assists and low blocks and 3 pointers. And finally, PC4:

nba_with_loadings %>% ggplot(aes(x = Pos, y = PC4, color = Pos)) + geom_point() + geom_boxplot()
fit <- aov(PC4 ~ Pos, data = nba_with_loadings)
summary(fit)

No difference among positions in PC4.


