knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Correlation indicates the direction and strength of the relationship between two variables.
knitr::include_graphics("pictures/correl/1.png")
knitr::include_graphics("pictures/correl/2.png")
knitr::include_graphics("pictures/correl/3.png")
Plotting reminder:
ggplot(data, aes(X, Y, color/fill = categorical))
We can additionally control the range shown on the X and Y axes with coord_cartesian().
library(rio)
library(ggplot2)
cleanup <- theme(panel.grid.major = element_blank(),
                 panel.grid.minor = element_blank(),
                 panel.background = element_blank(),
                 axis.line.x = element_line(color = "black"),
                 axis.line.y = element_line(color = "black"),
                 legend.key = element_rect(fill = "white"),
                 text = element_text(size = 15))
exam <- import("data/exam_data.csv")
liar <- import("data/liar_data.csv")
#from chapter 5 notes
scatter <- ggplot(exam, aes(Anxiety, Exam))
scatter +
  geom_point() +
  xlab("Anxiety Score") +
  ylab("Exam Score") +
  cleanup
#from chapter 5 notes + coord_cartesian
scatter <- ggplot(exam, aes(Anxiety, Exam))
scatter +
  geom_point() +
  xlab("Anxiety Score") +
  ylab("Exam Score") +
  cleanup +
  coord_cartesian(xlim = c(50,100), ylim = c(0,100))
#just example numbers, you would want to use the real scale of the data
Remember, our all-in-one statistical equation:
We have previously defined our model as the Mean and the Standard Deviation or Standard Error.
Now, we can use:
$Outcome_i = \beta X_i + Error_i$
Correlation is a type of simple standardized regression, which is what this equation represents.
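To see why correlation counts as standardized regression, here is a minimal sketch on simulated data. The variables `revise` and `exam_score` are made up for illustration and are not the exam dataset. When both variables are converted to z-scores, the regression slope should match Pearson's r.

```r
# Simulated stand-in data (hypothetical; not the exam dataset)
set.seed(42)
revise <- rnorm(100, mean = 20, sd = 5)
exam_score <- 30 + 1.5 * revise + rnorm(100, sd = 10)

# Convert both variables to z-scores (mean 0, SD 1)
z_revise <- as.numeric(scale(revise))
z_exam <- as.numeric(scale(exam_score))

# Slope of the standardized regression
z_model <- lm(z_exam ~ z_revise)
beta <- unname(coef(z_model)["z_revise"])

# Pearson correlation of the raw scores
r <- cor(revise, exam_score)

# TRUE when the standardized slope and r agree
all.equal(beta, r)
```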
This relationship can be found by calculating the covariance.
$$SD^2 = \frac {\sum(X_i-\bar{X})^2}{N-1}$$
$$SD^2 = \frac {\sum(X_i-\bar{X})(X_i-\bar{X})}{N-1}$$
$$Cov(x,y) = \frac {\sum(X_i-\bar{X})(Y_i-\bar{Y})}{N-1} $$
var(exam$Revise)
var(exam$Exam)
cov(exam$Revise, exam$Exam)
plot(exam$Revise, exam$Exam)
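The formulas above can be checked by hand. This sketch uses simulated `x` and `y` (hypothetical values, not the exam data) and compares the hand calculation with `var()` and `cov()`. The last line shows the catch with covariance: it depends on the measurement scale, so rescaling one variable rescales the covariance by the same factor.

```r
# Simulated stand-in data (hypothetical)
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)
y <- 0.5 * x + rnorm(50)
n <- length(x)

# Variance: summed squared deviations over N - 1
var_by_hand <- sum((x - mean(x))^2) / (n - 1)

# Covariance: summed cross-products of deviations over N - 1
cov_by_hand <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

all.equal(var_by_hand, var(x))
all.equal(cov_by_hand, cov(x, y))

# Multiplying x by 10 multiplies the covariance by 10
cov(x * 10, y) / cov(x, y)
```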
One solution: standardize it!
The standardized version of covariance is known as the correlation coefficient.
$$r = \frac{Cov(x,y)}{S_xS_y}$$
$$r = \frac{\sum(X_i-\bar{X})(Y_i-\bar{Y})}{(N-1)S_xS_y} $$
It varies between -1 and +1
It is an effect size
Coefficient of determination, $r^2$
cor(exam$Revise, exam$Exam)
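A quick sketch of the standardization step, again on simulated data (the variables are hypothetical). Dividing the covariance by both standard deviations gives r, and unlike covariance, r does not change when a variable is rescaled.

```r
# Simulated stand-in data (hypothetical)
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)
y <- 0.5 * x + rnorm(50)

# r = Cov(x, y) / (S_x * S_y)
r_by_hand <- cov(x, y) / (sd(x) * sd(y))
all.equal(r_by_hand, cor(x, y))

# Rescaling x leaves r unchanged
all.equal(cor(x * 10, y), cor(x, y))

# Coefficient of determination: proportion of shared variance
r_squared <- r_by_hand^2
r_squared
```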
Correlation is an effect size and can be used for statistical testing. What should I check for data screening?
For correlation, we might consider using:
We can use many different forms of this type of set up:
The main "rule" is that the two hypotheses are opposites and cover the entire range of possibilities. For example, here's an incompatible set up:
cor() was used during data screening. The cor() function will calculate:
cor(exam[ , -1], use="pairwise.complete.obs", method = "pearson")
cor(exam[ , -1], use="pairwise.complete.obs", method = "kendall")
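The `use` argument decides how missing values are handled. The sketch below uses a made-up data frame, `scores`, with a few values set to missing. It suggests that `"pairwise.complete.obs"` keeps every row that is complete for the two variables being correlated, while `"complete.obs"` drops any row with a missing value anywhere.

```r
# Hypothetical data frame with some missing values
set.seed(5)
scores <- data.frame(a = rnorm(20), b = rnorm(20), c = rnorm(20))
scores$a[1:3] <- NA
scores$c[18:20] <- NA

# Default: any pair touching a missing value comes back NA
cor(scores)

pairwise <- cor(scores, use = "pairwise.complete.obs")
listwise <- cor(scores, use = "complete.obs")

# a-b pairwise uses rows 4 to 20 (both a and b present)
all.equal(pairwise["a", "b"], cor(scores$a[4:20], scores$b[4:20]))

# a-b listwise uses rows 4 to 17 (a, b, and c all present)
all.equal(listwise["a", "b"], cor(scores$a[4:17], scores$b[4:17]))
```

With pairwise deletion, each correlation in the table can rest on a different N, which is worth reporting.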
rcorr() function will calculate:
library(Hmisc) rcorr(as.matrix(exam[ , -1]), type = "pearson")
cor.test() function will calculate:
cor.test(exam$Revise, exam$Exam, method = "pearson")
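For reference, the Pearson test can be reproduced by hand. This is a sketch on simulated data (the variables are hypothetical): r is converted to a t statistic with N - 2 degrees of freedom, and the confidence interval comes from the Fisher z transformation.

```r
# Simulated stand-in data (hypothetical; not the exam dataset)
set.seed(2024)
n <- 60
revise <- rnorm(n, mean = 20, sd = 5)
exam_score <- 40 + 1.2 * revise + rnorm(n, sd = 12)

r <- cor(revise, exam_score)
df <- n - 2

# t = r * sqrt(N - 2) / sqrt(1 - r^2), two-tailed p value
t_by_hand <- r * sqrt(df) / sqrt(1 - r^2)
p_by_hand <- 2 * pt(-abs(t_by_hand), df)

# 95% CI: transform r to z, build the interval, transform back
z <- atanh(r)
ci_by_hand <- tanh(z + c(-1, 1) * qnorm(0.975) / sqrt(n - 3))

# Compare with cor.test()
result <- cor.test(revise, exam_score, method = "pearson")
all.equal(t_by_hand, unname(result$statistic))
all.equal(p_by_hand, result$p.value)
all.equal(ci_by_hand, as.numeric(result$conf.int))
```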
The third-variable problem:
Direction of causality:
Dataset consisting of the World's Best Liar Competition:
Measures
str(liar)
Because this dataset has true ordinal data (first, second, etc.), the non-parametric statistics are more appropriate.
Spearman
with(liar, cor.test(Creativity, Position, method = "spearman"))
with(liar, cor.test(Creativity, Position, method = "kendall"))
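One way to think about Spearman's rho: it is Pearson's r computed on the ranks of the data. A small sketch with simulated placings and scores (hypothetical values, not the liar dataset):

```r
# Simulated ordinal placings with ties, plus a score (hypothetical)
set.seed(9)
position <- sample(1:6, 40, replace = TRUE)
creativity <- round(50 - 3 * position + rnorm(40, sd = 8))

# Spearman's rho from cor()
rho_builtin <- cor(creativity, position, method = "spearman")

# Pearson's r on the ranked data (ties get the average rank)
rho_by_rank <- cor(rank(creativity), rank(position), method = "pearson")

all.equal(rho_builtin, rho_by_rank)
```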
as.numeric().
What type of correlation should we use with binary predictors?
Which is which?
liar$Novice2 <- as.numeric(as.factor(liar$Novice))
str(liar) #we had to factor because of the character variable
with(liar, cor.test(Creativity, Novice2))
plot(liar$Creativity, liar$Novice2)
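A sketch of what is going on with a binary predictor, using simulated groups (the scores are hypothetical; the group labels are borrowed from the liar data). `as.factor()` orders the levels alphabetically, so `levels()` answers "which is which". The Pearson correlation with the numeric code, the point-biserial correlation, gives the same test as an equal-variance independent t test.

```r
# Simulated two-group data (hypothetical scores)
set.seed(7)
novice <- rep(c("First Time", "Had entered Competition Before"), each = 30)
creativity <- c(rnorm(30, mean = 40, sd = 8), rnorm(30, mean = 45, sd = 8))

# Levels are alphabetical: the first level is coded 1, the second 2
novice_factor <- as.factor(novice)
levels(novice_factor)
novice_num <- as.numeric(novice_factor)

# Point-biserial correlation and the matching t test
pb <- cor.test(creativity, novice_num)
tt <- t.test(creativity ~ novice_factor, var.equal = TRUE)

# Same p value; the t statistics match apart from sign
all.equal(pb$p.value, tt$p.value)
all.equal(abs(unname(pb$statistic)), abs(unname(tt$statistic)))
```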
How can I tell if these two correlation coefficients are significantly different?
First, you have to decide if the correlations are independent or dependent
list format. Subset the data, then create a list.
library(cocor)
new <- subset(liar, Novice == "First Time")
old <- subset(liar, Novice == "Had entered Competition Before")
ind_data <- list(new, old)
cocor(~Creativity + Position | Creativity + Position, data = ind_data)
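For intuition, here is a by-hand sketch of Fisher's z test, a standard way to compare two independent correlations. It uses simulated groups (all values hypothetical) and base R only, so it illustrates the idea rather than replacing the cocor output.

```r
# Two simulated independent groups (hypothetical)
set.seed(11)
n1 <- 40
n2 <- 45
x1 <- rnorm(n1)
y1 <- 0.6 * x1 + rnorm(n1)
x2 <- rnorm(n2)
y2 <- 0.1 * x2 + rnorm(n2)

r1 <- cor(x1, y1)
r2 <- cor(x2, y2)

# Fisher transform each r, then divide the difference by its standard error
z_diff <- (atanh(r1) - atanh(r2)) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))

# Two-tailed p value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z_diff))

c(r1 = r1, r2 = r2, z = z_diff, p = p_value)
```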
- cocor(~ X + Y | Y + Z, data = data): overlapping correlation
- X + Y | Q + Z: non-overlapping correlation
cocor(~Revise + Exam | Revise + Anxiety, data = exam)
Partial correlation:
Semi-partial correlation:
knitr::include_graphics("pictures/correl/9.png")
library(ppcor)
pcor(exam[ , -c(1)], method = "pearson")
spcor(exam[ , -c(1)], method = "pearson")
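Both statistics can be sketched with residuals in base R. The data here are simulated (hypothetical variables named after the exam measures). The partial correlation removes the third variable from both variables; the semi-partial removes it from only one.

```r
# Simulated stand-in data (hypothetical; not the exam dataset)
set.seed(3)
n <- 200
anxiety <- rnorm(n)
revise <- -0.6 * anxiety + rnorm(n)
exam_score <- 0.5 * revise - 0.3 * anxiety + rnorm(n)

# Residuals after removing anxiety from each variable
res_exam <- resid(lm(exam_score ~ anxiety))
res_revise <- resid(lm(revise ~ anxiety))

# Partial: anxiety removed from both exam_score and revise
partial_r <- cor(res_exam, res_revise)

# Semi-partial: anxiety removed from revise only
semi_r <- cor(exam_score, res_revise)

# The same values from the zero-order correlations
r_xy <- cor(revise, exam_score)
r_xz <- cor(revise, anxiety)
r_yz <- cor(exam_score, anxiety)
partial_formula <- (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
semi_formula <- (r_xy - r_xz * r_yz) / sqrt(1 - r_xz^2)

all.equal(partial_r, partial_formula)
all.equal(semi_r, semi_formula)
```

The semi-partial correlation is never larger in absolute value than the partial correlation, because its denominator leaves the outcome's variance intact.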
What have we learned?