knitr::opts_chunk$set(comment=NA, fig.width=6, fig.height=6, echo = TRUE, eval = TRUE)
knitr::include_graphics("https://image.slidesharecdn.com/8-150928160000-lva1-app6892/95/8-testing-of-hypothesis-for-variable-amp-attribute-data-4-638.jpg?cb=1443456555")
In this practical you'll conduct hypothesis tests. By the end of this practical you will know how to:
Here are the main descriptive statistics functions we will be covering.
| Function| Description|
|:------|:--------|
| table() | Frequency table |
| mean(), median() | Measures of central tendency (note that base R's mode() returns an object's storage mode, not its most frequent value) |
| sd(), range(), IQR(), var() | Measures of variability |
| max(), min() | Extreme values |
| summary() | Several summary statistics |
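As a quick illustration of the descriptive functions, here is a minimal sketch on the built-in ChickWeight data. The helper stat_mode() is hypothetical (it is not part of base R) and is only included because base R has no function for the statistical mode; note also that the interquartile range function is spelled IQR() in upper case.

```r
# Descriptive statistics on the built-in ChickWeight data
weights <- ChickWeight$weight

mean(weights)      # arithmetic mean
median(weights)    # median
sd(weights)        # standard deviation
var(weights)       # variance (the square of sd)
range(weights)     # smallest and largest value
IQR(weights)       # interquartile range (upper-case IQR)
summary(weights)   # several summary statistics at once

# Hypothetical helper (not in base R): the most frequent value(s) of x
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]
}

stat_mode(ChickWeight$Diet)   # the diet with the most observations
mode(weights)                 # storage mode of the vector, not the statistical mode
```

The helper returns the value as a character string because it reads the names of the frequency table.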
Here are the main hypothesis test functions we will be covering.
| Function| Hypothesis Test| Additional Help |
|:------|:--------|:----|
| t.test() | One and two sample t-test | https://bookdown.org/ndphillips/YaRrr/htests.html#t-test-t.test |
| cor.test() | Correlation test | https://bookdown.org/ndphillips/YaRrr/htests.html#correlation-cor.test |
| chisq.test() | Chi-Square test | https://bookdown.org/ndphillips/YaRrr/htests.html#chi-square-chsq.test |
| aov(), TukeyHSD() | ANOVA and post-hoc test | https://bookdown.org/ndphillips/YaRrr/anova.html#full-factorial-between-subjects-anova |
# -----------------------------------------------
# Examples of hypothesis tests on the ChickWeight data
# ------------------------------------------------

library(tidyverse)

chick <- as_tibble(ChickWeight)   # Save a copy of the ChickWeight data as a tibble called chick

# -----
# Descriptive statistics
# -----

mean(chick$weight)     # What is the mean weight?
median(chick$Time)     # What is the median time?
max(chick$weight)      # What is the maximum weight?
table(chick$Diet)      # How many observations for each diet?

# -----
# 1-sample hypothesis test
# -----

# Q: Is the mean weight of chickens different from 110?

htest_A <- t.test(x = chick$weight,            # The data
                  alternative = "two.sided",   # Two-sided test
                  mu = 110)                    # The null hypothesis

htest_A             # Print result
names(htest_A)      # See all attributes in object
htest_A$statistic   # Get just the test statistic
htest_A$p.value     # Get the p-value
htest_A$conf.int    # Get a confidence interval

# -----
# 2-sample hypothesis test
# -----

# Q: Is there a difference in weights from Diet 1 and Diet 2?

htest_B <- t.test(formula = weight ~ Diet,       # DV ~ IV
                  alternative = "two.sided",     # Two-sided test
                  data = chick,                  # The data
                  subset = Diet %in% c(1, 2))    # Compare Diet 1 and Diet 2

htest_B   # Print result

# -----
# Correlation test
# ------

# Q: Is there a correlation between Time and weight?

htest_C <- cor.test(formula = ~ weight + Time,
                    data = chick)

htest_C

# A: Yes. r = 0.84, t(576) = 36.7, p < .001

# Q: Does the result hold when ONLY considering Diets 1 and 2?

htest_D <- cor.test(formula = ~ weight + Time,
                    data = chick,
                    subset = Diet %in% c(1, 2))   # Only take data where Diet is 1 or 2

htest_D

# A: Yes. r = 0.81, t(339) = 25.08, p < .001

# -----
# Chi-Square test
# ------

# Q: Are there more observations from chicks on one diet versus another?

htest_E <- chisq.test(x = table(chick$Diet))   # Input is a table of values

htest_E

# A: Yes, some diets are observed more than others. X2(3) = 52.6, p < .001

# -----
# ANOVA
# -----

# Q: Is there an overall effect of diet on weight?

Diet_aov <- aov(formula = weight ~ factor(Diet),   # Run the anova
                data = chick)

summary(Diet_aov)    # Look at summary for overall test results
TukeyHSD(Diet_aov)   # Conduct post-hoc tests

# A: Yes, there is an overall effect of diet on weight, F(3, 574) = 10.81, p < .001
#    Furthermore, we find significant differences between diets 1-3, and diets 1-4 at the 0.05 level.
A. For this practical, we'll use the ACTG175 dataframe from the speff2trial package. Load the package with the library() function. Also load the tidyverse, as always!
library(tidyverse)
library(speff2trial)
B. Convert the data to a tibble. (Hint: use assignment and as_tibble().)
ACTG175 <- as_tibble(ACTG175)
C. First things first: take a look at the data by printing it. It should look like this:
ACTG175
D. What was the mean age of all patients?
E. What was the median weight of all patients?
F. What was the mean CD4 T cell count at baseline? What was it at 20 weeks?
G. How many patients have a history of intravenous drug use and how many do not? (Hint: use table().)
t.test(x = ACTG175$age, alternative = "two.sided", mu = 40)
t.test(x = ACTG175$age, alternative = "two.sided", mu = 35)
A researcher wants to make sure that men and women in the clinical study are similar in terms of age. Conduct a two-sample t-test comparing the age of men versus women to test if they are indeed similar or not.
Women are coded as 0 in gender, and men are coded as 1.
formula = age ~ gender
t.test(formula = age ~ gender, data = ACTG175, alternative = "two.sided")
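To see what a two-sample t-test with a formula returns, here is a minimal sketch on simulated data, so it runs without the speff2trial package. The data frame toy and its values are made up for illustration; only the column names mimic ACTG175.

```r
set.seed(1)  # make the simulated data reproducible

# Made-up data: a 0/1 group code and an age-like outcome
toy <- data.frame(
  gender = rep(c(0, 1), times = c(40, 60)),
  age    = c(rnorm(40, mean = 35, sd = 8), rnorm(60, mean = 36, sd = 8))
)

# Two-sample t-test; by default R runs the Welch version (var.equal = FALSE)
age_test <- t.test(formula = age ~ gender, data = toy, alternative = "two.sided")

age_test$estimate   # mean age in each group
age_test$conf.int   # confidence interval for the difference in means
age_test$p.value    # p-value for the null hypothesis of equal means
age_test$method     # which variant of the test was run
```

A large p-value here would be consistent with the two groups having similar ages, although it does not prove that they are identical.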
(days) between those with a history of intravenous drug use (drugs) and those without a history of intravenous drug use
t.test(formula = days ~ drugs, data = ACTG175)
(wtkg) and age (age). What is your conclusion?
cor.test(formula = ~ age + wtkg, data = ACTG175)
(cd40) and at 20 weeks (cd420). But how strong is the correlation? Answer this question by conducting a correlation test between CD4 T cell count at baseline (cd40) and CD4 T cell count at 20 weeks (cd420).
cor.test(formula = ~ cd40 + cd420, data = ACTG175)
(cd40) and the number of days until the first occurrence of a major negative event (days)?
cor.test(formula = ~ cd40 + days, data = ACTG175)
Only considering men, is there a correlation between CD4 T cell count at baseline (cd40) and CD8 T cell count at baseline (cd80)?
Include the argument subset = gender == 1 to restrict the analysis to men (men are coded as 1, women as 0).
cor.test(formula = ~ cd40 + cd80, data = ACTG175, subset = gender == 1)
cor.test(formula = ~ cd40 + cd80, data = ACTG175, subset = gender == 0)
Do men and women (gender) have different distributions of race (race)? That is, does the percentage of women who are white differ from the percentage of men who are white?
Be sure to create a table of gender and race values with table(ACTG175$gender, ACTG175$race)
chisq.test(table(ACTG175$gender, ACTG175$race))
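Because the question is phrased in terms of percentages, it can help to look at row proportions alongside the test. The following is a minimal sketch on simulated data; the data frame toy and its 0/1 codes are made up and only mimic the ACTG175 column names.

```r
set.seed(2)  # make the simulated data reproducible

# Made-up categorical data with 0/1 codes
toy <- data.frame(
  gender = sample(c(0, 1), size = 200, replace = TRUE),
  race   = sample(c(0, 1), size = 200, replace = TRUE)
)

# Two-way frequency table: rows are gender, columns are race
counts <- table(gender = toy$gender, race = toy$race)
counts

# Row proportions: within each gender, the share of each race code
prop.table(counts, margin = 1)

# Chi-square test of independence
# (for 2 x 2 tables, chisq.test() applies Yates' continuity correction by default)
gr_test <- chisq.test(counts)
gr_test$expected   # expected counts if gender and race were independent
gr_test$p.value
```

Comparing the observed counts with gr_test$expected shows where, if anywhere, the table departs from independence.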
(drugs) and hemophilia (hemo)?
chisq.test(table(ACTG175$hemo, ACTG175$drugs))
(homo) and gender (gender)
chisq.test(table(ACTG175$homo, ACTG175$gender))
Only for patients older than 40, is there a relationship between antiretroviral history (str2) and race (race)?
Create a new dataframe with ACTG175.o40 <- subset(ACTG175, age > 40) and then do your analysis on this new dataframe.
ACTG175.o40 <- subset(ACTG175, age > 40)
chisq.test(table(ACTG175.o40$str2, ACTG175.o40$race))
Now repeat the previous analysis, but only for male patients.
Create a new dataframe with ACTG175.male <- subset(ACTG175, gender == 1) and then do your analysis on this new dataframe (men are coded as 1 in gender).
ACTG175.male <- subset(ACTG175, gender == 1)
chisq.test(table(ACTG175.male$str2, ACTG175.male$race))
(arms) on CD8 T cell count at 20 weeks (cd820). If there is a significant effect, conduct post-hoc tests to see which treatment arms differed.
arms_cd820_aov <- aov(formula = cd820 ~ factor(arms), data = ACTG175)
summary(arms_cd820_aov)
TukeyHSD(arms_cd820_aov)
(arms) on weight (wtkg). If the effect is significant, conduct post-hoc tests.
arms_weight_aov <- aov(formula = wtkg ~ factor(arms), data = ACTG175)
summary(arms_weight_aov)
TukeyHSD(arms_weight_aov)
(arms) on the number of days until the occurrence of a major negative event (days). Answer this by conducting the appropriate ANOVA (with post-hoc tests if necessary).
arms_days_aov <- aov(formula = days ~ factor(arms), data = ACTG175)
summary(arms_days_aov)
TukeyHSD(arms_days_aov)
Does the previous result hold if you only consider patients with a history of intravenous drug use (drugs)? Answer this by conducting the same ANOVA only on these patients.
Create a new dataframe with ACTG175_drugs <- subset(ACTG175, drugs == 1) and run your analysis on this dataframe.
ACTG175_drugs <- subset(ACTG175, drugs == 1)
arms_days_drugs_aov <- aov(formula = days ~ factor(arms), data = ACTG175_drugs)
summary(arms_days_drugs_aov)
TukeyHSD(arms_days_drugs_aov)
?distributions. For example, to generate samples from the well-known Normal distribution, you can use rnorm(). Look at the help menu for rnorm() to see its arguments.
?rnorm
rnorm(), create a new object samp_10 which is 10 samples from a Normal distribution with mean 10 and standard deviation 5. Print the object to see what the elements look like. What should the mean and standard deviation of this sample be? Test it by evaluating its mean and standard deviation directly using the appropriate functions. Then, do a one-sample t-test on this sample against the null hypothesis that the true mean is 12. What are the results?
samp_10 <- rnorm(n = 10, mean = 10, sd = 5)
t.test(x = samp_10, mu = 12)
samp_10 and the new mean, standard deviation, and t-test result. Why are the new results different?
samp_10 <- rnorm(n = 10, mean = 10, sd = 5)
t.test(x = samp_10, mu = 12)
samp_1000 which is 1,000 samples from a Normal distribution (again with mean 10 and standard deviation 5). Print this object to see what it looks like. What should the mean and standard deviation of this sample be? Do the same hypothesis test as you did in the previous question. What is your new p-value?
samp_1000 <- rnorm(n = 1000, mean = 10, sd = 5)
t.test(x = samp_1000, mu = 12)
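The sampling exercises turn on two ideas: random draws change from run to run unless the random seed is fixed, and larger samples pin down the mean more precisely. The sketch below illustrates both; the seed value 123 is an arbitrary choice, and set.seed() is not used anywhere in the exercises themselves.

```r
# Fixing the seed makes "random" draws repeatable
set.seed(123)
samp_a <- rnorm(n = 10, mean = 10, sd = 5)
set.seed(123)
samp_b <- rnorm(n = 10, mean = 10, sd = 5)
identical(samp_a, samp_b)   # TRUE: same seed, same sample

# Without resetting the seed, a new call gives a different sample
samp_c <- rnorm(n = 10, mean = 10, sd = 5)
identical(samp_a, samp_c)

# Larger samples give narrower confidence intervals around the sample mean
set.seed(123)
small <- rnorm(n = 10,   mean = 10, sd = 5)
large <- rnorm(n = 1000, mean = 10, sd = 5)

ci_small <- t.test(x = small, mu = 12)$conf.int
ci_large <- t.test(x = large, mu = 12)$conf.int

diff(ci_small)   # width of the interval with 10 observations
diff(ci_large)   # width of the interval with 1,000 observations
```

With a narrower interval, a true mean of 10 is much easier to tell apart from the null value of 12, which is why the p-value tends to shrink as the sample grows.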
Conduct a two-way ANOVA testing the effects of both hemophilia (hemo) and drug use (drugs) on the number of days until a major negative event.
To include multiple factors in an ANOVA, just include both in the formula, such as: formula = dv ~ factor(x) + factor(y) + .... See https://bookdown.org/ndphillips/YaRrr/anova.html#ex-two-way-anova for an example.
hemo_drugs_days_aov <- aov(formula = days ~ factor(hemo) + factor(drugs), data = ACTG175)
summary(hemo_drugs_days_aov)
Repeat the previous ANOVA, but now test if there is an interaction between hemophilia and drugs on the number of days until a major negative event.
To include interactions in an ANOVA, just include both in the formula using the * operator: formula = dv ~ factor(x) * factor(y). See https://bookdown.org/ndphillips/YaRrr/anova.html#ex-two-way-anova for an example.
hemo_drugs_days_aov <- aov(formula = days ~ factor(hemo) * factor(drugs), data = ACTG175)
summary(hemo_drugs_days_aov)
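To make the difference between + and * in a formula concrete, here is a minimal sketch using the built-in ToothGrowth data (tooth length by supplement type and dose). The dataset and object names are chosen only for illustration and are not part of the exercises.

```r
# Built-in data: tooth length (len) by supplement type (supp) and dose
tg <- ToothGrowth

# Main effects only
aov_main <- aov(formula = len ~ factor(supp) + factor(dose), data = tg)

# Main effects plus their interaction
aov_int <- aov(formula = len ~ factor(supp) * factor(dose), data = tg)

# a * b is shorthand for a + b + a:b, so this is the same model written out
aov_long <- aov(formula = len ~ factor(supp) + factor(dose) + factor(supp):factor(dose),
                data = tg)

summary(aov_main)   # rows for supp, dose, and residuals
summary(aov_int)    # adds a row for the supp-by-dose interaction

# Both ways of writing the interaction model give the same fitted values
all.equal(fitted(aov_int), fitted(aov_long))
```

The interaction row in the second summary tests whether the effect of one factor depends on the level of the other.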
t.test(formula = cd40 ~ race, data = ACTG175, alternative = "two.sided")
t.test(formula = days ~ arms, data = ACTG175, subset = arms %in% c(0, 3))
cor.test(formula = ~ cd40 + days, data = subset(ACTG175, race == 0))
cor.test(formula = ~ cd40 + days, data = subset(ACTG175, race == 1))
chisq.test(table(ACTG175$gender, ACTG175$arms))
chisq.test(table(ACTG175$race, ACTG175$arms))
chisq.test(table(ACTG175$drugs, ACTG175$arms))
For more details on hypothesis tests in R, check out the chapter on hypothesis tests in YaRrr! The Pirate's Guide to R (https://bookdown.org/ndphillips/YaRrr/htests.html).
For more advanced mixed-level ANOVAs with random effects, consult the afex and lme4 packages (lme4 provides the lmer() function).
To do Bayesian versions of common hypothesis tests, try using the BayesFactor package.