knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(URStat218)
All notes in the vignettes of this package are adapted from Dr. Joseph Ciminelli's fall 2019 section of Statistics 218. For a more complete version of the material herein contained, please reference the notes, attend the lectures, or ask Dr. Ciminelli himself.
We begin by taking a contingency table on X and Y, where the rows are fixed:
|X (Explanatory)|Y(Response) | | | |:--: |:--: |:--: |:--: | | |1 |2 | Total | |1 | $n_{11}$ | $n_{12}$ | $n_{1+}$ | |2 | $n_{21}$ | $n_{22}$ | $n_{2+}$ |
For this 2x2 table, we have two independent binomial proportions, one for each level of X:
$$n_{11} \sim \text{Binom}(n_{1+},p_{1|1})$$
$$n_{21} \sim \text{Binom}(n_{2+},p_{1|2})$$
Ultimately, we wish to test the significance of the difference in the two groups' proportions. If there is a significant difference, we conclude X and Y to be significantly associated, while the opposite leads to a correspondingly opposite conclusion.
As we are testing for independence, with unknown probabilities $p_{1|1}$ and $p_{1|2}$, we test against a null hypothesis of $H_0: p_{1|1} =p_{1|2}$; that is, that the proportions are independent and equal. Of the methods available to us, we note that R has built-in support for a difference-of-proportions test.
An equivalent expression for our null hypothesis is $H_0:\frac{p_{1|1}}{p_{1|2}}=1$. This quantity is known as the relative risk. In this form, the relative risk is describing a population based on population parameters. In order to estimate the population relative risk, we calculate sample relative risk as: $r=\frac{{\hat{p}{1|1}}}{\hat{p}{1|2}}$. In order to preserve the distribution of the population, we take the log (that is, the natural log) of the relative risk:
$$\text{log}\left(\frac{p_{1|1}}{p_{1|2}}\right)=\text{log}({p_{1|1}})-\text{log}({p_{1|2}})$$
and
$$\text{log}\left(\frac{\hat{p}{1|1}}{\hat{p}{1|2}}\right)=\text{log}(\hat{p}{1|1})-\text{log}(\hat{p}{1|2})$$
We note here that our estimator $\text{log}(r)$ is actually a biased estimator, with a better estimator for relative risk existing as
$$\text{log}(\tilde{r})=\text{log}\left(\frac{n_{11}+1/2}{n_{1+}+1/2}\right)-\text{log}\left(\frac{n_{21}+1/2}{n_{2+}+1/2}\right)$$
For the purposes of this course, and therefore this package, however, $\text{log}(r)$ proves to be sufficient.
In order to actually test $H_0$, we create a confidence interval on our \text{log} scale. To do this, we take the standard error of $\text{log}(r),$ assuming the sample size is sufficiently large, to be:
$$\hat{\sigma}(\text{log}(r))=\sqrt{\frac{n_{12}}{n_{11}n_{1+}}+\frac{n_{22}}{n_{21}n_{2+}}}$$
Again assuming a large enough sample size, we approximate the distribution of $\text{log}(r)$ to be normal with the $(1-\alpha)\times 100\%$ confidence interval to be:
$$\text{log}(r) \pm z_{\alpha/2}\sqrt{\frac{n_{12}}{n_{11}n_{1+}}+\frac{n_{22}}{n_{21}n_{2+}}}$$
where $z_{\alpha/2}$ is the normal z-score at $\frac{\alpha}{2}.$
We can then back-transform this interval to put it back on our original scale. We do this by simply exponentiating both cases, finding the interval to be:
$$\left(e^{\text{log}(r) -z_{\alpha/2}\sqrt{\frac{n_{12}}{n_{11}n_{1+}}+\frac{n_{22}}{n_{21}n_{2+}}}}, e^{\text{log}(r) +z_{\alpha/2}\sqrt{\frac{n_{12}}{n_{11}n_{1+}}+\frac{n_{22}}{n_{21}n_{2+}}}}\right)$$
If 1 is included in this interval, we fail to reject $H_0.$ If 1 is not included, we reject $H_0,$ and conclude that there is association between our two variables.
Consider the table
|Party (X)|Universal Health Care (Y) | | | |:--: |:--: |:--: |:--: | | |Support |Oppose | Total | |Democrat | 491 | 813 | 1304 | |Republican | 740 | 558 | 1298 |
Our $H_0$ for this test is that a person's views on universal health care are independent of that person's political party, with $H_1$ being a statement of association between the two. In notation, these would be:
$$H_0: \frac{\hat{p}{1|1}}{\hat{p}{1|2}}=1$$
and
$$H_1: \frac{\hat{p}{1|1}}{\hat{p}{1|2}} \neq 1.$$
We calculate $\hat{p}{1|1}$ and $\hat{p}{1|2}$ to be 0.377 and 0.57, respectively, using our definitions from above. We then calculate r as their quotient, finding it to be 0.661, with $(r)=-.414.$ Testing at the $\alpha=0.05$ significance level, we use the standard $z_{\alpha/2}=1.96$ to set up our interval: $$-0.414 \pm 1.96\sqrt{\frac{813}{491\cdot1304}+\frac{558}{740\cdot1298}}$$ Recall, however, that this interval is a log scale and that we must exponentiate both bounds to get it on our desired scale. While we skip the algebra, we evaluate our bounds, on a linear scale, to be approximately (0.608, 0.719).
We conclude, therefore, that we can be 95\% confident that the true population relative risk lies between 0.608 and 0.719. As 1 is not within this interval, we reject $H_0$ and conclude that one's opinion of universal health care is associated with their political party. Further, we are 95\% confident that Republicans are between 0.608 and 0.719 times as likely to support unviersal health care as Democrats.
The function relrisk
performs this calculation for us. At its default, relrisk
performs this test at a 95\% confidence interval, conf.level, on some 2x2 table, tab, defined by the user. It begins by assigning each cell of the table to a variable (n11, n12, and so on) and storing the marginal totals (n11+n12 and n21+n22) as n1 and n2, respectively. The sample proportions, $p_{1|1}$ and $p_{1|2}$ are then calculated and stored as p1 and p2, respectively. From these, r is calculated, along with $\alpha$ and z. The log-odds scale bounds are then calculated by simply plugging the stored values into the equation as given above. Original-scale bounds are then caculated by exponentiating each of the log-odds scale bounds. These are then returned to 5 decimal places, along with the confidence level tested at.
In order to use the relrisk
function, we first need to define a table. Note that this table could be left undefined, and be simply passed through the function itself; for the sake of readability, we will define it as a unique variable. This is done simply with
tab <- matrix(c(491,740,813,558), nrow=2)
We then proceed with calculating the interval using relrisk
at the default confidence level:
relrisk(tab)
As the numeric values are identical to those in the non-R example, we draw the same conclusions.
\medskip
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.