The $t$-statistic is a standardized measure of the magnitude of difference between a sample's mean and some known, non-random constant. It is similar to a $z$-statistic, but differs in that a $t$-statistic may be calculated without knowledge of the population variance.
Let $\theta$ be a statistic estimated from a sample of size $n$ with standard deviation $s$. Let $\theta_0$ be a constant, and let $s_\theta = s/\sqrt{n}$ be the standard error of $\theta$. Then $t$ is defined as: $$t = \frac{\theta - \theta_0}{s_\theta} = \frac{\theta - \theta_0}{\frac{s}{\sqrt{n}}}$$
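As a quick illustration of the definition, the following base R sketch computes $t$ for a sample mean against a constant. The data vector and the value 55 are made up for the example, and `t_stat` is a hypothetical helper, not a function from this package.

```r
# Hypothetical helper: t-statistic for a sample mean against a constant theta0
t_stat <- function(x, theta0) {
  n <- length(x)
  s <- sd(x)                          # sample standard deviation
  (mean(x) - theta0) / (s / sqrt(n))  # (theta - theta0) / standard error
}

x <- c(52, 61, 48, 70, 66, 59)        # made-up protein intakes (grams)
t_manual <- t_stat(x, theta0 = 55)

# The result should agree with the statistic reported by stats::t.test()
t_builtin <- unname(t.test(x, mu = 55)$statistic)
all.equal(t_manual, t_builtin)
```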
The confidence interval for $\theta$ is written: $$\theta \pm t_{1-\alpha/2} \cdot \frac{s}{\sqrt{n}}$$
The value of the expression on the right is often referred to as the margin of error, and we will refer to this value as $$E = t_{1-\alpha/2} \cdot \frac{s}{\sqrt{n}}$$
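A minimal sketch of the margin of error in base R, assuming the critical value is taken from a $t$ distribution with $n-1$ degrees of freedom. The helper name `margin_of_error` and the sample mean of 60 are hypothetical and used only for illustration.

```r
# Hypothetical helper: two-sided margin of error for a t-interval
margin_of_error <- function(n, s, alpha = 0.05) {
  qt(1 - alpha / 2, df = n - 1) * s / sqrt(n)
}

E <- margin_of_error(n = 25, s = 4, alpha = 0.05)
E

# Confidence interval around a made-up sample mean of 60
theta <- 60
c(lower = theta - E, upper = theta + E)
```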
I'm not sure what I had intended to write here. Probably ideas about when to estimate $n$, and when to estimate other parameters. But that may be a discussion for a separate vignette, as it applies to almost every function in this package.
The following example is taken from the Wiley text, page 188:
A health department nutritionist, wishing to conduct a survey among a population of teenage girls to determine their average daily protein intake (measured in grams), is seeking the advice of a biostatistician relative to the sample size that should be taken.
Let us assume that the nutritionist would like an interval about 10 grams wide; that is, the estimate should be within about 5 grams of the population mean in either direction. In other words, a margin of error of 5 grams is desired. Let us also assume that a confidence coefficient of .95 is decided on and that, from past experience, the nutritionist feels that the population standard deviation is probably about 20 grams.
Note: the results in our example will differ slightly from the results in the text as we are using the t-interval instead of the z-interval.
An immediate solution to the request can be provided using
library(StudyPlanning)
interval_t1(E=5, s=20, alpha=.05)
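For readers without the package at hand, the same question can be sketched in base R by stepping $n$ upward until the $t$-based margin of error drops to the target. The helper `n_for_margin` is hypothetical, and this simple search is an assumption for illustration rather than a description of how `interval_t1` works internally; its output format may differ.

```r
# Hypothetical helper: smallest integer n whose t-based margin of error is at most E
n_for_margin <- function(E, s, alpha = 0.05, n_max = 1e6) {
  n <- 2                                  # at least 2 observations so that df >= 1
  while (qt(1 - alpha / 2, df = n - 1) * s / sqrt(n) > E && n < n_max) {
    n <- n + 1
  }
  n
}

n_req <- n_for_margin(E = 5, s = 20, alpha = 0.05)
n_req

# z-based sample size for comparison (the approach used in the text)
n_z <- ceiling(qnorm(1 - 0.05 / 2)^2 * 20^2 / 5^2)
n_z
```

The search relies on the margin of error shrinking as $n$ grows, which holds here because both the critical value and $s/\sqrt{n}$ decrease with $n$.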
Suppose, however, that we aren't sure that the standard deviation is 20, but think it could be anywhere from 10 to 25. In this case, we may wish to explore the effect the changing standard deviation will have on the sample size. We'll also explore the 90% and 99% confidence intervals.
library(StudyPlanning)
library(ggplot2)
SampSize <- interval_t1(E=5, s=seq(10, 25, by=1), alpha=c(.10, .05, .01))
ggplot(SampSize, aes(x=s, y=n, colour=factor(alpha))) +
  geom_line()
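The required sample size grows faster with the standard deviation at higher confidence levels: a larger standard deviation has less of an impact on the sample size for the 90% confidence interval than it does for the 99% confidence interval.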
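To put rough numbers on this comparison, the sketch below evaluates the required sample size at the two ends of the standard deviation range for the 90% and 99% levels. It uses only base R; `n_for_margin` is a hypothetical helper written for this example, not part of the package.

```r
# Hypothetical helper: smallest integer n whose t-based margin of error is at most E
n_for_margin <- function(E, s, alpha) {
  n <- 2
  while (qt(1 - alpha / 2, df = n - 1) * s / sqrt(n) > E) n <- n + 1
  n
}

# Lowest and highest standard deviation, crossed with two alpha levels
grid <- expand.grid(s = c(10, 25), alpha = c(0.10, 0.01))
grid$n <- mapply(n_for_margin, E = 5, s = grid$s, alpha = grid$alpha)
grid

# Increase in required n as s goes from 10 to 25, for each alpha
growth <- tapply(grid$n, grid$alpha, function(n) diff(range(n)))
growth
```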
In this instance, let's consider the possibility that the nutritionist can only afford to sample 50 girls in the population. What margin of error can the nutritionist reasonably expect from this sample?
library(StudyPlanning)
library(ggplot2)
SampSize <- interval_t1(E=NULL, n=50, s=seq(10, 25, by=1), alpha=c(.10, .05, .01))
ggplot(SampSize, aes(x=s, y=E, colour=factor(alpha))) +
  geom_line()
Using the 95% confidence interval with the initial estimate of the standard deviation ($s=20$), the researcher can expect to estimate the protein intake to within `r SampSize$E[SampSize$s==20 & SampSize$alpha==.05]` grams. With a standard deviation of $s=15$, the margin of error would reduce to `r SampSize$E[SampSize$s==15 & SampSize$alpha==.05]` grams.
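These two margins of error can be checked directly from the formula for $E$ in base R, without the package. This is a sketch under the assumption that the margin is computed with a $t$ critical value on $n-1$ degrees of freedom; the values should come out near 5.7 and 4.3 grams.

```r
n <- 50
alpha <- 0.05
s <- c(20, 15)

# Margin of error for each standard deviation at the 95% level
E <- qt(1 - alpha / 2, df = n - 1) * s / sqrt(n)
data.frame(s = s, E = round(E, 2))
```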
A more likely scenario is that we know we may be able to afford to sample 75 girls, and we know that we wish to estimate the protein intake to within 5 grams. What we may not know is the standard deviation of protein intake. In this case, we may prefer to explore the magnitude of the standard deviation. If a very small standard deviation is required, the study may not be feasible.
library(StudyPlanning)
library(ggplot2)
SampSize <- interval_t1(E=seq(3, 7, by=.01), n=75, s=NULL, alpha=c(.10, .05, .01))
ggplot(SampSize, aes(x=E, y=s, colour=factor(alpha))) +
  geom_line()
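Based on these results, we may conclude that we can estimate the protein intake to within 5 grams with 90% confidence as long as the observed standard deviation is less than about 26. At 95% confidence the standard deviation must be less than about 22, and at 99% confidence less than about 16.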
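The largest tolerable standard deviation for a 5 gram margin with 75 subjects can be sketched in base R by solving the margin of error formula for $s$. The variable names are chosen for this example only.

```r
n <- 75
E <- 5
alpha <- c(0.10, 0.05, 0.01)

# Solve E = t * s / sqrt(n) for s at each confidence level
s_max <- E * sqrt(n) / qt(1 - alpha / 2, df = n - 1)
data.frame(confidence = 1 - alpha, s_max = round(s_max, 1))
```

A higher confidence level uses a larger critical value, so it leaves less room for variability in the data.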
Suppose we know we can afford to sample 75 girls, and we expect the standard deviation to be between 15 and 25 (based on a confidence interval from a previous study). Let us consider the level of confidence we can have in our estimate based on these parameters.
library(StudyPlanning)
library(ggplot2)
SampSize <- interval_t1(E=c(5, 7.5, 10), n=75, s=15:25, alpha=NULL)
ggplot(SampSize, aes(x=s, y=1-alpha, colour=factor(E))) +
  geom_line() +
  ylab("Confidence")
The results show that our confidence is very high when we estimate to within 10 grams. If we are estimating to within 5 grams, our confidence may dip below 95%, but we can expect it to remain above 90%.
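The confidence levels for the 5 gram margin can be spot-checked in base R by solving the margin of error formula for $\alpha$ with the $t$ distribution function `pt`. This is a sketch at three standard deviations from the stated range, not a reproduction of the package output.

```r
n <- 75
E <- 5
s <- c(15, 20, 25)

# alpha = 2 * [1 - P(T <= E * sqrt(n) / s)], with T on n - 1 degrees of freedom
alpha <- 2 * (1 - pt(E * sqrt(n) / s, df = n - 1))
data.frame(s = s, confidence = round(1 - alpha, 3))
```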
If a one-sided confidence interval is desired, we only need to multiply our desired alpha by 2. Thus, for a one-sided confidence interval of the original example, only 46 subjects are needed (as shown below).
library(StudyPlanning)
interval_t1(E=5, s=20, alpha=.05*2)
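The doubling works because a one-sided interval places all of $\alpha$ in a single tail, so its critical value $t_{1-\alpha}$ is the same as the two-sided critical value for $2\alpha$. A base R sketch of the comparison follows; `n_for_margin` and its `tail_prob` argument are hypothetical names used only here.

```r
# Hypothetical helper: smallest integer n whose margin of error is at most E,
# where tail_prob is the probability left in the upper tail of the t distribution
n_for_margin <- function(E, s, tail_prob) {
  n <- 2
  while (qt(1 - tail_prob, df = n - 1) * s / sqrt(n) > E) n <- n + 1
  n
}

alpha <- 0.05
n_two_sided <- n_for_margin(E = 5, s = 20, tail_prob = alpha / 2)
n_one_sided <- n_for_margin(E = 5, s = 20, tail_prob = alpha)  # same as 2 * alpha, two-sided
c(two_sided = n_two_sided, one_sided = n_one_sided)
```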
$$E = t_{1-\alpha/2} \cdot \frac{s}{\sqrt{n}}$$ $$\frac{E}{t_{1-\alpha/2}} = \frac{s}{\sqrt{n}}$$ $$\frac{E}{t_{1-\alpha/2} \cdot s} = \frac{1}{\sqrt{n}}$$ $$\frac{t_{1-\alpha/2} \cdot s}{E} = \sqrt{n}$$ $$\frac{t_{1-\alpha/2}^2 \cdot s^2}{E^2} = n$$
Since $t_{1-\alpha/2}$ depends on the value of $n$, this equation has no closed-form solution for $n$. Many texts encourage using $z_{1-\alpha/2}$ as a substitute, but we're using computers here, so we can probably do a little better. Instead, we rewrite the last line as: $$\frac{t_{1-\alpha/2}^2 \cdot s^2}{E^2} - n = 0$$ $$\Big(\frac{t_{1-\alpha/2}^2 \cdot s^2}{E^2} - n\Big)^2 = 0$$
The squared expression is never negative and equals zero exactly at the solution, so solving for $n$ becomes a one-dimensional minimization problem. We'll use the `optimize` function in the `stats` package to find a best solution for $n$.
Consider when we have $n=25$, $s=4$ and $\alpha=.05$. The value of $E$ here is $$E = t_{1-\alpha/2} \cdot \frac{s}{\sqrt{n}} = 2.063899 \cdot 4/5 = 1.651119$$.
Now let's rewrite the problem to solve for $n$ using `optimize`.
# Squared difference between the sample size formula and n; zero at the solution
fn <- function(n) (qt(1-.05/2, n-1)^2 * 4^2 / 1.651119^2 - n)^2
optimize(fn, c(0, 100))
On the other hand, using the $z$ approximation yields
qnorm(1-.05/2)^2 * 4^2 / 1.651119^2
which is two subjects short of what we would actually need. `n_t1samp_interval` uses the `optimize` function and searches over the values 0 to 1,000,000,000.
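As an alternative sketch, the unsquared equation can be handed to `uniroot` from the `stats` package, which looks for the zero directly instead of minimizing the squared difference. This is not how the package is described as working; the search interval `c(2, 1000)` is an assumption chosen for this example.

```r
# Worked example from the text: n = 25, s = 4, alpha = .05
s <- 4
alpha <- 0.05
E <- qt(1 - alpha / 2, df = 25 - 1) * s / sqrt(25)

# Difference between the sample size formula and n; zero at the required n
f <- function(n) qt(1 - alpha / 2, df = n - 1)^2 * s^2 / E^2 - n

# f is positive at n = 2 and negative at n = 1000, so a root lies in between
root <- uniroot(f, interval = c(2, 1000), tol = 1e-8)$root

# z-based approximation for comparison
n_z <- qnorm(1 - alpha / 2)^2 * s^2 / E^2
c(t_based = round(root, 3), z_based = round(n_z, 3))
```

Because `f` is strictly decreasing in $n$, the root is unique, which avoids the need to square the expression.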
$$E = t_{1-\alpha/2} \cdot \frac{s}{\sqrt{n}}$$ $$\frac{E}{t_{1-\alpha/2}} = \frac{s}{\sqrt{n}}$$ $$\frac{E \cdot \sqrt{n}}{t_{1-\alpha/2}} = s$$
Here $\Phi_t$ denotes the cumulative distribution function of the $t$ distribution with $n-1$ degrees of freedom.
$$E = t_{1-\alpha/2} \cdot \frac{s}{\sqrt{n}}$$ $$\frac{E \cdot \sqrt{n}}{s} = t_{1-\alpha/2}$$ $$\Phi_t\Big(\frac{E \cdot \sqrt{n}}{s}\Big) = \Phi_t(t_{1-\alpha/2})$$ $$\Phi_t\Big(\frac{E \cdot \sqrt{n}}{s}\Big) = 1 - \frac{\alpha}{2}$$ $$1 - \Phi_t\Big(\frac{E \cdot \sqrt{n}}{s}\Big) = \frac{\alpha}{2}$$ $$2 \cdot \Big[1 - \Phi_t\Big(\frac{E \cdot \sqrt{n}}{s}\Big)\Big] = \alpha$$
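The closed-form expressions for $s$ and $\alpha$ can be sanity-checked with a round trip in base R: compute $E$ from known values, then recover $s$ and $\alpha$ from it. The sketch assumes `pt` plays the role of the $t$ distribution function with $n-1$ degrees of freedom; the variable names are illustrative.

```r
n <- 25
s <- 4
alpha <- 0.05
df <- n - 1

# Forward direction: margin of error from n, s and alpha
E <- qt(1 - alpha / 2, df) * s / sqrt(n)

# Inverse directions, following the two derivations
s_back <- E * sqrt(n) / qt(1 - alpha / 2, df)
alpha_back <- 2 * (1 - pt(E * sqrt(n) / s, df))

c(E = E, s = s_back, alpha = alpha_back)
```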
Daniel, Wayne W., Biostatistics: A Foundation for Analysis in the Health Sciences, John Wiley & Sons, Inc., New York, 4th ed., 2005 (Chapter 6).