# Simulation-based inference with mosaic

# Setup: load packages and set lattice/knitr options
library(mosaic)
library(mosaicData)
require(grDevices)
require(datasets)
require(stats)
require(lattice)
require(grid)
require(vcd)
require(knitr)
trellis.par.set(theme = col.mosaic(bw = FALSE))
trellis.par.set(fontsize = list(text = 9))
options(keep.blank.line = FALSE)
options(width = 70)
### Lock problem 1: Bootstrapping a mean (Mustang prices) {-}

A student collected data on the selling prices for a sample of 25 Mustang cars being offered for sale at an internet website. Use these data to construct a 90% confidence interval for the mean price (in \$1,000's) of used Mustangs.

Students should learn to examine data before undertaking formal inference. Any of the standard R tools can be used for this. We advocate the use of ggformula-style graphics functions for two reasons:

1. They use a syntax that is consistent with the syntax used in mosaic.
2. They extend naturally to more complicated settings.

gf_histogram( ~ Price, data = Mustangs)

The basic ggformula syntax here --- operator, "formula" involving ~, and the use of data = to specify the relevant data set --- will be used throughout mosaic. For example, here is the sample mean:

mean( ~ Price, data = Mustangs)

Those familiar with R may wonder why the statement has not been written as

mean(Mustangs$Price)


You can, of course, carry out exactly the same calculation this way. By using

mean( ~ Price, data = Mustangs)


we are making graphical and numerical summaries fit into a common template and anticipating the next steps, including the introduction of additional operations such as randomization and modeling.
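
The same template extends to other numerical summaries. As a brief sketch (sd() and favstats() are mosaic functions that accept the same formula interface):

sd( ~ Price, data = Mustangs)        # same template as mean()
favstats( ~ Price, data = Mustangs)  # several summaries at once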

Resampling, also known as random selection with replacement, is an important operation in randomization. The resample() function performs this operation. It can be illustrated in a very basic way on a small set of numbers:

simple <- c(1, 2, 3, 4, 5)
resample(simple)
resample(simple)
resample(simple)


When applied to a data frame, resample() selects random rows with replacement. It's easy to demonstrate this with a statement such as:

resample(Mustangs)

For a reproducible look at the first few rows of one such resample:

set.seed(104)
head(resample(Mustangs))


By reference to the case numbers in the left column, you can see that case 19 has been selected twice.
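
To see the duplication directly, one quick sketch uses base R's table() on resampled row indices (nrow(Mustangs) is 25 here):

table(resample(1:nrow(Mustangs)))  # counts how many times each case appears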

One resampling trial of the mean can be carried out with

mean( ~ Price, data = resample(Mustangs))


Even though a single trial is of little use, it's a nice idea to have students do the calculation to show that they are (usually) getting a different result with each resample.

Another trial can be carried out by repeating the command:

mean( ~ Price, data = resample(Mustangs))


Let's generate five more, with a single command:

do(5) * mean( ~ Price, data = resample(Mustangs))


Now conduct 2000 resampling trials^[2000 is a reasonable number of trials for illustration purposes, but Monte Carlo variability can be considerably reduced by using 10,000 or more.], saving the results in an object called mustangs_price_boot:^[It is a good idea to adopt a naming convention for your bootstrap and randomization distributions. Repeatedly using the same name (e.g., Trials) can become very confusing and lead to errors from misidentifying which distribution one has. The name we have chosen here makes it clear which data set and variable are being considered, and that we are creating a bootstrap distribution (as opposed to simulating a null distribution).]

mustangs_price_boot <- do(2000) * mean( ~ Price, data = resample(Mustangs))
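
It can be useful to peek at the result before plotting (head() is base R; do() returns a data frame with one row per trial and a column named mean):

head(mustangs_price_boot)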


Plotting this resampling distribution is straightforward, e.g.:

gf_histogram( ~ mean, data = mustangs_price_boot,
  xlab = "Mean Mustang Price (in thousand dollars)")


The simplest way to translate this distribution into a confidence interval is to apply the confint() operator:

confint(mustangs_price_boot, level = 0.90, method = "quantile")
confint(mustangs_price_boot, level = 0.90, method = "stderr")


The confint() function is a black box. In introducing it, an instructor might well want to show the calculations behind the confidence interval in some detail using general-purpose operations. confint() implements two basic approaches, one based on quantiles of the distribution and the other using normal theory.

1. Calculation of the 90% confidence interval using quantiles:

qdata( ~ mean, c(.05, .95), data = mustangs_price_boot)
# alternative: cdata() returns the central 90% of the distribution directly
cdata( ~ mean, 0.90, data = mustangs_price_boot)


2. Using normal theory and the standard error.

First calculate the critical value $t_\star$ for the appropriate degrees of freedom. Or use $z_\star$, either for simplicity or to demonstrate how $t_\star$ differs. In either case, the confidence level needs to be converted to the appropriate tail probability: for a 90% interval, 5% of the probability lies in each tail, so the 0.95 quantile is the one required.

tstar <- qt(.95, df = 24)  # 25 cars in the sample, so 24 degrees of freedom
zstar <- qnorm(.95)


The resulting margin of error will be

tstar * sd( ~ mean, data = mustangs_price_boot)
zstar * sd( ~ mean, data = mustangs_price_boot)


There's little reason to repeat this set of commands every time one needs a confidence interval. That's where confint() comes in: it abstracts the operation, allows the user to specify a level and a method ("quantile" or "stderr"), and avoids the need to convert the level into a tail probability. The extent to which one should repeat the detailed steps of the margin-of-error calculation is a decision the instructor can make depending on local circumstances and the instructor's goals. R and mosaic support both approaches.
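
For completeness, the interval itself can be assembled from these pieces. Here is a minimal sketch, assuming the convention of centering at the mean of the bootstrap distribution (confint() may center slightly differently, so results can differ in the last digit):

center <- mean( ~ mean, data = mustangs_price_boot)   # center of the bootstrap distribution
me <- tstar * sd( ~ mean, data = mustangs_price_boot) # margin of error from above
center + c(-1, 1) * me                                # approximate 90% confidence interval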

### Lock problem 2: Testing a proportion (NFL Overtimes) {-}

The National Football League (NFL) uses an overtime period to determine a winner for games that are tied at the end of regulation time. The first team to score in overtime wins the game. A coin flip is used to determine which team gets the ball first. Is there an advantage to winning the coin flip? Data from the 1974 through 2009 seasons show that the coin flip winner won 240 of the 428 games where a winner was determined in overtime. Treat these as a sample of NFL games to test whether there is sufficient evidence to show that the proportion of overtime games won by the coin flip winner is more than one half.

By and large, introductory students understand that, if first possession of the ball is unrelated to the game outcome, one would expect to see about half of the 428 games won by the team that won the coin flip. The question is whether the observed 240 wins is "about" half of the 428 games. Statistical inference compares the 240 to the $428/2 = 214$ in the context of sampling variation.
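
For reference, the observed proportion is simple arithmetic:

240 / 428  # about 0.561, versus 0.5 under the null hypothesis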

**Using the built-in binomial distribution operators.**

Generate a simulation where each trial is a random sample of 428 games from a world in which the null hypothesis holds true, and see what proportion of trials give a result as extreme as or more extreme than the observed 240.

prop( ~ rbinom(1000, prob = 0.5, size = 428) >= 240)


The result indicates that it's very unlikely, if the null were true, that the coin flip winner would win 240 or more times.

The exact result differs slightly from one simulation to the next, but that variation can be made small by making the number of trials larger.

prop( ~ rbinom(1000, prob = 0.5, size = 428) >= 240)
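
For instance, here is a sketch with many more trials (100,000 is an arbitrary choice; larger numbers take longer but stabilize the estimate):

prop( ~ rbinom(100000, prob = 0.5, size = 428) >= 240)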


In this case, it is tempting to avoid the variation due to simulation entirely by doing a deterministic probability calculation rather than the simulation. This exact reproducibility comes at a cost, though: the need to customize the cut-off point to reflect which tail of the distribution corresponds to "as or more extreme than 240." On the right side, this means subtracting 1 from the observed number, since $P(X \ge 240) = 1 - P(X \le 239)$.

xpbinom(239, prob = 0.5, size = 428)
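
Equivalently, the right-tail probability can be computed with base R's pbinom():

1 - pbinom(239, size = 428, prob = 0.5)  # P(X >= 240) under the null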


Of course, R can handle all of these little details and do the entire calculation for us if we use the binom.test() function.

binom.test(x = 240, n = 428)
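
Note that binom.test() defaults to a two-sided alternative. Since the question here is one-sided, the alternative can be specified explicitly:

binom.test(x = 240, n = 428, alternative = "greater")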


Even if you want to use the deterministic probability calculation (using either pbinom() or binom.test()), you might want to illustrate the logic with the random-number generator.

**Explicitly simulating a coin flip.**

Recognizing that coin flips are a staple of statistics courses, the mosaic package offers a function that simulates random coin tosses without the need to refer to the binomial distribution. Here is one trial, consisting of 428 coin flips:

do(1) * rflip(428)


We'll do 2000 trials and count in what fraction of the trials the coin-toss winner (say, "heads") wins 240 or more of the 428 games:

nfl_null <- do(2000) * rflip(428)
prop( ~ heads >= 240, data = nfl_null)

gf_histogram( ~ heads, fill = ~ (heads >= 240), data = nfl_null)


The observed pattern of 240 wins is not a likely outcome under the null hypothesis. The shading added through the fill = option helps to visually reinforce the result.

### Lock problem 3: Permutation test of means from two groups (sleep and memory) {-}

In an experiment on memory (Mednick et al., 2008), students were given lists of 24 words to memorize. After hearing the words, they were assigned at random to different groups. One group of 12 students took a nap for 1.5 hours, while a second group of 12 students stayed awake and was given a caffeine pill. The results below display the number of words each participant was able to recall after the break. Test whether the data indicate a difference in mean number of words recalled between the two treatments.

data(Sleep)


The Sleep group tends to have remembered somewhat more words on average:

mean(Words ~ Group, data = Sleep)
obs <- diff(mean(Words ~ Group, data = Sleep))
obs

gf_boxplot(Words ~ Group, data = Sleep)


To implement the null hypothesis, scramble the Group with respect to the outcome, Words:

diff(mean(Words ~ shuffle(Group), data = Sleep))


That's just one trial. Let's try again:

diff(mean(Words ~ shuffle(Group), data = Sleep))


To get the distribution under the null hypothesis, we carry out many trials:

set.seed(134) # make sure the result below is "typical"

sleep_null <- do(2000) * diff(mean(Words ~ shuffle(Group), data = Sleep))
gf_histogram( ~ Sleep, fill = ~ (Sleep >= obs), data = sleep_null,
  binwidth = 0.4,
  xlab = "Distribution of difference in means under the null hypothesis")


The one-sided p-value for this test is the proportion of trials yielding a result as extreme as or more extreme than the observed difference. Here only `r sum(sleep_null$Sleep >= obs)` of the 2000 trials gave a result that large, so the p-value is `r prop( ~ sleep_null$Sleep >= obs)`. We conclude that it is unlikely that the two groups have the same mean word recall in their respective populations.
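
In code, using the objects defined above:

sum(sleep_null$Sleep >= obs)              # trials at least as large as the observed difference
prop( ~ Sleep >= obs, data = sleep_null)  # one-sided p-value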

### Lock problem 4: Bootstrapping a correlation {-}

The data on Mustang prices in Problem #1 also include the number of miles each car had been driven (in thousands). Find a 95% confidence interval for the correlation between price and mileage.

We can follow essentially the same workflow to obtain a confidence interval for other statistics, including the correlation coefficient.^[Naive confidence intervals based on percentiles or bootstrap standard errors from small samples may not be as good as other methods (for example, the bootstrap t is preferred when generating a confidence interval for a mean), and the ability to easily create simulated bootstrap distributions does not guarantee that the resulting intervals will have the desired properties. Alternative methods and diagnostics are important topics not covered here.]

cor(Price ~ Miles, data = Mustangs)

mustangs_cor_boot <- do(2000) * cor(Price ~ Miles, data = resample(Mustangs))
quantiles <- qdata( ~ cor, c(.025, .975), data = mustangs_cor_boot)
quantiles

mustangs_hist <- mutate(mustangs_cor_boot,
  colorval = cut(cor, c(-Inf, quantiles, Inf),
                 labels = c("Lower", "Middle", "Upper")))
gf_histogram( ~ cor, data = mustangs_hist, fill = ~ colorval, n = 50)
confint(mustangs_cor_boot)


## Moving beyond means and proportions

The Lock problems refer to the descriptive statistics encountered in many introductory applied statistics courses: means, proportions, differences in means and proportions. There are advantages, however, in introducing such basic statistics in the context of a more general framework: linear models.

Usually, people think about linear models in the context of regression. For instance, using the Lock Mustang price dataset, there is presumably a relationship between the price of a car and its mileage. Here's a graph:

gf_point(Price ~ Miles, data = Mustangs) %>%
  gf_lm()


There's a pretty clear relationship here, which in the Lock problems was quantified with the correlation coefficient.

In regression, one finds a straight-line functional relationship between the two variables. The built-in R lm() function does the calculations:

lm(Price ~ Miles, data = Mustangs)
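
The same resampling workflow extends to model coefficients. A minimal sketch, continuing the do() and resample() pattern used above (do() collects the fitted coefficients, one row per trial):

mustangs_lm_boot <- do(2000) * lm(Price ~ Miles, data = resample(Mustangs))
confint(mustangs_lm_boot)  # bootstrap intervals for the intercept and the Miles coefficient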


