require(lubridate)
require(dplyr)
require(mosaic)
require(mosaicData)
theme_set(theme_bw())
trellis.par.set(theme = col.mosaic())
require(knitr)
opts_chunk$set(
  size = 'tiny',
  tidy = FALSE,
  fig.width = 6,
  fig.height = 3,
  fig.align = "center",
  out.width = "70%"
)
options(width = 90)
The mosaic package
An NSF-funded project to develop a new way to introduce mathematics, statistics, computation and modeling to students in colleges and universities.
More information at mosaic-web.org
The mosaic package is available via CRAN.
This document was originally created as an R presentation to be used as slides
accompanying various presentations. It has been converted into a more traditional
document for use as a vignette in the mosaic package.
The examples below use the mosaic and mosaicData packages. An earlier version
of this document used lattice graphics, but it has been updated to use ggformula.
library(mosaic) # loads mosaicData and ggformula as well
Many of the guiding principles of the mosaic package reflect the
"Less Volume, More Creativity" mantra of Mike McCarthy who had a large
poster with those words placed in the "war room" (where assistant coaches
decide on the game plan for the upcoming opponent) as a constant reminder
not to add too much complexity to the game plan.
"A lot of times you end up putting in a lot more volume, because you are teaching fundamentals and you are teaching concepts that you need to put in, but you may not necessarily use because they are building blocks for other concepts and variations that will come off of that ... In the offseason you have a chance to take a step back and tailor it more specifically towards your team and towards your players."
Mike McCarthy, former Head Coach, Green Bay Packers
Here is another elegant phrasing of a similar principle.
Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.
--- Antoine de Saint-Exupery (writer, poet, pioneering aviator)
One key to successfully introducing R is finding a set of commands that is
It is not enough to use R, it must be used elegantly.
Two examples of this principle:
Goal: a minimal set of R commands for Intro Stats
Result: Minimal R Vignette (vignette("MinimalR"))
Much of the work on the mosaic package has been motivated by
If you (or your students) are just getting started with R, it is good to keep the following in mind:
The following template is important because we can do so much with it.
It is useful to name the components of the template:
We're hiding a bit of complexity in the template, and there will be times
that we will want to gussy things up a bit. We'll indicate that by adding
... to the end of the template. Just don't let ... become a distractor
early on.
Here are some variations on the template.
# simpler version
goal(~ x, data = mydata)
# fancier version
goal(y ~ x | z , data = mydata)
# unified version
goal(formula, data = mydata)
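To make the anatomy of these variations concrete, here is a small base R sketch that inspects the formulas themselves. It assumes nothing beyond base R; `x`, `y`, and `z` are the placeholder names from the template, not real variables, and `f_simple` and `f_fancy` are names made up for this illustration.

```r
# Formulas are ordinary R objects; the variables in them are not looked up
# until a function such as goal() uses the formula together with a data frame.
f_simple <- ~ x          # one-sided formula: nothing to the left of the tilde
f_fancy  <- y ~ x | z    # two-sided formula with a conditioning variable

all.vars(f_simple)   # names mentioned in the simpler version
all.vars(f_fancy)    # names mentioned in the fancier version

length(f_simple)     # 2: the tilde and a right-hand side
length(f_fancy)      # 3: the tilde, a left-hand side, and a right-hand side
```

Seeing that `~ x` and `y ~ x | z` are the same kind of object may help explain why the "unified version" of the template can take any of them.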
Using the template generally requires answering two questions. (These questions are useful in the context of nearly all computer tools, just substitute "the computer" in for R in the questions.)
gf_point(births ~ date, data = Births78)
Command: gf_point()
The variables: births ~ date
The data: data = Births78
Use ?Births78 for documentation
gf_point(births ~ date, data = Births78)
gf_boxplot(age ~ substance, data = HELPrct, xlab = "substance")
Some things you will need to know:
Command: gf_boxplot()
The data: HELPrct
Variables: age, substance
Use ?HELPrct for info about data
gf_boxplot(age ~ substance, data = HELPrct)
gf_boxploth(substance ~ age, data = HELPrct)
Some things you will need to know:
Command: gf_boxploth() for horizontal boxplots
The data: HELPrct
Variables: age, substance
Use ?HELPrct for info about data
gf_boxploth(substance ~ age, data = HELPrct)
Note that we have reversed which variable is mapped to the x-axis and which
to the y-axis by reversing their order in the formula and using
gf_boxploth() instead of gf_boxplot().
gf_histogram(~ age, data = HELPrct)
Note: When there is one variable it is on the right side of the formula.
gf_histogram( ~ age, data = HELPrct)
gf_density( ~ age, data = HELPrct)
gf_boxplot( ~ age, data = HELPrct)
gf_qq( ~ age, data = HELPrct)
gf_freqpoly( ~ age, data = HELPrct)
gf_point(i1 ~ age, data = HELPrct)
gf_boxplot(age ~ substance, data = HELPrct)
Note: i1 is the average number of drinks (standard units)
consumed per day in the past 30 days (measured at baseline).
One variable: gf_histogram(), gf_qq(), gf_density(), gf_freqpoly()
Two variables: gf_point(), gf_line(), gf_boxplot()
Create a plot of your own choosing with one of these data sets
names(KidsFeet)    # 4th graders' feet
?KidsFeet
names(Utilities)   # utility bill data
?Utilities
require(NHANES)    # load package
names(NHANES)      # body shape, etc.
?NHANES
Add color = ~ group or fill = ~ group to overlay with different colors.
Use y ~ x | z to create multipanel plots.
Here is an example.
gf_density( ~ age | sex, data = HELPrct, fill = ~ substance)
Beginners can create plots with 3 or 4 variables easily and quickly using this template.
The ggformula graphics system includes lots of bells and whistles including
I used to introduce these too early. My current approach:
library(lubridate)
Births78 <- Births78 %>% mutate(weekday = wday(date, label = TRUE, abbr = TRUE))
gf_line(births ~ date, color = ~ weekday, data = Births78)
Notes
wday() is in the lubridate package.
The mosaic package provides functions that make it simple to create
numerical summaries using the same template used for graphing (and later for
describing linear models).
Big idea:
gf_histogram( ~ age, data = HELPrct)  # binwidth = 5 (or 10) might be good here
mean( ~ age, data = HELPrct)
The mosaic package includes formula-aware versions of mean(), sd(), var(), min(), max(), sum(), IQR(), ...
It also provides favstats() to compute our favorites.
favstats( ~ age, data = HELPrct)
favstats() quickly becomes a go-to function in our courses.
df_stats() is similar, but
df_stats( ~ age, data = HELPrct)
df_stats( ~ age, data = HELPrct, mean, sd, median, iqr)
tally(~ sex, data = HELPrct)
tally(~ substance, data = HELPrct)
df_stats(~ substance, data = HELPrct, counts, props)
There are three ways to think about this. All do the same thing.
sd(age ~ substance, data = HELPrct)
sd(~ age | substance, data = HELPrct)
sd(~ age, groups = substance, data = HELPrct)
# note option color = ~ substance is used for graphics
sd(~ age, groups = substance, data = HELPrct)
This makes it possible to easily convert three different types of plots into the (same) corresponding numerical summary.
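For comparison, here is roughly what a grouped summary like the ones above computes, written in base R. This is only a sketch for intuition, not how mosaic implements it. As an assumption to keep it runnable without add-on packages, the built-in `chickwts` data (`weight` by `feed`) stands in for `HELPrct` (`age` by `substance`).

```r
# Assumption: chickwts (built into R) is a stand-in for HELPrct.
# Standard deviation of weight within each feed group
sd_by_feed <- tapply(chickwts$weight, chickwts$feed, sd)
sd_by_feed

# The same summary through the formula interface of aggregate(),
# which follows the familiar y ~ x, data = ... pattern
sd_table <- aggregate(weight ~ feed, data = chickwts, FUN = sd)
sd_table
```

The base R versions work, but each has its own calling convention, which is the kind of extra volume the single template is meant to avoid.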
df_stats() can also be used with multiple variables
and provides a different output format.
df_stats(age ~ substance, data = HELPrct, sd)
2-way tables are just tallies of 2 variables.
tally(sex ~ substance, data = HELPrct)
tally( ~ sex + substance, data = HELPrct)
df_stats(sex ~ substance, data = HELPrct, counts)
Other output formats are available
tally(sex ~ substance, data = HELPrct, format = "proportion")
tally(substance ~ sex, data = HELPrct, format = "proportion", margins = TRUE)
tally(~ sex + substance, data = HELPrct, format = "proportion", margins = TRUE)
tally(sex ~ substance, data = HELPrct, format = "percent")
df_stats(sex ~ substance, data = HELPrct, props, percs)
HELPrct <- mutate(HELPrct,
                  sex = factor(sex, labels = c('F','M')),
                  substance = factor(substance, labels = c('A', 'C', 'H')))
mean(age ~ substance | sex, data = HELPrct)
mean(age ~ substance | sex, data = HELPrct, .format = "table")
rm(HELPrct)
data(HELPrct)
mutate() (in the dplyr package) or transform().
median(), min(), max(), sd(), var(), favstats(), etc.
This master template can be used to do a large portion of what needs doing in an Intro Stats course.
mean(age ~ sex, data = HELPrct)
gf_boxplot(age ~ sex, data = HELPrct)
lm(age ~ sex, data = HELPrct)
mean(age ~ sex, data = HELPrct)
coef(lm(age ~ sex, data = HELPrct))
It can be learned early and practiced often so that students become secure in their ability to use these functions.
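The pairing of a group-means call with `coef(lm(...))` is worth spelling out: with a two-level categorical predictor and R's default treatment contrasts, the intercept is the mean of the first group and the other coefficient is the difference between the two group means. The sketch below checks this in base R. As an assumption, the built-in `ToothGrowth` data (`len` by `supp`) stands in for `HELPrct` (`age` by `sex`) so that no add-on packages are needed.

```r
# Assumption: ToothGrowth (built into R) is a stand-in for HELPrct.
group_means <- tapply(ToothGrowth$len, ToothGrowth$supp, mean)
model_coefs <- coef(lm(len ~ supp, data = ToothGrowth))

group_means   # mean tooth length for OJ and for VC
model_coefs   # intercept and the suppVC coefficient

# Intercept = mean of the first level (OJ)
all.equal(unname(model_coefs[1]), unname(group_means["OJ"]))
# Second coefficient = mean of VC minus mean of OJ
all.equal(unname(model_coefs[2]),
          unname(group_means["VC"] - group_means["OJ"]))
```

Because the same formula appears in the summary and in the model, students can see the connection without learning new syntax.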
The mosaic package includes some other things, too
Data sets (mosaicData and NHANES packages)
xchisq.test(), xpnorm(), xqqmath()
mplot()
mplot(HELPrct): interactive plot creation
plot() in some situations
gf_histogram() controls (e.g., binwidth)
gf_refine()
xpnorm(700, mean = 500, sd = 100)
xpnorm(c(300, 700), mean = 500, sd = 100)
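The probabilities behind these two calls can be checked with base R's `pnorm()`. This is a minimal sketch using the same mean and standard deviation; the variable names are made up for the illustration.

```r
# P(X <= 700) when X is normal with mean 500 and sd 100 (z-score of 2)
p_below_700 <- pnorm(700, mean = 500, sd = 100)
p_below_700    # about 0.9772

# Lower-tail probabilities at 300 and 700 (z-scores of -2 and 2)
p_below_both <- pnorm(c(300, 700), mean = 500, sd = 100)

# Probability of landing between 300 and 700
p_between <- diff(p_below_both)
p_between      # about 0.9545
```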
phs <- cbind(c(104,189),c(10933,10845))
colnames(phs) <- c("heart attack","no heart attack")
rownames(phs) <- c("aspirin","placebo")
xchisq.test(phs)
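For readers who want to see some of the pieces of this test separately, here is a sketch using base R's `chisq.test()` on the same table. It shows only what base R returns; it is not a description of what `xchisq.test()` prints.

```r
# Rebuild the 2 x 2 table so the sketch is self-contained
phs <- cbind(c(104, 189), c(10933, 10845))
colnames(phs) <- c("heart attack", "no heart attack")
rownames(phs) <- c("aspirin", "placebo")

result <- chisq.test(phs)

result$observed   # the counts we started with
result$expected   # counts expected if treatment and outcome were independent
result$p.value    # p-value (with the continuity correction R applies to 2 x 2 tables)
```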
Modeling is really the starting point for the mosaic design.
Modeling functions (lm() and glm()) defined the template
lattice graphics use the template (so we chose lattice)
model <- lm(width ~ length * sex, data = KidsFeet)
Width <- makeFun(model)
Width(length = 25, sex = "B")
Width(length = 25, sex = "G")
Once models have been converted into functions, we can easily add them
to our plots using gf_fun().
gf_point(width ~ length, data = KidsFeet, color = ~ sex) %>%
  gf_fun(Width(length, sex = "B") ~ length, color = ~"B") %>%
  gf_fun(Width(length, sex = "G") ~ length, color = ~"G")
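To show what "converting a model into a function" means, here is a base R sketch that wraps `predict()` by hand. It illustrates the idea only and is not how `makeFun()` is implemented. As assumptions, the built-in `ToothGrowth` data stands in for `KidsFeet`, and `Len` is a name made up for this example.

```r
# Assumption: ToothGrowth (built into R) is a stand-in for KidsFeet.
model <- lm(len ~ dose * supp, data = ToothGrowth)

# Hypothetical helper: a function of the predictors that returns the fitted value
Len <- function(dose, supp) {
  new_data <- data.frame(
    dose = dose,
    supp = factor(supp, levels = levels(ToothGrowth$supp))
  )
  unname(predict(model, newdata = new_data))
}

Len(dose = 1, supp = "OJ")
Len(dose = 1, supp = "VC")
```

Once the model is an ordinary function of its predictors, evaluating it at particular values or plotting it is no different from working with any other function.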
theme_set(theme_bw())
If you want to teach using randomization tests and bootstrap intervals,
the mosaic package provides some functions to simplify creating the
random distributions involved.
Often used on first day of class
Story
woman claims she can tell whether milk has been poured into tea or vice versa.
Question: How do we test this claim?
require(mosaic)
trellis.par.set(theme = col.mosaic())
theme_set(theme_bw())
require(knitr)
opts_chunk$set(size = 'small', cache = TRUE)
options(width = 90)
set.seed(12345)
We use rflip() to simulate flipping coins
rflip()
Note: We do this with students who do not (yet) know what a binomial
distribution is, so we want to avoid using rbinom() at this point.
Rather than flip each coin separately, we can flip multiple coins at once.
rflip(10)
It is simpler to treat heads = correct and tails = incorrect than to compare with a given pattern.
rflip(10) simulates 1 lady tasting 10 cups 1 time.
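For instructors curious about what such a simulation amounts to, here is a base R sketch of ten fair coin flips using `sample()`. It is a stand-in for illustration, not the implementation of `rflip()`, and the variable names are made up.

```r
set.seed(12345)   # only so that the sketch is reproducible

# Ten independent flips of a fair coin
flips <- sample(c("H", "T"), size = 10, replace = TRUE)
flips

# Number of heads, i.e. the number of cups a guessing lady gets right
n_heads <- sum(flips == "H")
n_heads
```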
We can do that many times to see how guessing ladies do:
do(2) * rflip(10)
do() is clever about what it remembers (in many common situations)
Now let's simulate 5000 guessing ladies
Ladies <- do(5000) * rflip(10)
head(Ladies, 2)
gf_histogram(~ heads, data = Ladies, binwidth = 1)
Q. How often does guessing score 9 or 10?
Here are 3 ways to find out
tally( ~ (heads >= 9), data = Ladies)
tally( ~ (heads >= 9), data = Ladies, format = "prop")
prop( ~ (heads >= 9), data = Ladies)
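For readers who already know the binomial distribution, the simulated proportion can be compared with the exact probability. This is a check for the instructor rather than something to show students on the first day, and it assumes that guessing means each cup is right with probability 0.5.

```r
# Exact probability that a guesser gets 9 or 10 of 10 cups right:
# (choose(10, 9) + choose(10, 10)) / 2^10 = 11 / 1024
p_exact <- sum(dbinom(9:10, size = 10, prob = 0.5))
p_exact   # about 0.0107
```

With 5000 simulated ladies, the simulated proportion should land close to this value, though it will vary from run to run.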
The Lady Tasting Tea illustrates a 3-step process that can be reused in many situations:
Do it lots of times for "random" data
definition of "random" is important, but can often be handled by the mosaic
functions shuffle() or resample()
diffmean(age ~ sex, data = HELPrct)
do(1) * diffmean(age ~ shuffle(sex), data = HELPrct)
Null <- do(5000) * diffmean(age ~ shuffle(sex), data = HELPrct)
prop( ~ (abs(diffmean) > 0.7841), data = Null)
gf_histogram( ~ diffmean, data = Null) %>%
  gf_vline(xintercept = -0.7841)
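The shuffle-and-repeat logic above can also be written out in base R, which may help when explaining what the simulation is doing. This is a sketch of the same idea, not the mosaic implementation. As assumptions, the built-in `ToothGrowth` data (`len` by `supp`) stands in for `HELPrct`, and `diff_in_means()` is a helper made up for this example.

```r
set.seed(12345)   # only so that the sketch is reproducible

# Hypothetical helper: mean of the second group minus mean of the first
diff_in_means <- function(y, g) {
  m <- tapply(y, g, mean)
  unname(m[2] - m[1])
}

# Step 1: the test statistic for the real data
observed <- diff_in_means(ToothGrowth$len, ToothGrowth$supp)

# Step 2: the same statistic for many shuffled copies of the group labels
null_diffs <- replicate(
  5000,
  diff_in_means(ToothGrowth$len, sample(ToothGrowth$supp))
)

# Step 3: how often is a shuffled difference at least as extreme as the observed one?
p_two_sided <- mean(abs(null_diffs) >= abs(observed))
observed
p_two_sided
```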
Bootstrap <- do(5000) * diffmean(age ~ sex, data = resample(HELPrct))
gf_histogram( ~ diffmean, data = Bootstrap) %>%
  gf_vline(xintercept = -0.7841)
cdata( ~ diffmean, data = Bootstrap, p = 0.95)
confint(Bootstrap, method = "quantile")
confint(Bootstrap)  # default uses bootstrap st. err.
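A percentile bootstrap interval like the quantile-based one above can be sketched in base R as follows. This shows the general technique, not the mosaic implementation. As assumptions, the built-in `ToothGrowth` data stands in for `HELPrct`, and `diff_in_means()` is a helper made up for this example.

```r
set.seed(12345)   # only so that the sketch is reproducible

# Hypothetical helper: mean of the second group minus mean of the first
diff_in_means <- function(y, g) {
  m <- tapply(y, g, mean)
  unname(m[2] - m[1])
}

# Resample whole rows with replacement and recompute the statistic each time
boot_diffs <- replicate(5000, {
  rows <- sample(nrow(ToothGrowth), replace = TRUE)
  resampled <- ToothGrowth[rows, ]
  diff_in_means(resampled$len, resampled$supp)
})

# Percentile interval: the middle 95% of the bootstrap statistics
ci_95 <- quantile(boot_diffs, probs = c(0.025, 0.975))
ci_95
```

The key contrast with the shuffling sketch is that rows are resampled together, so the link between group and response is kept rather than broken.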
do(1) * lm(width ~ length, data = KidsFeet)
do(3) * lm(width ~ shuffle(length), data = KidsFeet)
do(1) * lm(width ~ length + sex, data = KidsFeet)
do(3) * lm(width ~ length + shuffle(sex), data = KidsFeet)
Null <- do(5000) * lm(width ~ length + shuffle(sex), data = KidsFeet)
gf_histogram( ~ sexG, data = Null, boundary = -0.2325) %>%
  gf_vline(xintercept = -0.2325)
gf_histogram(~ sexG, data = Null, boundary = -0.2325) %>%
  gf_vline(xintercept = -0.2325)
prop(~ (sexG <= -0.2325), data = Null)
More mosaic resources can be found at https://www.mosaic-web.org/mosaic/articles/mosaic-resources.html.
The R Journal paper entitled "mosaic Package: Helping Students to 'Think with Data' Using R" (https://journal.r-project.org/archive/2017/RJ-2017-024/index.html) provides further discussion of the mosaic modeling language and approach to teaching.